No Arabic abstract
We study a data analysts problem of acquiring data from self-interested individuals to obtain an accurate estimation of some statistic of a population, subject to an expected budget constraint. Each data holder incurs a cost, which is unknown to the data analyst, to acquire and report his data. The cost can be arbitrarily correlated with the data. The data analyst has an expected budget that she can use to incentivize individuals to provide their data. The goal is to design a joint acquisition-estimation mechanism to optimize the performance of the produced estimator, without any prior information on the underlying distribution of cost and data. We investigate two types of estimations: unbiased point estimation and confidence interval estimation. Unbiased estimators: We design a truthful, individually rational, online mechanism to acquire data from individuals and output an unbiased estimator of the population mean when the data analyst has no prior information on the cost-data distribution and individuals arrive in a random order. The performance of this mechanism matches that of the optimal mechanism, which knows the true cost distribution, within a constant factor. The performance of an estimator is evaluated by its variance under the worst-case cost-data correlation. Confidence intervals: We characterize an approximately optimal (within a factor $2$) mechanism for obtaining a confidence interval of the population mean when the data analyst knows the true cost distribution at the beginning. This mechanism is efficiently computable. We then design a truthful, individually rational, online algorithm that is only worse than the approximately optimal mechanism by a constant factor. The performance of an estimator is evaluated by its expected length under the worst-case cost-data correlation.
We study the problem of selling a good to a group of bidders with interdependent values in a prior-free setting. Each bidder has a signal that can take one of $k$ different values, and her value for the good is a weakly increasing function of all the bidders signals. The bidders are partitioned into $ell$ expertise-groups, based on how their signal can impact the values for the good, and we prove upper and lower bounds regarding the approximability of social welfare and revenue for a variety of settings, parameterized by $k$ and $ell$. Our lower bounds apply to all ex-post incentive compatible mechanisms and our upper bounds are all within a small constant of the lower bounds. Our main results take the appealing form of ascending clock auctions and provide strong incentives by admitting the desired outcomes as obvious ex-post equilibria.
We consider the problem of purchasing data for machine learning or statistical estimation. The data analyst has a budget to purchase datasets from multiple data providers. She does not have any test data that can be used to evaluate the collected data and can assign payments to data providers solely based on the collected datasets. We consider the problem in the standard Bayesian paradigm and in two settings: (1) data are only collected once; (2) data are collected repeatedly and each days data are drawn independently from the same distribution. For both settings, our mechanisms guarantee that truthfully reporting ones dataset is always an equilibrium by adopting techniques from peer prediction: pay each provider the mutual information between his reported data and other providers reported data. Depending on the data distribution, the mechanisms can also discourage misreports that would lead to inaccurate predictions. Our mechanisms also guarantee individual rationality and budget feasibility for certain underlying distributions in the first setting and for all distributions in the second setting.
When the data are stored in a distributed manner, direct application of traditional statistical inference procedures is often prohibitive due to communication cost and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has large enough sample size. Moreover, they do not require good initialization and enjoy linear converge guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.
We study the problem of repeatedly auctioning off an item to one of $k$ bidders where: a) bidders have a per-round individual rationality constraint, b) bidders may leave the mechanism at any point, and c) the bidders valuations are adversarially chosen (the prior-free setting). Without these constraints, the auctioneer can run a second-price auction to sell the business and receive the second highest total value for the entire stream of items. We show that under these constraints, the auctioneer can attain a constant fraction of the sell the business benchmark, but no more than $2/e$ of this benchmark. In the course of doing so, we design mechanisms for a single bidder problem of independent interest: how should you repeatedly sell an item to a (per-round IR) buyer with adversarial valuations if you know their total value over all rounds is $V$ but not how their value changes over time? We demonstrate a mechanism that achieves revenue $V/e$ and show that this is tight.
Quantization has emerged as one of the most prevalent approaches to compress and accelerate neural networks. Recently, data-free quantization has been widely studied as a practical and promising solution. It synthesizes data for calibrating the quantized model according to the batch normalization (BN) statistics of FP32 ones and significantly relieves the heavy dependency on real training data in traditional quantization methods. Unfortunately, we find that in practice, the synthetic data identically constrained by BN statistics suffers serious homogenization at both distribution level and sample level and further causes a significant performance drop of the quantized model. We propose Diverse Sample Generation (DSG) scheme to mitigate the adverse effects caused by homogenization. Specifically, we slack the alignment of feature statistics in the BN layer to relax the constraint at the distribution level and design a layerwise enhancement to reinforce specific layers for different data samples. Our DSG scheme is versatile and even able to be applied to the state-of-the-art post-training quantization method like AdaRound. We evaluate the DSG scheme on the large-scale image classification task and consistently obtain significant improvements over various network architectures and quantization methods, especially when quantized to lower bits (e.g., up to 22% improvement on W4A4). Moreover, benefiting from the enhanced diversity, models calibrated by synthetic data perform close to those calibrated by real data and even outperform them on W4A4.