No Arabic abstract
In many settings, an effective way of evaluating objects of interest is to collect evaluations from dispersed individuals and to aggregate these evaluations together. Some examples are categorizing online content and evaluating student assignments via peer grading. For this data science problem, one challenge is to motivate participants to conduct such evaluations carefully and to report them honestly, particularly when doing so is costly. Existing approaches, notably peer-prediction mechanisms, can incentivize truth telling in equilibrium. However, they also give rise to equilibria in which agents do not pay the costs required to evaluate accurately, and hence fail to elicit useful information. We show that this problem is unavoidable whenever agents are able to coordinate using low-cost signals about the items being evaluated (e.g., text labels or pictures). We then consider ways of circumventing this problem by comparing agents reports to ground truth, which is available in practice when there exist trusted evaluators---such as teaching assistants in the peer grading scenario---who can perform a limited number of unbiased (but noisy) evaluations. Of course, when such ground truth is available, a simpler approach is also possible: rewarding each agent based on agreement with ground truth with some probability, and unconditionally rewarding the agent otherwise. Surprisingly, we show that the simpler mechanism achieves stronger incentive guarantees given less access to ground truth than a large set of peer-prediction mechanisms.
We consider the problem of purchasing data for machine learning or statistical estimation. The data analyst has a budget to purchase datasets from multiple data providers. She does not have any test data that can be used to evaluate the collected data and can assign payments to data providers solely based on the collected datasets. We consider the problem in the standard Bayesian paradigm and in two settings: (1) data are only collected once; (2) data are collected repeatedly and each days data are drawn independently from the same distribution. For both settings, our mechanisms guarantee that truthfully reporting ones dataset is always an equilibrium by adopting techniques from peer prediction: pay each provider the mutual information between his reported data and other providers reported data. Depending on the data distribution, the mechanisms can also discourage misreports that would lead to inaccurate predictions. Our mechanisms also guarantee individual rationality and budget feasibility for certain underlying distributions in the first setting and for all distributions in the second setting.
In the setting where we ask participants multiple similar possibly subjective multi-choice questions (e.g. Do you like Bulbasaur? Y/N; do you like Squirtle? Y/N), peer prediction aims to design mechanisms that encourage honest feedback without verification. A series of works have successfully designed multi-task peer prediction mechanisms where reporting truthfully is better than any other strategy (dominantly truthful), while they require an infinite number of tasks. A recent work proposes the first multi-task peer prediction mechanism, Determinant Mutual Information (DMI)-Mechanism, where not only is dominantly truthful but also works for a finite number of tasks (practical). However, few works consider how to optimize the multi-task peer prediction mechanisms. In addition to the definition of optimization goal, the biggest challenge is we do not have space for optimization since there is only a single practical and dominantly truthful mechanism. This work addresses this problem by proposing a tractable effort incentive optimization goal and generalizing DMI-Mechanism to a new family of practical, dominantly truthful mechanisms, Volume Mutual Information (VMI)-Mechanisms. We show that DMI-Mechanism may not be optimal. But we can construct a sequence of VMI-Mechanisms that are approximately optimal. The main technical tool is a novel family of mutual information measures, Volume Mutual Information, which generalizes Determinant Mutual Information. We construct VMI by a simple geometric idea: we measure how informative a distribution is by measuring the volume of distributions that is less informative than it (inappropriately, its similar to measuring how clever a person is by counting the number of people that are less clever than he/she).
We propose measurement integrity, a property related to ex post reward fairness, as a novel desideratum for peer prediction mechanisms in many applications, including peer assessment. We operationalize this notion to evaluate the measurement integrity of different mechanisms in computational experiments. Our evaluations simulate the application of peer prediction mechanisms to peer assessment---a setting in which realistic models have been validated on real data and in which ex post fairness concerns are quite salient. We find that peer prediction mechanisms, as proposed in the literature, largely fail to demonstrate measurement integrity in our experiments. However, we also find that certain mechanisms can be supplemented with realistic parametric statistical models to improve their measurement integrity. In the same setting, we also evaluate an empirical notion of robustness against strategic behavior to complement the theoretical analyses of robustness against strategic behavior that have been the main focus of the peer prediction literature. In this dimension of analysis, we again find that supplementing certain mechanisms with parametric statistical models can improve their empirical performance. Even so, though, we find that theoretical guarantees of robustness against strategic behavior are somewhat noisy predictors of empirical robustness. As a whole, our empirical methodology for quantifying desirable mechanism properties facilitates a more nuanced comparison between mechanisms than theoretical analysis alone. Ultimately, we find there is a trade-off between our two dimensions of analysis. The best performing mechanisms for measurement integrity are highly susceptible to strategic behavior. On the other hand, certain parametric peer prediction mechanisms are robust against all the strategic manipulations we consider while still achieving reasonable measurement integrity.
Recent advances in multi-task peer prediction have greatly expanded our knowledge about the power of multi-task peer prediction mechanisms. Various mechanisms have been proposed in different settings to elicit different types of information. But we still lack understanding about when desirable mechanisms will exist for a multi-task peer prediction problem. In this work, we study the elicitability of multi-task peer prediction problems. We consider a designer who has certain knowledge about the underlying information structure and wants to elicit certain information from a group of participants. Our goal is to infer the possibility of having a desirable mechanism based on the primitives of the problem. Our contribution is twofold. First, we provide a characterization of the elicitable multi-task peer prediction problems, assuming that the designer only uses scoring mechanisms. Scoring mechanisms are the mechanisms that reward participants reports for different tasks separately. The characterization uses a geometric approach based on the power diagram characterization in the single-task setting ([Lambert and Shoham, 2009, Frongillo and Witkowski, 2017]). For general mechanisms, we also give a necessary condition for a multi-task problem to be elicitable. Second, we consider the case when the designer aims to elicit some properties that are linear in the participants posterior about the state of the world. We first show that in some cases, the designer basically can only elicit the posterior itself. We then look into the case when the designer aims to elicit the participants posteriors. We give a necessary condition for the posterior to be elicitable. This condition implies that the mechanisms proposed by Kong and Schoenebeck are already the best we can hope for in their setting, in the sense that their mechanisms can solve any problem instance that can possibly be elicitable.
Crowdsourcing is a popular paradigm for soliciting forecasts on future events. As people may have different forecasts, how to aggregate solicited forecasts into a single accurate prediction remains to be an important challenge, especially when no historical accuracy information is available for identifying experts. In this paper, we borrow ideas from the peer prediction literature and assess the prediction accuracy of participants using solely the collected forecasts. This approach leverages the correlations among peer reports to cross-validate each participants forecasts and allows us to assign a peer assessment score (PAS) for each agent as a proxy for the agents prediction accuracy. We identify several empirically effective methods to generate PAS and propose an aggregation framework that uses PAS to identify experts and to boost existing aggregators prediction accuracy. We evaluate our methods over 14 real-world datasets and show that i) PAS generated from peer prediction methods can approximately reflect the prediction accuracy of agents, and ii) our aggregation framework demonstrates consistent and significant improvement in the prediction accuracy over existing aggregators for both binary and multi-choice questions under three popular accuracy measures: Brier score (mean square error), log score (cross-entropy loss) and AUC-ROC.