Using Simpson's Paradox to Discover Interesting Patterns in Behavioral Data

Published by Kristina Lerman
Publication date: 2018
Research field: Informatics Engineering
Language: English





We describe a data-driven discovery method that leverages Simpson's paradox to uncover interesting patterns in behavioral data. Our method systematically disaggregates data to identify subgroups within a population whose behavior deviates significantly from the rest of the population. Given an outcome of interest and a set of covariates, the method follows three steps. First, it disaggregates data into subgroups, by conditioning on a particular covariate, so as to minimize the variation of the outcome within the subgroups. Next, it models the outcome as a linear function of another covariate, both in the subgroups and in the aggregate data. Finally, it compares trends to identify disaggregations that produce subgroups with different behaviors from the aggregate. We illustrate the method by applying it to three real-world behavioral datasets, including the Q&A site Stack Exchange and the online learning platforms Khan Academy and Duolingo.
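The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the names `x`, `y`, `z`, `slope`, and `compare_trends` are hypothetical, and equal-count quantile bins of the conditioning covariate stand in for the variance-minimizing disaggregation mentioned in the abstract. The linear fit uses NumPy's `polyfit`.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y regressed on x (degree-1 polynomial fit)."""
    return np.polyfit(x, y, 1)[0]

def compare_trends(x, y, z, n_bins=4):
    """Return the aggregate slope of y on x and the slopes within bins of z.

    Assumption: bins are equal-count quantile bins of z, a simple stand-in
    for a disaggregation chosen to minimize within-subgroup variation.
    """
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    # Interior edges map every point to a bin index in 0..n_bins-1.
    labels = np.digitize(z, edges[1:-1])
    sub_slopes = []
    for b in range(n_bins):
        mask = labels == b
        if mask.sum() >= 2:  # a line needs at least two points
            sub_slopes.append(slope(x[mask], y[mask]))
    return slope(x, y), sub_slopes

# Synthetic data (illustrative only): z drives both x and y upward,
# while within a fixed z the outcome y decreases with x.
rng = np.random.default_rng(0)
z = np.repeat([0.0, 1.0, 2.0, 3.0], 200)
x = z + rng.normal(0.0, 0.3, z.size)
y = 3.0 * z - x + rng.normal(0.0, 0.1, z.size)

agg, sub = compare_trends(x, y, z)
# A reversal: every subgroup trend has the opposite sign of the aggregate trend.
reversal = bool(agg != 0 and all(np.sign(s) == -np.sign(agg) for s in sub))
print("aggregate slope:", agg)
print("subgroup slopes:", sub)
print("trend reversal:", reversal)
```

In this toy setup the aggregate trend is positive while every subgroup trend is negative, which is the kind of disagreement the final comparison step is meant to flag. A fuller treatment would also test whether each slope is statistically significant before comparing signs.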




Read also

Identifying the factors that influence academic performance is an essential part of educational research. Previous studies have documented the importance of personality traits, class attendance, and social network structure. Because most of these analyses were based on a single behavioral aspect and/or small sample sizes, there is currently no quantification of the interplay of these factors. Here, we study the academic performance among a cohort of 538 undergraduate students forming a single, densely connected social network. Our work is based on data collected using smartphones, which the students used as their primary phones for two years. The availability of multi-channel data from a single population allows us to directly compare the explanatory power of individual and social characteristics. We find that the most informative indicators of performance are based on social ties and that network indicators result in better model performance than individual characteristics (including both personality and class attendance). We confirm earlier findings that class attendance is the most important predictor among individual characteristics. Finally, our results suggest the presence of strong homophily and/or peer effects among university students.
Simpson's paradox, or the Yule-Simpson effect, arises when a trend appears in different subsets of data but disappears or reverses when these subsets are combined. We describe here seven cases of this phenomenon for chemo-kinematical relations believed to constrain the Milky Way disk formation and evolution. We show that interpreting trends in relations, such as the radial and vertical chemical abundance gradients, the age-metallicity relation, and the metallicity-rotational velocity relation (MVR), can lead to conflicting conclusions about the Galaxy's past if analyses marginalize over stellar age and/or birth radius. It is demonstrated that the MVR in RAVE giants is consistent with being always strongly negative, when narrow bins of [Mg/Fe] are considered. This is directly related to the negative radial metallicity gradients of stars grouped by common age (mono-age populations) due to the inside-out disk formation. The effect of the asymmetric drift can then give rise to a positive MVR trend in high-[alpha/Fe] stars, with a slope dependent on a given survey's selection function and observational uncertainties. We also study the variation of lithium abundance, A(Li), with [Fe/H] of AMBRE:HARPS dwarfs. A strong reversal in the positive A(Li)-[Fe/H] trend of the total sample is found for mono-age populations, flattening for younger groups of stars. Dissecting by birth radius shows strengthening in the positive A(Li)-[Fe/H] trend, shifting to higher [Fe/H] with decreasing birth radius; these observational results suggest new constraints on chemical evolution models. This work highlights the necessity for precise age estimates for large stellar samples covering wide spatial regions.
We investigate how Simpson's paradox affects analysis of trends in social data. According to the paradox, the trends observed in data that has been aggregated over an entire population may be different from, and even opposite to, those of the underlying subgroups. Failure to take this effect into account can lead analysis to wrong conclusions. We present a statistical method to automatically identify Simpson's paradox in data by comparing statistical trends in the aggregate data to those in the disaggregated subgroups. We apply the approach to data from Stack Exchange, a popular question-answering platform, to analyze factors affecting answerer performance, specifically, the likelihood that an answer written by a user will be accepted by the asker as the best answer to his or her question. Our analysis confirms a known Simpson's paradox and identifies several new instances. These paradoxes provide novel insights into user behavior on Stack Exchange.
Recommendation systems are often evaluated based on users' interactions that were collected from an existing, already deployed recommendation system. In this situation, users only provide feedback on the exposed items and they may not leave feedback on other items since they have not been exposed to them by the deployed system. As a result, the collected feedback dataset that is used to evaluate a new model is influenced by the deployed system, as a form of closed loop feedback. In this paper, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox. Simpson's paradox is the name given to a phenomenon observed when a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined together. Our in-depth experiments based on stratified sampling reveal that a very small minority of items that are frequently exposed by the deployed system acts as a confounding factor in the offline evaluation of recommendation systems. In addition, we propose a novel evaluation methodology that takes into account the confounder, i.e., the deployed system's characteristics. Using the relative comparison of many recommendation models as in the typical offline evaluation of recommender systems, and based on the Kendall rank correlation coefficient, we show that our proposed evaluation methodology exhibits statistically significant improvements of 14% and 40% on the examined open loop datasets (Yahoo! and Coat), respectively, in reflecting the true ranking of systems with an open loop (randomised) evaluation in comparison to the standard evaluation.
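To make the role of the Kendall rank correlation coefficient concrete, here is a minimal sketch of how agreement between two rankings of the same models could be measured. The function `kendall_tau` is a plain tau-a computation that assumes no ties, and the scores for the five models are hypothetical numbers invented for illustration, not results from the paper.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif s < 0:
            discordant += 1   # the pair is ordered in opposite ways
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five recommendation models (illustrative only).
open_loop = [0.30, 0.25, 0.20, 0.15, 0.10]  # reference ranking
offline_a = [0.50, 0.40, 0.45, 0.20, 0.30]  # two pairs swapped vs. reference
offline_b = [0.28, 0.26, 0.19, 0.16, 0.09]  # same order as the reference

tau_a = kendall_tau(open_loop, offline_a)
tau_b = kendall_tau(open_loop, offline_b)
print(tau_a, tau_b)
```

A value closer to 1 means the evaluation ranks the models more like the reference ranking does; in this toy case the second evaluation agrees with the reference on every pair, while the first disagrees on two of the ten pairs.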
Andrew Fowlie, 2020
We consider the Jeffreys-Lindley paradox from an objective Bayesian perspective by attempting to find priors representing complete indifference to sample size in the problem. This means that we ensure that the prior for the unknown mean and the prior predictive for the $t$-statistic are independent of the sample size. If successful, this would lead to Bayesian model comparison that was independent of sample size and ameliorate the paradox. Unfortunately, it leads to an improper scale-invariant prior for the unknown mean. We show, however, that a truncated scale-invariant prior delays the dependence on sample size, which could be practically significant. Lastly, we shed light on the paradox by relating it to the fact that the scale-invariant prior is improper.
