Automatic Classifiers as Scientific Instruments: One Step Further Away from Ground-Truth


Abstract

Automatic machine learning-based detectors of various psychological and social phenomena (e.g., emotion, stress, engagement) have great potential to advance basic science. However, when a detector $d$ is trained to approximate an existing measurement tool (e.g., a questionnaire, observation protocol), then care must be taken when interpreting measurements collected using $d$, since they are one step further removed from the underlying construct. We examine how the accuracy of $d$, as quantified by the correlation $q$ of $d$'s outputs with the ground-truth construct $U$, impacts the estimated correlation between $U$ (e.g., stress) and some other phenomenon $V$ (e.g., academic performance). In particular: (1) We show that if the true correlation between $U$ and $V$ is $r$, then the expected sample correlation, over all vectors $\mathcal{T}^n$ whose correlation with $U$ is $q$, is $qr$. (2) We derive a formula for the probability that the sample correlation (over $n$ subjects) using $d$ is positive given that the true correlation is negative (and vice versa); this probability can be substantial (around $20$--$30\%$) for values of $n$ and $q$ that have been used in recent affective computing studies. (3) With the goal of reducing the variance of correlations estimated by an automatic detector, we show that training multiple neural networks $d^{(1)}, \ldots, d^{(m)}$ using different training architectures and hyperparameters for the same detection task provides only limited ``coverage'' of $\mathcal{T}^n$.
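As an informal numerical check of claims (1) and (2), the sketch below (not part of the paper; the sampling construction, the parameter values $n=50$, $q=0.6$, $r=-0.3$, and the function names are illustrative assumptions) repeatedly draws $n$ subjects with true correlation $r$ between $U$ and $V$, generates a surrogate detector-output vector whose sample correlation with $U$ is exactly $q$, and records its sample correlation with $V$. The average should land near $qr$, and the fraction of runs whose sign disagrees with $r$ gives a Monte Carlo estimate of the sign-flip probability discussed in (2).

```python
import numpy as np

rng = np.random.default_rng(0)

def corr(a, b):
    """Pearson sample correlation."""
    return np.corrcoef(a, b)[0, 1]

def sample_detector_outputs(u, q, rng):
    """Draw one length-n vector whose *sample* correlation with u is exactly q,
    by mixing the standardized u with random noise orthogonalized against it.
    (Illustrative construction, not the paper's method.)"""
    uc = u - u.mean()
    uc /= np.linalg.norm(uc)
    w = rng.standard_normal(len(u))
    w -= w.mean()
    w -= (w @ uc) * uc            # remove the component along u
    w /= np.linalg.norm(w)
    return q * uc + np.sqrt(1.0 - q**2) * w

# Illustrative (assumed) parameter values, not taken from the paper.
n, q, r, trials = 50, 0.6, -0.3, 20000
cov = [[1.0, r], [r, 1.0]]

corrs = np.empty(trials)
for t in range(trials):
    # n subjects whose true (population) correlation between U and V is r.
    U, V = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    corrs[t] = corr(sample_detector_outputs(U, q, rng), V)

print(f"mean corr(T, V)          = {corrs.mean():+.3f}  (compare q*r = {q*r:+.3f})")
print(f"fraction with wrong sign = {np.mean(np.sign(corrs) != np.sign(r)):.3f}")
```

With these assumed settings the wrong-sign fraction comes out noticeably above zero, which is the qualitative point of claim (2); the precise probabilities reported in the paper come from its closed-form formula, not from a simulation like this one.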
