Conceptual tests are widely used by physics instructors to assess students' conceptual understanding and to compare teaching methods. It is common to look at changes in students' answers between a pre-test and a post-test to quantify transitions in students' conceptions. This is often done by looking at the proportion of incorrect answers on the pre-test that become correct on the post-test (the gain) and the proportion of correct answers that become incorrect (the loss). By comparing theoretical predictions to experimental data on the Force Concept Inventory, we show that Item Response Theory (IRT) predicts the observed gains and losses fairly well. We then use IRT to quantify students' answer changes in a test-retest situation in which no learning occurs and show that (i) up to 25% of all answers can change because of the non-deterministic nature of students' answers and that (ii) gains and losses can range from 0% to 100%. Still using IRT, we highlight the conditions a test must satisfy in order to minimize gains and losses when no learning occurs. Finally, we propose recommendations for interpreting such pre/post-test progressions with respect to students' initial level.
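As a worked illustration of the test-retest prediction, the sketch below (Python with NumPy; the item parameters and ability value are hypothetical, not fitted FCI values) computes the gain, loss, and answer-change rate for a single item under a standard two-parameter logistic (2PL) model when the pre- and post-test responses are two independent draws at the same ability, i.e., when no learning occurs.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) probability of a correct answer."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def test_retest_gain_loss(theta, a, b):
    """Predicted gain, loss, and answer-change rate for one item when the
    pre- and post-test responses are two independent draws at the same
    ability theta (i.e., no learning between the two administrations)."""
    p = p_correct(theta, a, b)
    gain = p                       # P(correct at post | incorrect at pre)
    loss = 1.0 - p                 # P(incorrect at post | correct at pre)
    change = 2.0 * p * (1.0 - p)   # P(the answer flips in either direction)
    return gain, loss, change

# Hypothetical item (discrimination a=1.2, difficulty b=0.0) and ability theta=-0.5
print(test_retest_gain_loss(theta=-0.5, a=1.2, b=0.0))
```

Because the two administrations are independent given ability in this sketch, the predicted gain for an item equals the probability of answering it correctly and the loss equals one minus that probability, which is why both quantities can range from 0% to 100% across items and students.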
Ishimoto, Davenport, and Wittmann have previously reported analyses of data from student responses to the Force and Motion Conceptual Evaluation (FMCE), in which they used item response curves (IRCs) to make claims about American and Japanese students' relative likelihood of choosing certain incorrect responses to some questions. We have used an independent data set of over 6,500 American students' responses to the FMCE to generate IRCs to test their claims. Converting the IRCs to vectors, we used dot product analysis to compare each response item quantitatively. For most questions, our analyses are consistent with those of Ishimoto, Davenport, and Wittmann, with some results suggesting smaller differences between American and Japanese students than previously reported. We also highlight the pedagogical advantages of using IRCs to determine differences in the response patterns of different populations to better understand student thinking prior to instruction.
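The comparison step can be sketched as follows, assuming each IRC has already been reduced to the fraction of students at each total-score level who chose a given response option; the binning, variable names, and simulated data are illustrative, not the authors' code or data.

```python
import numpy as np

def response_curve(scores, chose_response, bins):
    """Fraction of students at each total-score level who selected a given
    response option; this vector is one item response curve (IRC)."""
    curve = np.zeros(len(bins))
    for i, s in enumerate(bins):
        at_level = scores == s
        curve[i] = chose_response[at_level].mean() if at_level.any() else 0.0
    return curve

def irc_similarity(curve_a, curve_b):
    """Normalized dot product (cosine similarity) between two IRCs;
    values near 1 indicate similar response patterns across ability levels."""
    return np.dot(curve_a, curve_b) / (np.linalg.norm(curve_a) * np.linalg.norm(curve_b))

# Illustrative data: total scores 0..10 and a 0/1 indicator of choosing one response option
bins = np.arange(11)
rng = np.random.default_rng(0)
scores = rng.integers(0, 11, size=500)
chose_option = (rng.random(500) < 0.3).astype(float)
curve = response_curve(scores, chose_option, bins)
print(irc_similarity(curve, curve))  # 1.0 for identical curves
```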
Measures of face identification proficiency are essential to ensure accurate and consistent performance by professional forensic face examiners and others who perform face identification tasks in applied scenarios. Current proficiency tests rely on static sets of stimulus items and so cannot be administered validly to the same individual multiple times. To create a proficiency test, a large number of items of known difficulty must be assembled. Multiple tests of equal difficulty can then be constructed using subsets of items. Here, we introduce a proficiency test, the Triad Identity Matching (TIM) test, based on stimulus difficulty measures derived from Item Response Theory (IRT). Participants view face-image triads (N=225; two images of one identity and one image of a different identity) and select the different identity. In Experiment 1, university students (N=197) showed wide-ranging accuracy on the TIM test. Furthermore, IRT modeling demonstrated that the TIM test produces items of various difficulty levels. In Experiment 2, IRT-based item difficulty measures were used to partition the TIM test into three equally easy and three equally difficult subsets. Simulation results indicated that the full set, as well as the curated subsets, of TIM items yielded reliable estimates of subject ability. In summary, the TIM test can provide a starting point for developing a framework that is flexible, calibrated, and adaptive to measure proficiency across various ability levels (e.g., professionals or populations with face processing deficits).
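One way to sketch the subset-construction idea is a balanced assignment over estimated item difficulties: sort the items by their IRT difficulty parameters and deal them round-robin into subsets of comparable average difficulty. The difficulties below are simulated, not the calibrated TIM parameters, and the procedure only illustrates the general approach rather than the authors' exact partitioning method.

```python
import numpy as np

def split_by_difficulty(difficulties, n_subsets):
    """Deal items into n_subsets in order of estimated IRT difficulty so
    that each subset covers a comparable range of difficulties."""
    order = np.argsort(difficulties)
    return [order[i::n_subsets] for i in range(n_subsets)]

# Simulated difficulty estimates for 225 items (stand-in for calibrated item parameters)
rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, size=225)
subsets = split_by_difficulty(b, n_subsets=3)
print([round(b[s].mean(), 3) for s in subsets])  # mean difficulty of each subset
```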
Research-based assessment instruments (RBAIs) are ubiquitous throughout both physics instruction and physics education research. The vast majority of analyses involving student responses to RBAI questions have focused on whether or not a student selects correct answers and on using correctness to measure growth. This approach often undervalues the rich information that may be obtained by examining students' particular choices of incorrect answers. In the present study, we aim to reveal some of this valuable information by quantitatively determining the relative correctness of various incorrect responses. To accomplish this, we propose an assumption that allows us to define relative correctness: students who have a high understanding of Newtonian physics are more likely to answer questions correctly, and also more likely to choose better incorrect responses, than students who have a low understanding. Analyses using item response theory align with this assumption, and Bock's nominal response model allows us to uniquely rank each incorrect response. We present results from over 7,000 students' responses to the Force and Motion Conceptual Evaluation.
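Under Bock's nominal response model, the probability of selecting option k of an item is a softmax of option-specific slope and intercept parameters, so a distractor with a larger slope is favored by higher-ability students and can be read as a "better" incorrect response. The sketch below uses made-up parameter values for a hypothetical four-option item.

```python
import numpy as np

def nominal_response_probs(theta, slopes, intercepts):
    """Bock's nominal response model: P(choice k | theta) is a softmax of
    a_k * theta + c_k over the item's response options."""
    z = np.asarray(slopes) * theta + np.asarray(intercepts)
    z -= z.max()                      # numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Hypothetical 4-option item: option 0 is correct, options 1-3 are distractors
slopes = np.array([1.5, 0.6, -0.2, -1.9])
intercepts = np.array([0.2, 0.5, 0.1, -0.8])

for theta in (-2.0, 0.0, 2.0):
    print(theta, np.round(nominal_response_probs(theta, slopes, intercepts), 3))

# Ranking distractors by slope: a larger a_k means the option is favored by
# higher-ability students, so it is read as the "more correct" wrong answer.
print(np.argsort(-slopes[1:]) + 1)
```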
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR and SQuAD2.0, is effective at differentiating between strong and weak models.
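The paper's analysis fits an IRT model to the model-by-example response matrix; as a rough, non-IRT proxy for the same question (how well individual examples separate strong from weak models), the sketch below computes the correlation between each example's correctness and each model's overall accuracy on simulated data. It is an illustration of the idea, not the authors' method.

```python
import numpy as np

def item_discrimination(correct):
    """correct: (n_models, n_examples) 0/1 matrix of per-example correctness.
    Returns the correlation between each example's correctness and the models'
    overall accuracy -- a rough proxy for how well an example separates
    strong models from weak ones."""
    ability = correct.mean(axis=1)                  # each model's overall accuracy
    centered = correct - correct.mean(axis=0)
    cov = centered.T @ (ability - ability.mean()) / correct.shape[0]
    with np.errstate(invalid="ignore", divide="ignore"):
        r = cov / (correct.std(axis=0) * ability.std())
    return np.nan_to_num(r)    # examples every model gets right (or wrong) get 0

# Illustrative response matrix: 18 models x 1000 examples, 1PL-style simulation
rng = np.random.default_rng(2)
skill = np.sort(rng.normal(0, 1, size=18))
difficulty = rng.normal(0, 1, size=1000)
p = 1 / (1 + np.exp(-(skill[:, None] - difficulty[None, :])))
correct = (rng.random((18, 1000)) < p).astype(float)
print(np.round(item_discrimination(correct).mean(), 3))  # dataset-level summary
```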
This paper presents the first item response theory (IRT) analysis of the national data set on introductory, general education, college-level astronomy teaching using the Light and Spectroscopy Concept Inventory (LSCI). We used the difference between students' pre- and post-instruction IRT-estimated abilities as a measure of learning gain. This analysis provides deeper insights than prior publications into both the LSCI as an instrument and the effectiveness of teaching and learning in introductory astronomy courses. Our IRT analysis supports the classical test theory findings of prior studies using the LSCI with this population. In particular, we found that students in classes that used active learning strategies at least 25% of the time had average IRT-estimated learning gains approximately 1 logit larger than students in classes that spent less time on active learning strategies. We also found that instructors who want their classes to achieve an average improvement in abilities of $\Delta\theta = 1$ logit must spend at least 25% of class time on active learning strategies. However, our analysis also powerfully illustrates how little insight into student learning is revealed by a single measure of learning gain, such as average $\Delta\theta$. Educators and researchers should also examine the distributions of students' abilities pre- and post-instruction in order to understand how many students actually achieved an improvement in their abilities and whether or not a majority of students have moved to post-instruction abilities significantly greater than the national average.
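The learning-gain measure here is the change in the IRT ability estimate between pre- and post-test. Below is a minimal sketch, assuming fixed 2PL item parameters and a maximum-likelihood estimate of theta obtained with SciPy; the item parameters and response vectors are made up, not the calibrated LSCI values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta, responses, a, b):
    """Negative 2PL log-likelihood of a single student's 0/1 response vector."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

def estimate_theta(responses, a, b):
    """Maximum-likelihood ability estimate for one student, given fixed item parameters."""
    return minimize_scalar(neg_log_likelihood, bounds=(-4, 4),
                           args=(responses, a, b), method="bounded").x

# Hypothetical 26-item instrument with made-up discrimination and difficulty parameters
rng = np.random.default_rng(3)
a = rng.uniform(0.7, 2.0, size=26)
b = rng.normal(0.0, 1.0, size=26)
pre = (rng.random(26) < 0.35).astype(float)    # illustrative pre-test responses
post = (rng.random(26) < 0.65).astype(float)   # illustrative post-test responses
delta_theta = estimate_theta(post, a, b) - estimate_theta(pre, a, b)
print(round(delta_theta, 2))   # learning gain in logits
```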