Automated decision support can accelerate tedious tasks as users can focus their attention where it is needed most. However, a key concern is whether users overly trust or cede agency to automation. In this paper, we investigate the effects of introducing automation to annotating clinical texts--a multi-step, error-prone task of identifying clinical concepts (e.g., procedures) in medical notes, and mapping them to labels in a large ontology. We consider two forms of decision aid: recommending which labels to map concepts to, and pre-populating annotation suggestions. Through laboratory studies, we find that 18 clinicians generally build intuition of when to rely on automation and when to exercise their own judgement. However, when presented with fully pre-populated suggestions, these expert users exhibit less agency: accepting improper mentions, and taking less initiative in creating additional annotations. Our findings inform how systems and algorithms should be designed to mitigate the observed issues.
Training individuals to make accurate decisions from medical images is a critical component of education in diagnostic pathology. We describe a joint experimental and computational modeling approach to examine the similarities and differences in the cognitive processes of novice participants and experienced participants (pathology residents and pathology faculty) in cancer cell image identification. For this study we collected a bank of hundreds of digital images that were identified by cell type and classified by difficulty by a panel of expert hematopathologists. The key manipulations in our study included examining the speed-accuracy tradeoff as well as the impact of prior expectations on decisions. In addition, our study examined individual differences in decision-making by comparing task performance to domain-general visual ability (as measured using the Novel Object Memory Test (NOMT); Richler et al., 2017). Using Signal Detection Theory (SDT) and the Diffusion Decision Model (DDM), we found many similarities between experts and novices in our task. While experts tended to have better discriminability, the two groups responded similarly to time pressure (i.e., reduced caution under speed instructions in the DDM) and to the introduction of a probabilistic cue (i.e., increased response bias in the DDM). These results have important implications for training in this area as well as for using novice participants in research on medical image perception and decision-making.
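As a rough illustration of the Signal Detection Theory measures mentioned above, the following sketch computes discriminability (d') and response criterion (c) from response counts. It is not the authors' analysis code: the function name `sdt_measures`, the count-based interface, and the log-linear correction (adding 0.5 to each count so that rates of exactly 0 or 1 stay finite) are assumptions made for this example.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) for one participant.

    Hypothetical helper for illustration only. A log-linear correction
    (+0.5 per cell) is assumed so that hit or false-alarm rates of
    exactly 0 or 1 do not produce infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)             # discriminability
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # response bias
    return d_prime, criterion


# Example with made-up counts: 90 hits, 10 misses, 10 false alarms,
# 90 correct rejections.
d_prime, criterion = sdt_measures(90, 10, 10, 90)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")
```

Under this convention, a larger d' indicates better separation of the two image categories, and a criterion near zero indicates no bias toward either response.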
Clinical decision support tools (DSTs) promise improved healthcare outcomes by offering data-driven insights. While effective in lab settings, almost all DSTs have failed in practice. Empirical research diagnosed poor contextual fit as the cause. This paper describes the design and field evaluation of a radically new form of DST. It automatically generates slides for clinicians' decision meetings with subtly embedded machine prognostics. This design took inspiration from the notion of Unremarkable Computing: by augmenting the users' routines, technology/AI can have significant importance for the users yet remain unobtrusive. Our field evaluation suggests clinicians are more likely to encounter and embrace such a DST. Drawing on their responses, we discuss the importance and intricacies of finding the right level of unremarkableness in DST design, and share lessons learned in prototyping critical AI systems as a situated experience.
An important role carried out by cyber-security experts is the assessment of proposed computer systems during their design stage. This task is fraught with difficulties and uncertainty, making the knowledge provided by human experts essential for successful assessment. Today, the increasing number of progressively complex systems has led to an urgent need to produce tools that support the expert-led process of system-security assessment. In this research, we use weighted averages (WAs) and ordered weighted averages (OWAs) with evolutionary algorithms (EAs) to create aggregation operators that model parts of the assessment process. We show how individual overall ratings for security components can be produced from ratings of their characteristics, and how these individual overall ratings can be aggregated to produce overall rankings of potential attacks on a system. As well as identifying salient attacks and weak points in a prospective system, the proposed method also highlights which factors and security components contribute most to a component's difficulty and attack ranking, respectively. A real-world scenario is used in which experts were asked to rank a set of technical attacks, and to answer a series of questions about the security components that are the subject of the attacks. The work shows how finding good aggregation operators, and identifying important components and factors of a cyber-security problem, can be automated. The resulting operators have the potential for use as decision aids for systems designers and cyber-security experts, increasing the amount of assessment that can be achieved with the limited resources available.
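To make the distinction between the two aggregation operators concrete, here is a minimal sketch of a weighted average (WA) and an ordered weighted average (OWA). The function names, the example ratings, and the weights are illustrative assumptions; in the work described above the weights are found by evolutionary algorithms, which this sketch does not attempt to reproduce.

```python
def weighted_average(values, weights):
    """WA: each weight is tied to a specific input (e.g. a specific factor)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def ordered_weighted_average(values, weights):
    """OWA: each weight is tied to a rank position, not to a specific input.

    Values are sorted in descending order first, so weights[0] applies to
    the largest value, weights[1] to the second largest, and so on.
    """
    ordered = sorted(values, reverse=True)
    return weighted_average(ordered, weights)


# Hypothetical difficulty ratings for three characteristics of one component.
ratings = [1.0, 2.0, 3.0]
weights = [0.5, 0.3, 0.2]

print(weighted_average(ratings, weights))          # weights follow input order
print(ordered_weighted_average(ratings, weights))  # weights follow rank order
```

With these made-up numbers the OWA result is higher than the WA result, because the largest weight is applied to the largest rating regardless of which characteristic produced it.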
The frequency with which people interact with technology means that users may develop interface habits, i.e., fast, automatic responses to stable interface cues. Design guidelines often assume that interface habits are beneficial. However, we lack quantitative evidence of how the development of habits actually affects user performance, and an understanding of how changes in the interface design may affect habit development. Our work quantifies the effect of habit formation and disruption on user performance in interaction. Through a forced-choice lab study task (n=19) and an in-the-wild deployment (n=18) of a notification dialog experiment on smartphones, we show that people become more accurate and faster at option selection as they develop an interface habit. Crucially, this performance gain is entirely eliminated once the habit is disrupted. We discuss reasons for this performance shift and analyse some disadvantages of interface habits, outlining general design patterns on how to both support and disrupt them. Keywords: Interface habits, user behaviour, breaking habit, interaction science, quantitative research.
We present an in-depth analysis of the impact of multi-word suggestion choices from a neural language model on user behaviour regarding input and text composition in email writing. Our study for the first time compares different numbers of parallel suggestions, and use by native and non-native English writers, to explore a trade-off of efficiency vs ideation, emerging from recent literature. We built a text editor prototype with a neural language model (GPT-2), refined in a prestudy with 30 people. In an online study (N=156), people composed emails in four conditions (0/1/3/6 parallel suggestions). Our results reveal (1) benefits for ideation, and costs for efficiency, when suggesting multiple phrases; (2) that non-native speakers benefit more from more suggestions; and (3) further insights into behaviour patterns. We discuss implications for research, the design of interactive suggestion systems, and the vision of supporting writers with AI instead of replacing them.