The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as one of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical and technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analyses across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe that their effect on measures of fairness in calibration and ranking is heterogeneous across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
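To make the penalized-training setup concrete, the following is a minimal sketch, not the authors' code, of training a linear risk model with a penalty on the difference between the distributions of predictions across two groups; the squared gap between group mean predictions stands in for richer distributional penalties such as MMD, and all data and names are illustrative.

```python
import torch

def fairness_penalized_loss(logits, labels, group, lam=1.0):
    """Cross-entropy plus a penalty on the gap between mean predictions per group."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    probs = torch.sigmoid(logits)
    gap = probs[group == 0].mean() - probs[group == 1].mean()
    return bce + lam * gap ** 2

# Toy usage on synthetic data.
n, d = 256, 16
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()
g = torch.randint(0, 2, (n,))          # binary sensitive attribute
w = torch.zeros(d, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = fairness_penalized_loss(X @ w, y, g, lam=5.0)
    loss.backward()
    opt.step()
```

Raising `lam` trades within-group performance for smaller between-group prediction gaps, which is exactly the trade-off the study measures.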
The use of machine learning systems to support decision making in healthcare raises questions as to what extent these systems may introduce or exacerbate disparities in care for historically underrepresented and mistreated groups, due to biases implicitly embedded in observational data in electronic health records. To address this problem in the context of clinical risk prediction models, we develop an augmented counterfactual fairness criterion that extends the group fairness criterion of equalized odds to the individual level. We do so by requiring that the same prediction be made for a patient and for a counterfactual patient resulting from changing a sensitive attribute, if the factual and counterfactual outcomes do not differ. We investigate the extent to which the augmented counterfactual fairness criterion may be applied to develop fair models for prolonged inpatient length of stay and mortality with observational electronic health records data. As the fairness criterion is ill-defined without knowledge of the data generating process, we use a variational autoencoder to perform counterfactual inference in the context of an assumed causal graph. While our technique provides a means to trade off the maintenance of fairness against a reduction in predictive performance in the context of a learned generative model, further work is needed to assess the generality of this approach.
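A minimal sketch of how such a criterion could be turned into a training penalty appears below; it assumes hypothetical components `counterfactual` (standing in for VAE-based counterfactual inference), `outcome_cf` (the inferred counterfactual outcome), and a binary sensitive attribute, none of which are taken from the paper itself.

```python
import torch

def cf_fairness_penalty(model, x, y, a, counterfactual, outcome_cf, lam=1.0):
    """Penalize prediction gaps between factual and counterfactual patients
    whose (inferred) outcomes agree."""
    a_cf = 1 - a                          # flip the binary sensitive attribute
    x_cf = counterfactual(x, a_cf)        # counterfactual patient (a VAE in the paper)
    y_cf = outcome_cf(x, a_cf)            # inferred counterfactual outcome
    same_outcome = (y == y_cf).float()    # criterion applies only when outcomes agree
    gap = model(x) - model(x_cf)
    return lam * (same_outcome * gap ** 2).mean()

# Toy usage with placeholder components.
x = torch.randn(32, 8)
a = torch.randint(0, 2, (32,)).float()
y = torch.randint(0, 2, (32,)).float()
model = lambda x: torch.sigmoid(x.sum(dim=1))
counterfactual = lambda x, a_cf: x + 0.1 * a_cf.unsqueeze(1)  # placeholder generator
outcome_cf = lambda x, a_cf: y                                 # placeholder: outcomes unchanged
penalty = cf_fairness_penalty(model, x, y, a, counterfactual, outcome_cf)
```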
Many in-hospital mortality risk prediction scores dichotomize predictive variables to simplify the score calculation. However, hard thresholding in these additive stepwise scores, of the form "add x points if variable v is above/below threshold t", may lead to critical failures. In this paper, we seek to develop risk prediction scores that preserve the clinical knowledge embedded in the features and structure of existing additive stepwise scores while addressing the limitations caused by variable dichotomization. To this end, we propose a novel score structure that relies on a transformation of predictive variables by means of nonlinear logistic functions, facilitating smooth differentiation between critical and normal values of the variables. We develop an optimization framework for inferring the parameters of the logistic functions for a given patient population via cyclic block coordinate descent. The parameters may readily be updated as the patient population and standards of care evolve. We tested the proposed methodology on two populations: (1) brain trauma patients admitted to the intensive care unit of the Dell Children's Medical Center of Central Texas between 2007 and 2012, and (2) adult ICU patients from the MIMIC II database. The results are compared with those obtained by the widely used PRISM III and SOFA scores. The predictive power of a score is evaluated using the area under the ROC curve, Youden's index, and precision-recall balance in a cross-validation study. The results demonstrate that the new framework enables significant performance improvements over PRISM III and SOFA in terms of all three criteria.
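As a rough illustration of the proposed score structure (the exact parameterization is not specified in the abstract), the sketch below replaces the hard rule "add x points if v is above/below t" with a logistic term whose threshold and steepness would be fit per variable, e.g., via the cyclic block coordinate descent the paper describes; all parameter values are hypothetical.

```python
import numpy as np

def smooth_score_term(v, points, threshold, steepness):
    """Logistic replacement for 'add `points` if v crosses `threshold`';
    the sign of `steepness` selects above- vs. below-threshold rules."""
    return points / (1.0 + np.exp(-steepness * (v - threshold)))

def smooth_risk_score(values, params):
    """Additive score: a sum of logistic terms over predictive variables."""
    return sum(smooth_score_term(v, *p) for v, p in zip(values, params))

# Toy usage: one variable scored above its threshold, one scored below,
# with hypothetical (points, threshold, steepness) triples per variable.
print(smooth_risk_score([38.5, 90.0], [(2.0, 38.0, 4.0), (3.0, 100.0, -0.2)]))
```

Near the threshold the contribution varies smoothly with v rather than jumping, which is the failure mode of dichotomized scores that the method targets.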
Artificial Intelligence is one of the fastest growing technologies of the 21st century and accompanies us in our daily lives when we interact with technical applications. However, trust in such technical systems is crucial for their widespread applicability and acceptance. The societal tools to express such trust are usually formalized by lawful regulations, i.e., standards, norms, accreditations, and certificates. Therefore, the TÜV AUSTRIA Group, in cooperation with the Institute for Machine Learning at the Johannes Kepler University Linz, proposes a certification process and an audit catalog for Machine Learning applications. We are convinced that our approach can serve as the foundation for the certification of applications that use Machine Learning and Deep Learning, the techniques that drive the current revolution in Artificial Intelligence. While certain high-risk areas, such as fully autonomous robots in workspaces shared with humans, are still some time away from certification, we aim to cover low-risk applications with our certification procedure. Our holistic approach analyzes Machine Learning applications from multiple perspectives to evaluate and verify the aspects of secure software development, functional requirements, data quality, data protection, and ethics. Inspired by existing work, we introduce four criticality levels that map the criticality of a Machine Learning application with regard to the impact of its decisions on people, the environment, and organizations. Currently, the audit catalog can be applied to low-risk applications within the scope of supervised learning as commonly encountered in industry. Guided by field experience, scientific developments, and market demands, the audit catalog will be extended and modified accordingly.
Recently developed machine learning techniques, in combination with the Internet of Things (IoT), enable a method for increasing oil production from heavy-oil wells. Steam flood injection, a widely used enhanced oil recovery technique, uses thermal and gravitational potential to mobilize and dilute heavy oil in situ and thereby increase oil production. In contrast to traditional steam flood simulations based on the principles of classical physics, we introduce an approach that uses modern machine learning techniques with the potential to describe steam flood performance more accurately. We propose a workflow for a category of time-series data that can be analyzed with supervised machine learning algorithms and IoT. We demonstrate the effectiveness of the technique for forecasting oil production in steam flood scenarios. Moreover, we build an optimization system that recommends an optimal steam allocation plan and show that it leads to a 3% improvement in oil production. We develop a minimum viable product on a cloud platform that implements real-time data collection, transfer, and storage, as well as the training and deployment of a cloud-based machine learning model. This workflow also offers an applicable solution to other problems with similar time-series data structures, such as predictive maintenance.
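As a minimal sketch of the supervised time-series setup described here (column names, window length, and the choice of regressor are assumptions, not taken from the paper), lagged production and steam-injection readings can serve as features for a standard regressor that forecasts next-step oil production.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def make_supervised(df, target="oil_rate", lags=7):
    """Turn a well's sensor time series into lagged feature rows."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        for k in range(1, lags + 1):
            out[f"{col}_lag{k}"] = df[col].shift(k)
    out["y"] = df[target]
    return out.dropna()

# Toy usage on synthetic daily readings.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "oil_rate": rng.normal(100.0, 10.0, 200),
    "steam_rate": rng.normal(50.0, 5.0, 200),
})
data = make_supervised(df)
X, y = data.drop(columns="y"), data["y"]
model = GradientBoostingRegressor().fit(X.iloc[:150], y.iloc[:150])
preds = model.predict(X.iloc[150:])
```

The same windowing step applies unchanged to other time-series problems such as predictive maintenance, which is the generality the workflow claims.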
Despite the intense attention and investment in clinical machine learning (CML) research, relatively few applications make the transition into clinical practice. While research is important in advancing the state of the art, translation is equally important in bringing these technologies into a position to ultimately impact patient care and live up to the extensive expectations surrounding AI in healthcare. To better characterize a holistic perspective among researchers and practitioners, we survey several participants with experience developing CML for clinical deployment about the lessons they have learned. We collate these insights and identify several main categories of barriers and pitfalls in order to better inform the design and development of clinical machine learning applications.