
It's easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets

Published by Subhashini Venugopalan
Publication date: 2019
Language: English





Confounding variables are a well-known source of nuisance in biomedical studies. They present an even greater challenge when combined with black-box machine learning techniques that operate on raw data. This work presents two case studies. In one, we discovered biases arising from systematic errors in the data generation process. In the other, we found a spurious source of signal unrelated to the prediction task at hand. In both cases, our prediction models performed well, but careful examination revealed hidden confounders and biases. These are cautionary tales on the limits of using machine learning techniques on raw data from scientific experiments.
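One simple way to probe for the kind of hidden confounder the abstract describes is to ask how well the label can be predicted from a nuisance variable alone. The sketch below is illustrative only and is not the authors' method: the function names, the `batches` variable, and the toy data are all hypothetical. It compares a "predict the majority label within each batch" rule against a plain majority-class baseline.

```python
from collections import Counter, defaultdict


def nuisance_only_accuracy(nuisance, labels):
    """Accuracy of predicting each label from the nuisance value alone.

    For every nuisance value, predict the most common label seen with it.
    Scored on the same data, so this is an optimistic (in-sample) estimate.
    """
    by_value = defaultdict(Counter)
    for value, label in zip(nuisance, labels):
        by_value[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the single most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy data: a processing batch per sample and a binary label.
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

acc = nuisance_only_accuracy(batches, labels)
base = majority_baseline(labels)
print(f"batch-only accuracy: {acc:.2f}, majority baseline: {base:.2f}")
```

If the nuisance-only accuracy is clearly above the baseline, a model trained on raw data may be picking up the nuisance variable rather than the biology. In practice this check would be run on held-out data rather than in-sample.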




Read also

In recent years, AI generated art has become very popular. From generating artworks in the style of famous artists like Paul Cezanne and Claude Monet to simulating styles of art movements like Ukiyo-e, a variety of creative applications have been explored using AI. Looking from an art historical perspective, these applications raise some ethical questions. Can AI model artists' styles without stereotyping them? Does AI do justice to the socio-cultural nuances of art movements? In this work, we take a first step towards analyzing these issues. Leveraging directed acyclic graphs to represent the potential process of art creation, we propose a simple metric to quantify confounding bias due to the lack of modeling the influence of art movements in learning artists' styles. As a case study, we consider the popular cycleGAN model and analyze confounding bias across various genres. The proposed metric is more effective than a state-of-the-art outlier detection method in understanding the influence of art movements in artworks. We hope our work will elucidate important shortcomings of computationally modeling artists' styles and trigger discussions related to accountability of AI generated art.
Since its renaissance, deep learning has been widely used in various medical imaging tasks and has achieved remarkable success in many medical imaging applications, thereby propelling us into the so-called artificial intelligence (AI) era. It is known that the success of AI is mostly attributed to the availability of big data with annotations for a single task and the advances in high performance computing. However, medical imaging presents unique challenges that confront deep learning approaches. In this survey paper, we first present traits of medical imaging, highlight both clinical needs and technical challenges in medical imaging, and describe how emerging trends in deep learning are addressing these issues. We cover the topics of network architecture, sparse and noisy labels, federated learning, interpretability, uncertainty quantification, etc. Then, we present several case studies that are commonly found in clinical practice, including digital pathology and chest, brain, cardiovascular, and abdominal imaging. Rather than presenting an exhaustive literature survey, we instead describe some prominent research highlights related to these case study applications. We conclude with a discussion and presentation of promising future directions.
Traditional regression models do not generalize well when learning from small and noisy datasets. Here we propose a novel metamodel structure to improve the regression result. The metamodel is composed of multiple classification base models and a regression model built upon the base models. We test this structure on the prediction of autism spectrum disorder (ASD) severity as measured by the ADOS communication (ADOS COMM) score from resting-state fMRI data, using a variety of base models. The metamodel outperforms traditional regression models as measured by the Pearson correlation coefficient between true and predicted scores and stability. In addition, we found that the metamodel is more flexible and more generalizable.
Xiang Wan, Can Yang, Qiang Yang (2010)
Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation based Screening and Testing" (BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4 GB of memory running Windows XP. The interaction patterns identified from the type 1 diabetes data set differ significantly from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.
Xuehai He, Shu Chen, Zeqian Ju (2020)
Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations between patients and doctors and 0.5 million utterances. MedDialog-CN is a Chinese dataset containing 1.1 million conversations and 4 million utterances. To the best of our knowledge, MedDialog-(EN,CN) are the largest medical dialogue datasets to date. The dataset is available at https://github.com/UCSD-AI4H/Medical-Dialogue-System
