Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

98 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Danielle Braun

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yujia Bao - Zhengyi Deng - Yan Wang

استرجاع المعلومات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools monitoring and prioritizing the literature to understand the clinical implications of the pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated dataset for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule based on the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule based on the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3740 paper titles and abstracts and used 60% for training the model, 20% for tuning the model, and 20% for evaluating the model. The SVM model achieves 89.53% accuracy (percentage of papers that were correctly classified) while the CNN model achieves 88.95 % accuracy. For prevalence classification, we annotated 3753 paper titles and abstracts. The SVM model achieves 89.14% accuracy while the CNN model achieves 89.13 % accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.

قيم البحث

84 - Wolfgang E. Kerzendorf , Ferdinando Patat , Dominic Bordelon 2020

While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science. A very common method is that proposals are evaluated by a small panel of experts (due to logist ics and funding limitations) nominated by the grant-giving institutions. The expert panel process introduces several issues - most notably: 1) biases introduced in the selection of the panel. 2) experts have to read a very large number of proposals. Distributed Peer Review promises to alleviate several of the described problems by distributing the task of reviewing among the proposers. Each proposer is given a limited number of proposals to review and rank. We present the result of an experiment running a machine-learning enhanced distributed peer review process for allocation of telescope time at the European Southern Observatory. In this work, we show that the distributed peer review is statistically the same as a `traditional panel, that our machine learning algorithm can predict expertise of reviewers with a high success rate, and we find that seniority and reviewer expertise have an influence on review quality. The general experience has been overwhelmingly praised from the participating community (using an anonymous feedback mechanism).

الأجهزة والأساليب للزيئات الفيزياء الفلكية الفيزياء والمجتمع

Bangla Natural Language Processing: A Comprehensive Review of Classical, Machine Learning, and Deep Learning Based Methods

102 - Ovishake Sen , Mohtasim Fuad , MD. Nazrul Islam 2021

The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide. However, English is the predominant language for online resources and technical knowledge, journals, and documentation. Consequently, many Bangla-speaking people, who have limited command of English, face hurdles to utilize English resources. To bridge the gap between limited support and increasing demand, researchers conducted many experiments and developed valuable tools and techniques to create and process Bangla language materials. Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains. There are some review papers to understand the past, previous, and future Bangla Natural Language Processing (BNLP) trends. The studies are mainly concentrated on the specific domains of BNLP, such as sentiment analysis, speech recognition, optical character recognition, and text summarization. There is an apparent scarcity of resources that contain a comprehensive study of the recent BNLP tools and methods. Therefore, in this paper, we present a thorough review of 71 BNLP research papers and categorize them into 11 categories, namely Information Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing and Recognition. We study articles published between 1999 to 2021, and 50% of the papers were published after 2015. We discuss Classical, Machine Learning and Deep Learning approaches with different datasets while addressing the limitations and current and future trends of the BNLP.

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

Using Machine Learning to Generate Test Oracles: A Systematic Literature Review

91 - Afonso Fontes , Gregory Gay 2021

Machine learning may enable the automated generation of test oracles. We have characterized emerging research in this area through a systematic literature review examining oracle types, researcher goals, the ML techniques applied, how the generation process was assessed, and the open research challenges in this emerging field. Based on a sample of 22 relevant studies, we observed that ML algorithms generated test verdict, metamorphic relation, and - most commonly - expected output oracles. Almost all studies employ a supervised or semi-supervised approach, trained on labeled system executions or code metadata - including neural networks, support vector machines, adaptive boosting, and decision trees. Oracles are evaluated using the mutation score, correct classifications, accuracy, and ROC. Work-to-date show great promise, but there are significant open challenges regarding the requirements imposed on training data, the complexity of modeled functions, the ML algorithms employed - and how they are applied - the benchmarks used by researchers, and replicability of the studies. We hope that our findings will serve as a roadmap and inspiration for researchers in this field.

هندسة البرمجيات

A Semi-Supervised Machine Learning Approach to Detecting Recurrent Metastatic Breast Cancer Cases Using Linked Cancer Registry and Electronic Medical Record Data

104 - Albee Y. Ling , Allison W. Kurian , Jennifer L. Caswell-Jin 2019

Objectives: Most cancer data sources lack information on metastatic recurrence. Electronic medical records (EMRs) and population-based cancer registries contain complementary information on cancer treatment and outcomes, yet are rarely used synergist ically. To enable detection of metastatic breast cancer (MBC), we applied a semi-supervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods: We studied 11,459 female patients treated at Stanford Health Care who received an incident breast cancer diagnosis from 2000-2014. The dataset consisted of structured data and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results (SEER) database. We extracted information on metastatic disease from patient notes to infer a class label and then trained a regularized logistic regression model for MBC classification. We evaluated model performance on a gold standard set of set of 146 patients. Results: There are 495 patients with de novo stage IV MBC, 1,374 patients initially diagnosed with Stage 0-III disease had recurrent MBC, and 9,590 had no evidence of metastatis. The median follow-up time is 96.3 months (mean 97.8, standard deviation 46.7). The best-performing model incorporated both EMR and CCR features. The area under the receiver-operating characteristic curve=0.925 [95% confidence interval: 0.880-0.969], sensitivity=0.861, specificity=0.878 and overall accuracy=0.870. Discussion and Conclusion: A framework for MBC case detection combining EMR and CCR data achieved good sensitivity, specificity and discrimination without requiring expert-labeled examples. This approach enables population-based research on how patients die from cancer and may identify novel predictors of cancer recurrence.

الأساليب الكمية التعلم الآلي التعلم الالي

Deep Natural Language Processing for LinkedIn Search

104 - Weiwei Guo , Xiaowei Liu , Sida Wang 2021

Many search systems work with large amounts of natural language data, e.g., search queries, user profiles, and documents. Building a successful search system requires a thorough understanding of textual data semantics, where deep learning based natur al language processing techniques (deep NLP) can be of great help. In this paper, we introduce a comprehensive study for applying deep NLP techniques to five representative tasks in search systems: query intent prediction (classification), query tagging (sequential tagging), document ranking (ranking), query auto completion (language modeling), and query suggestion (sequence to sequence). We also introduce BERT pre-training as a sixth task that can be applied to many of the other tasks. Through the model design and experiments of the six tasks, readers can find answers to four important questions: (1). When is deep NLP helpful/not helpful in search systems? (2). How to address latency challenges? (3). How to ensure model robustness? This work builds on existing efforts of LinkedIn search, and is tested at scale on LinkedIns commercial search engines. We believe our experiences can provide useful insights for the industry and research communities.

استرجاع المعلومات الحساب واللغة