Chronic Kidney Disease (CKD) is an increasingly prevalent condition affecting 13% of the US population. The disease is often silent, making its diagnosis challenging. Identifying CKD stages from standard office visit records can help in early detection of the disease and lead to timely intervention. The dataset we use is highly imbalanced. We propose a hierarchical meta-classification method that stratifies CKD by severity level, employing simple quantitative non-text features gathered from office visit records while addressing the data imbalance. Our method effectively stratifies CKD severity levels, achieving high average sensitivity, precision, and F-measure (~93%). We also conduct experiments in which the dimensionality of the data is significantly reduced to include only the most salient features. Our results show that the good performance of our system is retained even when using the reduced feature sets, as well as under much smaller training sets, indicating that our method is stable and generalizable.
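As a purely illustrative sketch (not the authors' exact method), a hierarchical scheme of the kind described above could be organized as a first-level detector separating CKD from non-CKD cases, followed by a second-level classifier that stratifies severity among the positives, with class weighting to counter the imbalance. The model choice, stage encoding, and interface below are all assumptions.

# Hypothetical sketch of a two-level (hierarchical) classifier for CKD
# severity; models and stage encoding are illustrative, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression

class HierarchicalCKDClassifier:
    """Level 1: CKD vs. no CKD; Level 2: severity stage among CKD cases."""

    def __init__(self):
        # class_weight="balanced" counters the heavy class imbalance
        self.detector = LogisticRegression(class_weight="balanced", max_iter=1000)
        self.stager = LogisticRegression(class_weight="balanced", max_iter=1000)

    def fit(self, X, stages):
        # stages: 0 = no CKD, 1..4 = increasing severity (assumed encoding)
        has_ckd = (stages > 0).astype(int)
        self.detector.fit(X, has_ckd)
        mask = stages > 0
        self.stager.fit(X[mask], stages[mask])  # stager sees CKD cases only
        return self

    def predict(self, X):
        out = np.zeros(len(X), dtype=int)
        pos = self.detector.predict(X) == 1
        if pos.any():
            out[pos] = self.stager.predict(X[pos])
        return out

The key design point in such a scheme is that the second-level model is trained only on confirmed CKD cases, so the minority severe-stage classes are not swamped by the far more numerous healthy records.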
Chronic Kidney Disease (CKD) currently shows a globally increasing incidence and imposes high costs on health systems. Delayed recognition implies premature mortality due to progressive loss of kidney function. The use of data mining to discover subtle patterns in CKD indicators would contribute to achieving early diagnosis. This work presents the development and evaluation of an explainable prediction model to support clinicians in the early diagnosis of CKD. The model development is based on a data management pipeline that detects the best combination of ensemble tree algorithms and selected features with respect to classification performance. The results obtained through the pipeline equal the performance of the best CKD prediction models identified in the literature. Furthermore, the main contribution of the paper is an explainability-driven approach that allows selecting the best prediction model while maintaining a balance between accuracy and explainability. The most balanced explainable prediction model of CKD implements an XGBoost classifier over a group of four features (packed cell volume, specific gravity, albumin, and hypertension), achieving accuracies of 98.9% under cross-validation and 97.5% on new unseen data. In addition, by analysing the model's explainability by means of different post-hoc techniques, packed cell volume and specific gravity are determined to be the most relevant features influencing the model's predictions. This small number of selected features results in a reduced cost for the early diagnosis of CKD, making the approach a promising solution for developing countries.
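The abstract names the classifier (XGBoost) and the four selected features; the following is a minimal sketch, assuming a CSV file with those columns, a binary "class" label, and the SHAP library for the post-hoc explanation step. The file name, column names, label encoding, and hyperparameters are assumptions, not details taken from the paper.

# Sketch: XGBoost over the four features named in the abstract, with a
# SHAP post-hoc explanation. File name, columns, and encoding are assumed.
import pandas as pd
import shap
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["packed_cell_volume", "specific_gravity", "albumin", "hypertension"]

df = pd.read_csv("ckd.csv")             # hypothetical file with these columns
X = df[FEATURES]                        # hypertension assumed encoded as 0/1
y = (df["class"] == "ckd").astype(int)  # assumed label encoding

model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
print(cross_val_score(model, X, y, cv=10).mean())  # cross-validated accuracy

model.fit(X, y)
explainer = shap.TreeExplainer(model)   # post-hoc explainability on trees
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)       # global per-feature influence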
Cardiac auscultation is one of the most cost-effective techniques used to detect and identify many heart conditions. Computer-assisted decision systems based on auscultation can support physicians in their decisions. Unfortunately, the application of such systems in clinical trials is still minimal, since most of them only aim to detect the presence of extra or abnormal waves in the phonocardiogram signal. This is mainly due to the lack of large publicly available datasets in which a more detailed description of such abnormal waves (e.g., cardiac murmurs) exists. As a result, current machine learning algorithms are unable to classify such waves. To pave the way to more effective research on healthcare recommendation systems based on auscultation, our team has prepared the currently largest pediatric heart sound dataset. A total of 5282 recordings have been collected from the four main auscultation locations of 1568 patients; in the process, 215780 heart sounds have been manually annotated. Furthermore, and for the first time, each cardiac murmur has been manually annotated by an expert annotator according to its timing, shape, pitch, grading, and quality. In addition, the auscultation locations where the murmur is present were identified, as well as the auscultation location where the murmur is detected most intensely.
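Purely as an illustration of the annotation schema described above, a per-murmur record might look as follows; the field names mirror the abstract, while the types and example values in the comments are assumptions rather than the dataset's actual vocabulary.

# Illustrative record type for the murmur annotations described above;
# field names follow the abstract, types and values are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MurmurAnnotation:
    patient_id: str
    timing: str            # e.g. "early-systolic" (assumed vocabulary)
    shape: str             # e.g. "crescendo", "plateau"
    pitch: str             # e.g. "low", "medium", "high"
    grading: str           # e.g. "I/VI" .. "VI/VI"
    quality: str           # e.g. "blowing", "harsh"
    locations: List[str] = field(default_factory=list)  # where murmur is audible
    most_intense_location: str = ""                     # loudest auscultation point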
Biomedical data are widely used in developing prediction models for identifying specific tumours, drug discovery, and the classification of human cancers. However, previous studies have usually focused on different classifiers and overlooked the class imbalance problem present in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for imbalanced biomedical data learning, and the relationship between data pre-processing techniques and data distributions has not been analysed in previous studies. This article focuses on reviewing and evaluating popular and recently developed resampling and feature selection methods for class imbalance learning, analysing the effectiveness of each technique from a data distribution perspective. Extensive experiments have been conducted with five classifiers, four performance measures, and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques perform better with a support vector machine (SVM) classifier, but poorly with C4.5 decision tree and linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as random undersampling and feature selection outperform other data pre-processing methods on the t location-scale distribution when using SVM and k-nearest neighbours (KNN) classifiers, while random oversampling outperforms other methods on the negative binomial distribution when using a random forest classifier at lower imbalance ratios; (3) feature selection outperforms the other data pre-processing methods in most cases; thus, feature selection with an SVM classifier is the best choice for imbalanced biomedical data learning.
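A minimal sketch of the combination the abstract reports as most effective, resampling plus feature selection feeding an SVM, assuming the imbalanced-learn library and a synthetic stand-in dataset; the number of retained features, imbalance ratio, and other parameters are illustrative.

# Sketch of random undersampling + feature selection + SVM, the pairing
# the abstract highlights; the dataset and parameters are stand-ins.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for a real biomedical dataset (90/10 split)
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.9], random_state=0)

pipe = Pipeline([
    ("undersample", RandomUnderSampler(random_state=0)),  # rebalance classes
    ("select", SelectKBest(f_classif, k=10)),             # keep top-10 features
    ("svm", SVC(kernel="rbf")),
])

# F-measure under cross-validation, one of the imbalance-aware metrics
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())

Because the sampler lives inside the pipeline, undersampling is re-applied only to each training fold, avoiding leakage into the evaluation folds.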
Although Electronic Health Records (EHRs) contain a large amount of information about the patient's condition and response to treatment, which could potentially revolutionize clinical practice, such information is seldom considered due to the complexity of its extraction and analysis. We report here on a first integration of an NLP framework for the analysis of clinical records of lung cancer patients using the telephone assistance service of a major Spanish hospital. We specifically show how some relevant data about patient demographics and health condition can be extracted, and how some relevant analyses can be performed, aimed at improving the usefulness of the service. We thus demonstrate that the use of EHR texts, and their integration into a data analysis framework, is technically feasible and worthy of further study.
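As an illustration only (the paper's NLP framework is not described in detail here), even a small rule-based extractor can pull demographic fields from free-text Spanish clinical notes; the note text and patterns below are invented examples.

# Illustrative (not the paper's framework): a tiny rule-based extractor
# pulling age and sex from an invented free-text Spanish clinical note.
import re

NOTE = "Paciente varon de 67 anos, diagnosticado de cancer de pulmon."

def extract_demographics(text: str) -> dict:
    age = re.search(r"(\d{1,3})\s*a[nñ]os", text)                 # "67 anos"
    sex = re.search(r"\b(varon|varón|mujer)\b", text, re.IGNORECASE)
    return {
        "age": int(age.group(1)) if age else None,
        "sex": sex.group(1).lower() if sex else None,
    }

print(extract_demographics(NOTE))  # {'age': 67, 'sex': 'varon'}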
Agent-Based Models (ABMs) are a powerful class of computational models widely used to simulate complex phenomena in many different application areas. However, one of the most critical aspects, poorly investigated in the literature, concerns an important step of model credibility assessment: solution verification. This study addresses this gap by proposing a general verification framework for Agent-Based Models that aims at evaluating the numerical errors associated with a model. A step-by-step procedure, consisting of two main verification studies (deterministic and stochastic model verification), is described in detail and applied to a mission-critical scenario: quantifying the numerical approximation error of UISS-TB, an ABM of the human immune system developed to predict the progression of pulmonary tuberculosis. The results indicate that the proposed model verification workflow can be used to systematically identify and quantify the numerical approximation errors associated with UISS-TB and, more generally, with any other ABM.
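A minimal sketch of the stochastic-verification idea, assuming a scalar model output and a hypothetical run_model stand-in for one ABM execution: repeated runs over different seeds yield a Monte Carlo estimate whose confidence-interval half-width quantifies the stochastic approximation error.

# Sketch of a stochastic-verification step: run the model over many seeds
# and quantify the Monte Carlo error on an output of interest.
# `run_model` is a hypothetical stand-in for one stochastic ABM run.
import numpy as np

def run_model(seed: int) -> float:
    # Placeholder returning a scalar output for a given random seed
    rng = np.random.default_rng(seed)
    return rng.normal(loc=10.0, scale=2.0)

def stochastic_verification(n_runs: int = 200, tol: float = 0.1):
    outputs = np.array([run_model(s) for s in range(n_runs)])
    mean = outputs.mean()
    sem = outputs.std(ddof=1) / np.sqrt(n_runs)  # standard error of the mean
    ci_half = 1.96 * sem                         # ~95% CI half-width
    converged = ci_half / abs(mean) < tol        # relative error below tol?
    return mean, ci_half, converged

print(stochastic_verification())

Increasing n_runs until the relative half-width falls below the chosen tolerance gives a systematic, reportable bound on the stochastic component of the numerical error.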