Accuracy and Robustness of Clustering Algorithms for Small-Size Applications in Bioinformatics

309 0 0.0 ( 0 )

Download Cite

Added by Fabio Rapallo

Publication date 2008

fields Mathematical Statistics

and research's language is English

Authors Pamela Minicozzi - Fabio Rapallo - Enrico Scalas

Applications

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The performance (accuracy and robustness) of several clustering algorithms is studied for linearly dependent random variables in the presence of noise. It turns out that the error percentage quickly increases when the number of observations is less than the number of variables. This situation is common situation in experiments with DNA microarrays. Moreover, an {it a posteriori} criterion to choose between two discordant clustering algorithm is presented.

rate research

Abstractions, Algorithms and Data Structures for Structural Bioinformatics in PyCogent

601 - Marcin Cieslik , Zygmunt Derewenda , Cameron Mura 2014

To facilitate flexible and efficient structural bioinformatics analyses, new functionality for three-dimensional structure processing and analysis has been introduced into PyCogent -- a popular feature-rich framework for sequence-based bioinformatics, but one which has lacked equally powerful tools for handling stuctural/coordinate-based data. Extensible Python modules have been developed, which provide object-oriented abstractions (based on a hierarchical representation of macromolecules), efficient data structures (e.g. kD-trees), fast implementations of common algorithms (e.g. surface-area calculations), read/write support for Protein Data Bank-related file formats and wrappers for external command-line applications (e.g. Stride). Integration of this code into PyCogent is symbiotic, allowing sequence-based work to benefit from structure-derived data and, reciprocally, enabling structural studies to leverage PyCogents versatile tools for phylogenetic and evolutionary analyses.

Biomolecules Data Structures and Algorithms Software Engineering

Adversarial Robustness of Deep Learning: Theory, Algorithms, and Applications

112 - Wenjie Ruan , Xinping Yi , Xiaowei Huang 2021

This tutorial aims to introduce the fundamentals of adversarial robustness of deep learning, presenting a well-structured review of up-to-date techniques to assess the vulnerability of various types of deep learning models to adversarial examples. This tutorial will particularly highlight state-of-the-art techniques in adversarial attacks and robustness verification of deep neural networks (DNNs). We will also introduce some effective countermeasures to improve the robustness of deep learning models, with a particular focus on adversarial training. We aim to provide a comprehensive overall picture about this emerging direction and enable the community to be aware of the urgency and importance of designing robust deep learning models in safety-critical data analytical applications, ultimately enabling the end-users to trust deep learning classifiers. We will also summarize potential research directions concerning the adversarial robustness of deep learning, and its potential benefits to enable accountable and trustworthy deep learning-based data analytical systems and applications.

Machine Learning Artificial Intelligence

Dataset Lifecycle Framework and its applications in Bioinformatics

380 - Yiannis Gkoufas Technology 2021

Bioinformatics pipelines depend on shared POSIX filesystems for its input, output and intermediate data storage. Containerization makes it more difficult for the workloads to access the shared file systems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully. However, the storage solutions were complex and less optimal. This is because there are no established resource types to represent the concept of data source on Kubernetes. More and more applications are running on Kubernetes for batch processing. End users are burdened with configuring and optimising the data access, which is what we have experienced before. In this article, we are introducing a new concept of Dataset and its corresponding resource as a native Kubernetes object. We have leveraged the Dataset Lifecycle Framework which takes care of all the low-level details about data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling and governance plugins. Together, they manage the entire lifecycle of the custom resource Dataset. We use Dataset Lifecycle Framework to serve data from object stores to both ML and non-ML pipelines running on Kubeflow. With DLF, we make training data fed into ML models directly without being downloaded to the local disks, which makes the input scalable. We have enhanced the durability of training metadata by storing it into a dataset, which also simplifies the set up of the Tensorboard, separated from the notebook server. For the non-ML pipeline, we have simplified the 1000 Genome Project pipeline with datasets injected into the pipeline dynamically. In addition, our preliminary results indicate that the pluggable caching mechanism can improve the performance significantly.

Distributed Parallel and Cluster Computing Emerging Technologies

Accuracy, Repeatability, and Reproducibility of Firearm Comparisons Part 1: Accuracy

42 - L. Scott Chumbley , Max D. Morris , Stanley J. Bajic 2021

Researchers at the Ames Laboratory-USDOE and the Federal Bureau of Investigation (FBI) conducted a study to assess the performance of forensic examiners in firearm investigations. The study involved three different types of firearms and 173 volunteers who compared both bullets and cartridge cases. The total number of comparisons reported is 20,130, allocated to assess accuracy (8,640), repeatability (5,700), and reproducibility (5,790) of the evaluations made by participating examiners. The overall false positive error rate was estimated as 0.656% and 0.933% for bullets and cartridge cases, respectively, while the rate of false negatives was estimated as 2.87% and 1.87% for bullets and cartridge cases, respectively. Because chi-square tests of independence strongly suggest that error probabilities are not the same for each examiner, these are maximum likelihood estimates based on the beta-binomial probability model and do not depend on an assumption of equal examiner-specific error rates. Corresponding 95% confidence intervals are (0.305%,1.42%) and (0.548%,1.57%) for false positives for bullets and cartridge cases, respectively, and (1.89%,4.26%) and (1.16%,2.99%) for false negatives for bullets and cartridge cases, respectively. These results are based on data representing all controlled conditions considered, including different firearm manufacturers, sequence of manufacture, and firing separation between unknown and known comparison specimens. The results are consistent with those of prior studies, despite its more robust design and challenging specimens.

Applications

Accurate and Efficient Estimation of Small P-values with the Cross-Entropy Method: Applications in Genomic Data Analysis

104 - Yang Shi , Mengqiao Wang , Weiping Shi 2018

Small $p$-values are often required to be accurately estimated in large scale genomic studies for the adjustment of multiple hypothesis tests and the ranking of genomic features based on their statistical significance. For those complicated test statistics whose cumulative distribution functions are analytically intractable, existing methods usually do not work well with small $p$-values due to lack of accuracy or computational restrictions. We propose a general approach for accurately and efficiently calculating small $p$-values for a broad range of complicated test statistics based on the principle of the cross-entropy method and Markov chain Monte Carlo sampling techniques. We evaluate the performance of the proposed algorithm through simulations and demonstrate its application to three real examples in genomic studies. The results show that our approach can accurately evaluate small to extremely small $p$-values (e.g. $10^{-6}$ to $10^{-100}$). The proposed algorithm is helpful to the improvement of existing test procedures and the development of new test procedures in genomic studies.

Applications