The variable quality of metadata about biological samples used in biomedical experiments

48 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Rafael S. Gon\\c{c}alves

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Rafael S. Gonc{c}alves - Mark A. Musen

قواعد البيانات الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample---a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples---a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.

قيم البحث

81 - Rafael S. Gonc{c}alves , Martin J. OConnor , Marcos Martinez-Romero 2019

The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed--the CEDAR Workbench--is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.

قواعد البيانات المكتبات الرقمية

The ground truth about metadata and community detection in networks

87 - Leto Peel , Daniel B. Larremore , 2016

Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex systems components interact. This general task is called community detection in networks and is analogous to searchi ng for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because such networks links are formed explicitly based on those known communities. However, there are no planted communities in real world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. Here, we show that metadata are not the same as ground truth, and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structure.

الشبكات الاجتماعية والمعلومات تحليل البيانات والإحصاءات والاحتمال الفيزياء والمجتمع

Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings

86 - Rafael S. Gonc{c}alves , Maulik R. Kamdar , 2019

The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity---there are often many ways to represent the same type of information, such as a geographical lo cation via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly-represented metadata is to normalize the multiple, distinct field names used in metadata (e.g., lat lon, lat and long) to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.

الحساب واللغة استرجاع المعلومات

On Learning Sparsely Used Dictionaries from Incomplete Samples

106 - Thanh V. Nguyen , Akshay Soni , Chinmay Hegde 2018

Most existing algorithms for dictionary learning assume that all entries of the (high-dimensional) input data are fully observed. However, in several practical applications (such as hyper-spectral imaging or blood glucose monitoring), only an incompl ete fraction of the data entries may be available. For incomplete settings, no provably correct and polynomial-time algorithm has been reported in the dictionary learning literature. In this paper, we provide provable approaches for learning - from incomplete samples - a family of dictionaries whose atoms have sufficiently spread-out mass. First, we propose a descent-style iterative algorithm that linearly converges to the true dictionary when provided a sufficiently coarse initial estimate. Second, we propose an initialization algorithm that utilizes a small number of extra fully observed samples to produce such a coarse initial estimate. Finally, we theoretically analyze their performance and provide asymptotic statistical and computational guarantees.

التعلم الالي التعلم الآلي

Metadata Challenge for Query Processing Over Heterogeneous Wireless Sensor Network

304 - C.Komalavalli Chetna Laroiya (Jagan Insitute of Management Studies 2011

Wireless sensor networks become integral part of our life. These networks can be used for monitoring the data in various domain due to their flexibility and functionality. Query processing and optimization in the WSN is a very challenging task becaus e of their energy and memory constraint. In this paper, first our focus is to review the different approaches that have significant impacts on the development of query processing techniques for WSN. Finally, we aim to illustrate the existing approach in popular query processing engines with future research challenges in query optimization.

قواعد البيانات