Motivation: Capillary electrophoresis (CE) of nucleic acids is a workhorse technology underlying high-throughput genome analysis and large-scale chemical mapping for nucleic acid structural inference. Despite the wide availability of CE-based instruments, there remain challenges in leveraging their full power for quantitative analysis of RNA and DNA structure, thermodynamics, and kinetics. In particular, the slow rate and poor automation of available analysis tools have bottlenecked a new generation of studies involving hundreds of CE profiles per experiment. Results: We propose a computational method called high-throughput robust analysis for capillary electrophoresis (HiTRACE) to automate the key tasks in large-scale nucleic acid CE analysis, including the profile alignment that has heretofore been a rate-limiting step in the highest throughput experiments. We illustrate the application of HiTRACE on thirteen data sets representing four different RNAs, three chemical modification strategies, and up to 480 single mutant variants; the largest data sets each include 87,360 bands. By applying a series of robust dynamic programming algorithms, HiTRACE outperforms prior tools in terms of alignment and fitting quality, as assessed by measures including the correlation of quantified band intensities between replicate data sets. Furthermore, while the smallest of these data sets required 7 to 10 hours of manual intervention using prior approaches, HiTRACE quantitation of even the largest data sets herein was achieved in 3 to 12 minutes. The HiTRACE method therefore resolves a critical barrier to the efficient and accurate analysis of nucleic acid structure in experiments involving tens of thousands of electrophoretic bands.
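As an illustration of what dynamic-programming profile alignment can look like, the following is a minimal sketch of classic dynamic time warping between two one-dimensional intensity profiles. It is not HiTRACE's actual algorithm, which the abstract does not specify; the function name dtw_align and the toy profiles are assumptions made for this example only.

```python
def dtw_align(ref, query):
    """Align two 1-D profiles with classic dynamic time warping (DTW).

    Illustrative only: a generic DP alignment, not the HiTRACE method.
    Returns (total_cost, path), where path is a list of (ref_index,
    query_index) pairs from the start to the end of both profiles.
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning ref[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # match both samples
                cost[i - 1][j],      # stretch the query sample
                cost[i][j - 1],      # stretch the reference sample
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy profiles: the same peak, shifted by one sample.
reference = [0, 0, 1, 5, 1, 0, 0]
shifted = [0, 1, 5, 1, 0, 0, 0]
total_cost, warp_path = dtw_align(reference, shifted)
```

This table-filling approach takes time proportional to the product of the two profile lengths, so a practical tool would likely restrict the search to a band around the diagonal; that refinement is omitted here.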
To facilitate the analysis of large-scale high-throughput capillary electrophoresis data, we previously proposed a suite of efficient analysis software named HiTRACE (High Throughput Robust Analysis of Capillary Electrophoresis). HiTRACE has been used extensively for quantitating data from RNA and DNA structure mapping experiments, including mutate-and-map contact inference, chromatin footprinting, the EteRNA RNA design project and other high-throughput applications. However, HiTRACE is based on a suite of command-line MATLAB scripts that require nontrivial effort to learn, use, and extend. Here we present HiTRACE-Web, an online version of HiTRACE that includes standard features previously available in the command-line version as well as additional features such as automated band annotation and flexible adjustment of annotations, all via a user-friendly environment. By making use of parallelization, the online workflow is also faster than software implementations available to most users on their local computers. Free access: http://hitrace.org
One way to interject knowledge into clinically impactful forecasting is to use data assimilation, a nonlinear regression that projects data onto a mechanistic physiologic model instead of onto a set of functions, such as neural networks. Such regressions have the advantage of being useful with particularly sparse, non-stationary clinical data. However, physiological models are often nonlinear and can have many parameters, leading to potential problems with parameter identifiability, or the ability to find a unique set of parameters that minimizes forecasting error. The identifiability problems can be minimized or eliminated by reducing the number of parameters estimated, but reducing the number of estimated parameters also reduces the flexibility of the model and hence increases forecasting error. We propose a method, the parameter Houlihan, that combines traditional machine learning techniques with data assimilation to select the right set of model parameters to minimize forecasting error while reducing identifiability problems. The method worked well: the data assimilation-based glucose forecasts and estimates for our cohort using the Houlihan-selected parameter sets generally had lower forecasting errors than those obtained with other parameter selection methods, such as by-hand parameter selection. Nevertheless, the forecast with the lowest forecast error does not always accurately represent physiology, but further advancements of the algorithm provide a path for improving physiologic fidelity as well. Our hope is that this methodology represents a first step toward combining machine learning with data assimilation and provides a lower-threshold entry point for using data assimilation with clinical data by helping select the right parameters to estimate.
Scaffold-based drug discovery (SBDD) is a technique for drug discovery which pins chemical scaffolds as the framework of design. Scaffolds, or molecular frameworks, organize the design of compounds into local neighborhoods. We formalize scaffold-based drug discovery into a network design. Utilizing docking data from SARS-CoV-2 virtual screening studies and JAK2 kinase assay data, we showcase how a scaffold-based conception of chemical space is intuitive for design. Lastly, we highlight the utility of scaffold-based networks for chemical space as a potential solution to the intractable enumeration problem of chemical space by working inductively on local neighborhoods.
Interest in milk originating from donkeys is growing worldwide due to its claimed functional and nutritional properties, especially for sensitive population groups, such as infants with cow milk protein allergy. The current study aimed to assess the microbiological quality of donkey milk produced on a donkey farm in Cyprus using culture-based and high-throughput sequencing (HTS) techniques. The culture-based microbiological analysis showed very low microbial counts, while important food-borne pathogens were not detected in any sample. In addition, HTS was applied to characterize the bacterial communities of donkey milk samples. Donkey milk was mostly composed of: Gram-negative Proteobacteria, including Sphingomonas, Pseudomonas, Mesorhizobium and Acinetobacter; lactic acid bacteria, including Lactobacillus and Streptococcus; the endospore-forming Clostridium; and the environmental genera Flavobacterium and Ralstonia, detected in lower relative abundances. The results of the study support existing findings that donkey milk contains mostly Gram-negative bacteria. Moreover, they raise questions regarding the contribution (a) of antimicrobial agents (i.e., lysozyme, peptides) to shaping the microbial communities and (b) of the bacterial microbiota to the functional value of donkey milk.
Orchestrating parametric fitting of multicomponent spectra at scale is an essential yet underappreciated task in high-throughput quantification of materials and chemical composition. To automate the annotation process for spectroscopic and diffraction data collected in counts of hundreds to thousands, we present a systematic approach compatible with high-performance computing infrastructures using the MapReduce model and task-based parallelization. We implement the approach in software and demonstrate linear computational scaling with respect to spectral components using multidimensional experimental materials characterization datasets from photoemission spectroscopy and powder electron diffraction as benchmarks. Our approach enables efficient generation of high-quality data annotation and online spectral analysis and is applicable to a variety of analytical techniques in materials science and chemistry as a building block for closed-loop experimental systems.
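To make the MapReduce idea concrete, here is a minimal sketch in which each spectrum is fitted independently (the map step) and the per-spectrum results are merged into one annotation table (the reduce step). It is not the authors' software: the function names, the thread pool, and the simplification to a single Gaussian of known center and width (so the amplitude has a closed-form least-squares solution) are all assumptions made for illustration. Real multicomponent fitting is typically nonlinear and would use a dedicated optimizer.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def gaussian(x, center, width):
    """Unit-height Gaussian line shape (hypothetical model component)."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def fit_amplitude(task):
    """Map step: least-squares amplitude of a fixed-shape Gaussian.

    task is (name, xs, ys, center, width). With center and width held
    fixed, the best amplitude is the projection of ys onto the basis.
    """
    name, xs, ys, center, width = task
    basis = [gaussian(x, center, width) for x in xs]
    amplitude = sum(b * y for b, y in zip(basis, ys)) / sum(b * b for b in basis)
    residual = sum((y - amplitude * b) ** 2 for b, y in zip(basis, ys))
    return {name: (amplitude, residual)}


def merge(accumulated, result):
    """Reduce step: fold one per-spectrum result into the shared table."""
    accumulated.update(result)
    return accumulated


def fit_all(tasks, workers=4):
    """Run the map step in a thread pool, then reduce to one dictionary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(fit_amplitude, tasks)
        return reduce(merge, mapped, {})


# Synthetic spectra: the same line shape at amplitudes 1, 2 and 3.
grid = [0.5 * i for i in range(21)]
tasks = [
    (f"spectrum_{k}", grid, [k * gaussian(x, 5.0, 1.0) for x in grid], 5.0, 1.0)
    for k in (1, 2, 3)
]
annotations = fit_all(tasks)
```

Because each map task touches only its own spectrum, the same structure could in principle be handed to a process pool or a cluster scheduler without changing the fitting function; the thread pool is used here only to keep the sketch self-contained.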