Background SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. Findings The first version appeared online twelve years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. Conclusion Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed over a million times via Bioconda. The source code and documentation are available from http://www.htslib.org.
The Galaxy Zoo (GZ) project has provided quantitative visual morphologies for over a million galaxies, and has been part of a reinvigoration of interest in the morphologies of galaxies and what they reveal about galaxy evolution. Morphological information collected by GZ has proven to be a powerful tool for studying galaxy evolution, and GZ continues to collect classifications, currently serving imaging from DECaLS on its main site and running a variety of related projects hosted by the Zooniverse, the citizen science platform which came out of the early success of GZ. I highlight some of the results from the last twelve years, with a particular emphasis on linking morphology and dynamics, look forward to future projects in the GZ family, and provide a quick start guide for how you can easily make use of citizen science techniques to analyse your own large and complex data sets.
We discuss the origin of the optical variations in the narrow-line Seyfert 1 galaxy NGC 4051 and present the results of a cross-correlation study using X-ray and optical light curves spanning more than 12 years. The emission is highly variable in all wavebands, and the amplitude of the optical variations is found to be smaller than that of the X-rays, even after correcting for the contaminating host galaxy flux falling inside the photometric aperture. The optical power spectrum is best described by an unbroken power law model with slope $\alpha=1.4^{+0.6}_{-0.2}$ and displays lower variability power than the 2-10 keV X-rays on all time-scales probed. We find the light curves to be significantly correlated at an optical delay of $1.2^{+1.0}_{-0.3}$ days behind the X-rays. This time-scale is consistent with the light travel time to the optical emitting region of the accretion disc, suggesting that the optical variations are driven by X-ray reprocessing. We show, however, that a model whereby the optical variations arise from reprocessing by a flat accretion disc cannot account for all the optical variability. There is also a second significant peak in the cross-correlation function, at an optical delay of $39^{+2.7}_{-8.4}$ days. The lag is consistent with the dust sublimation radius in this source, suggesting that there is a measurable amount of optical flux coming from the dust torus. We discuss the origin of the additional optical flux in terms of reprocessing of X-rays and reflection of optical light by the dust.
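The lag measurement described in this abstract can be illustrated with a minimal sketch. It assumes evenly sampled light curves, which real monitoring data generally are not, so it is not the estimator used in the study. The synthetic "X-ray" and "optical" series, the 3-day delay, and the helper names `pearson` and `cross_correlation` are all hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def cross_correlation(driver, response, max_lag):
    """Map each integer lag to the correlation of driver[t] with response[t + lag].

    A positive lag means the response trails the driver.
    """
    n = len(driver)
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        # Keep only the samples where both series overlap at this lag.
        start = max(0, -lag)
        stop = min(n, n - lag)
        ccf[lag] = pearson(driver[start:stop], response[start + lag:stop + lag])
    return ccf


# Synthetic data (assumption): a red-noise driver sampled once per day,
# and a response that is a scaled, noisy copy delayed by `true_delay` days.
rng = random.Random(0)
n_days = 400
true_delay = 3

xray = []
level = 0.0
for _ in range(n_days):
    level = 0.9 * level + rng.gauss(0.0, 1.0)  # AR(1) red noise
    xray.append(level)

optical = [
    0.3 * xray[max(t - true_delay, 0)] + rng.gauss(0.0, 0.05)
    for t in range(n_days)
]

ccf = cross_correlation(xray, optical, max_lag=20)
best_lag = max(ccf, key=ccf.get)  # lag at which the correlation peaks
print(best_lag, round(ccf[best_lag], 3))
```

With unevenly sampled data, one would typically need interpolation or a binned estimator before a peak can be located, and the uncertainty on the lag needs its own treatment.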
The availability of genomic data is often essential to progress in biomedical research, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. As a result, several initiatives have been launched to experiment with synthetic genomic data, e.g., using generative models to learn the underlying distribution of the real data and generate artificial datasets that preserve its salient characteristics without exposing it. This paper provides the first evaluation of the utility and the privacy protection of six state-of-the-art models for generating synthetic genomic data. We assess the performance of the synthetic data on several common tasks, such as allele population statistics and linkage disequilibrium. We then measure privacy through the lens of membership inference attacks, i.e., inferring whether a record was part of the training data. Our experiments show that no single approach to generating synthetic genomic data yields both high utility and strong privacy across the board. The size and nature of the training dataset also matter. Moreover, while some combinations of datasets and models produce synthetic data with distributions close to the real data, there are often target data points that are vulnerable to membership inference. Looking forward, our techniques can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild and serve as a benchmark for future work.
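One simple way to picture an allele-statistics utility check of the kind mentioned above is to compare per-SNP allele frequencies between a real and a synthetic genotype matrix. This is a sketch under stated assumptions, not the paper's metric: genotypes are coded as 0/1/2 copies of the alternate allele, the toy matrices are invented, and the function names `allele_frequencies` and `mean_abs_frequency_gap` are hypothetical.

```python
def allele_frequencies(genotypes):
    """Per-SNP alternate-allele frequency for diploid genotypes coded 0/1/2.

    `genotypes` is a list of individuals, each a list with one entry per SNP.
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_snps)
    ]


def mean_abs_frequency_gap(real, synthetic):
    """Mean absolute difference in allele frequency across SNPs (0 = identical)."""
    freq_real = allele_frequencies(real)
    freq_synth = allele_frequencies(synthetic)
    gaps = [abs(a - b) for a, b in zip(freq_real, freq_synth)]
    return sum(gaps) / len(gaps)


# Toy data (assumption): 4 individuals x 3 SNPs.
real = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [1, 2, 0]]
synthetic = [[0, 1, 2], [0, 1, 0], [2, 0, 0], [2, 2, 2]]

print(allele_frequencies(real))
print(allele_frequencies(synthetic))
print(mean_abs_frequency_gap(real, synthetic))
```

A small gap only says that the marginal frequencies match. It says nothing about linkage disequilibrium or about whether individual training records are exposed, which is why utility and privacy are assessed separately.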
Much evolutionary information is stored in the fluctuations of protein length distributions. The genome size and non-coding DNA content can be calculated based only on the protein length distributions, so there is an intrinsic relationship between the coding DNA size and the non-coding DNA size. According to the correlations and quasi-periodicity of protein length distributions, we can classify life into three domains. Strong evidence is found to support the existence of order in the structures of protein length distributions.
We investigate the consequences of adopting the criteria used by the state of California, as described by Myers et al. (2011), for conducting familial searches. We carried out a simulation study of randomly generated profiles of related and unrelated individuals with 13-locus CODIS genotypes and YFiler Y-chromosome haplotypes, to which the Myers protocol for relative identification was applied. For Y-chromosome-sharing first-degree relatives, the Myers protocol has a high probability (80-99%) of identifying their relationship. For unrelated individuals, there is a low probability that an unrelated person in the database will be identified as a first-degree relative. For more distant Y-haplotype-sharing relatives (half-siblings, first cousins, half-first cousins or second cousins) there is a substantial probability that the more distant relative will be incorrectly identified as a first-degree relative. For example, there is a 3-18% probability that a first cousin will be identified as a full sibling, with the probability depending on the population background. Although the California familial search policy is likely to identify a first-degree relative if his profile is in the database, and it poses little risk of falsely identifying an unrelated individual in a database as a first-degree relative, there is a substantial risk of falsely identifying a more distant Y-haplotype-sharing relative in the database as a first-degree relative, with the consequence that their immediate family may become the target of further investigation. This risk falls disproportionately on those ethnic groups that are currently overrepresented in state and federal databases.