DP-DNA: A Digital Pattern-Aware DNA Storage System to Improve Encoding Density

63 0 0.0 ( 0 )

Download Cite

Added by Bingzhe Li

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Bingzhe Li - Li Ou - David Du

Emerging Technologies

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

With the rapid increase of available digital data, DNA storage is identified as a storage media with high density and capability of long-term preservation, especially for archival storage systems. However, the encoding density (i.e., how many binary bits can be encoded into one nucleotide) and error handling are two major factors intertwined in DNA storage. Considering encoding density, theoretically, one nucleotide can encode two binary bits (upper bound). However, due to biochemical constraints and other necessary information associated with payload, the encoding densities of various DNA storage systems are much less than this upper bound. Additionally, all existing studies of DNA encoding schemes are based on static analysis and really lack the awareness of dynamically changed digital patterns. Therefore, the gap between the static encoding and dynamic binary patterns prevents achieving a higher encoding density for DNA storage systems. In this paper, we propose a new Digital Pattern-Aware DNA storage system, called DP-DNA, which can efficiently store digital data in DNA storage with high encoding density. DP-DNA maintains a set of encoding codes and uses a digital pattern-aware code (DPAC) to analyze the patterns of a binary sequence for a DNA strand and selects an appropriate code for encoding the binary sequence to achieve a high encoding density. An additional encoding field is added to the DNA encoding format, which can distinguish the encoding scheme used for those DNA strands, and thus we can decode DNA data back to its original digital data. Moreover, to further improve the encoding density, a variable-length scheme is proposed to increase the feasibility of the coding scheme with a high encoding density. Finally, the experimental results indicate that the proposed DP-DNA achieves up to 103.5% higher encoding densities than prior work.

rate research

Single-Read Reconstruction for DNA Data Storage Using Transformers

608 - Yotam Nahum , Eyar Ben-Tolila , Leon Anavy 2021

As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential solution for the future of data storage. Several studies introduced DNA based storage systems with high information density (petabytes/gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence and result in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to its repurposing for solving a variety of tasks across various domains. In this work, we propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA based data storage. We address the error correction process as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate our proposed approach, we encode text, image and code-script files to DNA, produce errors with high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand compared to state-of-the-art algorithms using 2-3 copies. This is the first demonstration of using deep learning models for single-read reconstruction in DNA based storage which allows for the reduction of the overall cost of the process. We show that this approach is applicable for various domains and can be generalized to new domains as well.

Emerging Technologies Machine Learning Genomics

Graph DNA: Deep Neighborhood Aware Graph Encoding for Collaborative Filtering

81 - Liwei Wu , Hsiang-Fu Yu , Nikhil Rao 2019

In this paper, we consider recommender systems with side information in the form of graphs. Existing collaborative filtering algorithms mainly utilize only immediate neighborhood information and have a hard time taking advantage of deeper neighborhoods beyond 1-2 hops. The main caveat of exploiting deeper graph information is the rapidly growing time and space complexity when incorporating information from these neighborhoods. In this paper, we propose using Graph DNA, a novel Deep Neighborhood Aware graph encoding algorithm, for exploiting deeper neighborhood information. DNA encoding computes approximate deep neighborhood information in linear time using Bloom filters, a space-efficient probabilistic data structure and results in a per-node encoding that is logarithmic in the number of nodes in the graph. It can be used in conjunction with both feature-based and graph-regularization-based collaborative filtering algorithms. Graph DNA has the advantages of being memory and time efficient and providing additional regularization when compared to directly using higher order graph information. We conduct experiments on real-world datasets, showing graph DNA can be easily used with 4 popular collaborative filtering algorithms and consistently leads to a performance boost with little computational and memory overhead.

Machine Learning Artificial Intelligence Machine Learning

Training DNA Perceptrons via Fractional Coding

190 - Xingyi Liu , Keshab K. Parhi 2019

This paper describes a novel approach to synthesize molecular reactions to train a perceptron, i.e., a single-layered neural network, with sigmoidal activation function. The approach is based on fractional coding where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. In prior work, a DNA perceptron with bipolar inputs and unipolar output was proposed for inference. The focus of this paper is on synthesis of molecular reactions for training of the DNA perceptron. A new molecular scaler that performs multiplication by a factor greater than 1 is proposed based on fractional coding. The training of the perceptron proposed in this paper is based on a modified backpropagation equation as the exact equation cannot be easily mapped to molecular reactions using fractional coding.

Emerging Technologies Quantitative Methods

Encoding DNA sequences by integer chaos game representation

63 - Changchuan Yin 2017

DNA sequences are fundamental for encoding genetic information. The genetic information may not only be understood by symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose an integer chaos game representation (iCGR) of DNA sequences and a lossless encoding method DNA sequences by the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. Then the DNA sequence can be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence compression, encryption, and steganography. The Python programs in this study are freely available to the public at https://github.com/cyinbox/iCGR

Computational Engineering Other Quantitative Biology

Molecular and DNA Artificial Neural Networks via Fractional Coding

93 - Xingyi Liu , Keshab K. Parhi 2019

This paper considers implementation of artificial neural networks (ANNs) using molecular computing and DNA based on fractional coding. Prior work had addressed molecular two-layer ANNs with binary inputs and arbitrary weights. In prior work using fractional coding, a simple molecular perceptron that computes sigmoid of scaled weighted sum of the inputs was presented where the inputs and the weights lie between [-1, 1]. Even for computing the perceptron, the prior approach suffers from two major limitations. First, it cannot compute the sigmoid of the weighted sum, but only the sigmoid of the scaled weighted sum. Second, many machine learning applications require the coefficients to be arbitrarily positive and negative numbers that are not bounded between [-1, 1]; such numbers cannot be handled by the prior perceptron using fractional coding. This paper makes four contributions. First molecular perceptrons that can handle arbitrary weights and can compute sigmoid of the weighted sums are presented. Thus, these molecular perceptrons are ideal for regression applications and multi-layer ANNs. A new molecular divider is introduced and is used to compute sigmoid(ax) where a > 1. Second, based on fractional coding, a molecular artificial neural network (ANN) with one hidden layer is presented. Third, a trained ANN classifier with one hidden layer from seizure prediction application from electroencephalogram is mapped to molecular reactions and DNA and their performances are presented. Fourth, molecular activation functions for rectified linear unit (ReLU) and softmax are also presented.

Emerging Technologies Quantitative Methods