[go: up one dir, main page]

CN113345521A - Coding and recovering method using large fragment DNA storage - Google Patents

Coding and recovering method using large fragment DNA storage Download PDF

Info

Publication number
CN113345521A
CN113345521A CN202110601792.8A CN202110601792A CN113345521A CN 113345521 A CN113345521 A CN 113345521A CN 202110601792 A CN202110601792 A CN 202110601792A CN 113345521 A CN113345521 A CN 113345521A
Authority
CN
China
Prior art keywords
data
sequence
replication
origin
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110601792.8A
Other languages
Chinese (zh)
Inventor
陈为刚
郭健
葛奇
韩昌彩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110601792.8A priority Critical patent/CN113345521A/en
Publication of CN113345521A publication Critical patent/CN113345521A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

本发明公开了一种采用大片段DNA存储的编码与恢复方法,属于DNA数据存储领域,本发明将二进制数据加扰后分为若干子块、对每个子块采用非常高码率的分组纠错码编码后,不同的子块进行符号交织并经过“符号/比特-碱基序列”的映射过程转码为数据DNA单元,将其与复制起始位点交替镶嵌,构成人工染色体序列;数据读出时,利用二代高通量测序,组合了读段从头组装、已知的生物序列如启动子序列导引的重叠群合并、高效纠错纠删译码,实现数据可靠恢复。本发明的优势在于,实现了交织的分组纠错码编码方法、大片段DNA逻辑结构、二代高通量测序、从头组装方法的多要素匹配,开销非常小,并且充分利用了高码率纠错码的纠错能力,可实现非常高的碱基逻辑密度。

Figure 202110601792

The invention discloses a coding and recovery method using large-segment DNA storage, belonging to the field of DNA data storage. The invention divides binary data into several sub-blocks after scrambled, and adopts very high code rate grouping error correction for each sub-block. After coding, different sub-blocks are symbol-interleaved and transcoded into data DNA units through the “symbol/bit-base sequence” mapping process, which is alternately mosaicked with the origin of replication to form an artificial chromosome sequence; data read At the time of publication, second-generation high-throughput sequencing was used to combine read de novo assembly, contig merging guided by known biological sequences such as promoter sequences, and efficient error-correction and erasure-correction decoding to achieve reliable data recovery. The advantage of the present invention is that it realizes the multi-element matching of the interleaved packet error correction code coding method, the large segment DNA logic structure, the second-generation high-throughput sequencing, and the de novo assembly method, with very low overhead, and makes full use of the high code rate correction method. The error correction ability of the error code can realize a very high base logic density.

Figure 202110601792

Description

Coding and recovering method using large fragment DNA storage
Technical Field
The invention relates to the field of DNA data storage, in particular to a coding and recovery method adopting large-fragment DNA storage.
Background
With the rapid development of contemporary information technology, the global data volume exhibits explosive growth. According to the forecast of international data corporation, the global data volume will increase from 44ZB in 2020 to 175ZB in 2025, and the existing storage methods have not been able to satisfy the human demand for storing massive digital information. Currently, storage media are used which mainly include: optical disc, magnetic tape, flash memory, etc., the storage mode mainly includes: computer-based stand-alone storage, network storage, distributed storage, cloud storage, and the like. However, there are many challenges with existing storage media and storage methods. First, the density of storage media is approaching a limit, making it difficult to meet the rapidly increasing demand for data. Second, the currently established data storage centers face many challenges of size, cost, power consumption, security, and the like. Third, the effective storage time of the conventional storage medium is short, generally within several decades, for example, the effective storage time of a magnetic tape is 30-50 years, which cannot meet the long-term storage and backup requirements of important data, and the storage medium such as a magnetic tape and an optical disc is easily damaged, resulting in the loss of stored data.
Artificially synthesized deoxyribonucleic acid (DNA) is used as a potential data storage medium, has high storage density, long service life and low storage energy consumption, and is expected to become one of important choices for storing mass offline data in the future. The american society for semiconductor industry and semiconductor research corporation promulgated "semiconductor decade program" in 2021, listed DNA data storage as one of the major storage modes for large amounts of data in parallel with hard disks, solid state disks, and magnetic tapes, and became an important direction for global storage industry competition in the future. The DNA storage mainly uses 4 different deoxyribonucleotides of oligonucleotide, namely A (adenine), T (thymine), G (guanine) and C (cytosine) to represent data, and the storage of mass data through artificially synthesized DNA sequences or oligonucleotide molecules is an emerging data storage mode. With the increasing data volume generated in human production and life, the importance of data storage is also highlighted, and DNA data storage becomes a large-scale mass data storage mode with great potential in the future.
DNA data storage data was stored using { A, T, G, C } of oligonucleotide molecules. The current modes of DNA data storage mainly include: short fragment oligonucleotide pool (Oligo pool) storage, intracellular DNA storage, and the like. The oligonucleotide pool storage of short fragments develops rapidly by means of high-throughput chip synthesis and sequencing technology of DNA, but still has great challenges in large-scale equilibrium amplification and replication cost. The method has the advantages that intracellular DNA data storage, particularly intracellular large-fragment DNA storage, short DNA fragments are assembled into long fragments by means of an in-vivo assembly method, efficient amplification is achieved by means of in-vivo replication, replication cost is low, and the method has potential application value in scenes such as large-scale data distribution and the like.
The logical structure design of large-fragment DNA data storage is different from the data storage based on oligonucleotide pool (Oligo pool), the ratio of index and primer (or similar units, such as skeleton in yeast artificial chromosome) is relatively low, and the logical structure design has certain advantages in base utilization rate. The data reading phase requires de novo (de novo) assembly of sequencing reads, similar to de novo (de novo) sequencing of the genome of new species. The reading of large-fragment DNA data storage has high requirement on algorithm real-time performance, and the traditional bioinformatics processing flow is not suitable. Aiming at the characteristic, when the data is recovered, a completely error-free DNA sequence is not required to be obtained in the step of reading assembly, and residual errors after assembly are corrected by using an error correcting code, so that a perfect DNA sequence without any base errors is obtained. Therefore, how to design a proper error correction coding scheme to ensure the reliability of large-fragment DNA data storage becomes an important issue.
The application of error correction coding technology in the field of DNA data storage is an important means for reducing writing cost and ensuring data reliability, and is a very potential technology. On one hand, due to the limitation of biochemical conditions, the error rate of oligonucleotide molecule synthesis is high, and generally, one synthesis error occurs in hundreds of bases; if the accuracy of oligonucleotide synthesis is further improved, the yield of oligonucleotide molecules is reduced, and the cost is greatly increased. Therefore, the adoption of an efficient error correction coding technique is an important method for reducing the writing cost and ensuring the data reliability.
However, according to the fundamental theory of channel capacity and channel coding of shannon information theory, the error correcting code needs to be matched with the error characteristics of writing/reading, and only reliable and efficient data storage can be realized. Several important error correcting codes in the field of digital communication have been tried in vitro DNA data storage. For example, digital fountain codes are used to correct deletion errors caused by oligonucleotide molecule loss, reed-solomon (RS) codes to correct base deletions and random errors, and product codes composed of Low Density Parity Check (LDPC) codes and RS codes to correct deletions and random errors, but these error correction coding methods do not fully match the error characteristics of writing/reading, because they do not use the characteristic that the error characteristics of a plurality of bits corresponding to each base symbol are correlated. The coding method for storing the in-vivo large-fragment DNA adopts the watermark code consisting of the LDPC code and the pseudorandom sequence, and aims at the high error rate of the third-generation nanopore sequencing, and the base insertion/deletion error which is difficult to process is mainly considered. The coding efficiency of the method is low, 1.19bit/bp, and the theoretical limit density of 2bit/bp of information represented by 4 bases { A, T, G, C } is still large in difference.
Disclosure of Invention
The invention provides a coding and recovery method by using large-segment DNA storage, which realizes the multi-element matching of interleaved grouped error correcting code coding, large-segment DNA logic structure, next-generation high-throughput sequencing and de novo assembly method, has very high base logic density, and is described in detail as follows:
a method of encoding and recovering using large fragment DNA storage, the method comprising:
(1) scrambling binary data, and uniformly processing P sub-blocks as a processing unit, wherein the size of each sub-block is k bits; each sub-block is grouped, error correcting code coding is carried out, different sub-blocks are interleaved, transcoding is carried out through the mapping process of 'symbol/bit-base sequence' to obtain P data DNA sequences, and the P data DNA sequences and P-1 replication starting site sequences are alternately embedded to form an artificial chromosome sequence;
(2) transferring the artificial chromosome into cells, and performing second-generation high-throughput sequencing after cell proliferation and nucleic acid extraction processes;
(3) assembling the sequencing read from the head to obtain a group of contigs, determining the positions of the data segments on the contigs according to the known replication start site, intercepting partial sequencing read corresponding to the data segments, then carrying out majority merging to obtain P data segments, deinterleaving the data segments, and sending the data segments to a decoder for decoding to obtain an information sequence.
The step (1) is specifically as follows:
(1.1) superimposing binary data with the length of N with a known pseudo-random sequence to realize scrambling, dividing the binary data into P sub-blocks, wherein the length of each sub-block is k, k is N/P, P is a positive integer and is a factor of N, and N/P is an integer;
(1.2) coding each sub-block by adopting grouped error correcting code coding, wherein the information bit length of the error correcting code is k bits, and the length of each sub-block after coding is n bits;
(1.3) sequencing the encoded subblocks according to a P row mode, wherein each row comprises n bits, decomposing the n bits into a plurality of groups of P-P bit or symbol units, and interweaving each unit respectively according to a shift cycle mode; if the bit cells are, the number of the bit cells is n; if the processing is carried out according to the symbols, the number of the symbols corresponding to n bits is n/s, and each symbol comprises s bits;
(1.4) obtaining P bit groups by the interleaved subblocks, wherein the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with the length of n;
(1.5) bit grouping according to the mapping rule: 00 → A, 01 → T, 10 → G, 11 → C to be transcoded into P data DNA sequences of length n/2.
The technical scheme provided by the invention has the beneficial effects that:
1. the coding method provided by the invention realizes that the data file is coded into a plurality of DNA data units, the DNA data units are further alternately embedded and combined with the replication starting site, a flexible intracellular DNA data storage structure is constructed, and the storage of data in large-fragment DNA is realized;
2. the invention is based on the second generation high-throughput sequencing, adopts the proposed data recovery method, and can realize reliable and high-efficiency DNA data reading;
3. the invention realizes the multi-element matching of the interleaved grouped error correcting code coding, the large-fragment DNA logic structure, the next generation high-throughput sequencing and the de novo assembly method, has very low overhead, fully utilizes the error correcting capability of the high-code-rate error correcting code and can realize very high base logic density.
Drawings
FIG. 1 is a block diagram of a system for implementing the solution of the present invention;
FIG. 2 is a flow chart of the encoding of a packet error correction code in the present invention;
FIG. 3 is a flowchart of recovery in data read-out according to the present invention;
FIG. 4 is a schematic diagram of a computer-based long DNA fragment data storage simulation platform according to the present invention;
FIG. 5 is a graph of the distribution of the number of base errors (edit distance) of simulated reads in accordance with the present invention;
FIG. 6 is a graph showing a distribution of the number of base errors (edit distance) of real reads according to the present invention;
FIG. 7 is a graph of the base error location distribution of simulated reads according to the present invention;
FIG. 8 is a base error location distribution diagram of real reads according to the present invention;
FIG. 9 is a schematic diagram showing the analysis of the base deletion result in each experiment in the present invention;
FIG. 10 is a schematic diagram showing analysis of a base error result in each experiment in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Aiming at the error characteristics that the large-fragment DNA data storage in cells is adopted and the information coding method is suitable for sequencing and reading assembly, the embodiment of the invention provides a coding and recovery method adopting the large-fragment DNA storage. The binary data is divided into a plurality of sub-blocks after being scrambled, after each sub-block is coded by adopting a grouped error correcting code, different sub-blocks are subjected to symbol interleaving and are transcoded into data DNA units through the mapping process of 'symbol-bit-base sequence', and the data DNA units and the replication start sites are alternately inlaid to form an artificial chromosome sequence; when data is read out, the second-generation high-throughput sequencing is utilized, and read segment de novo (de novo) assembly, copy initial site guided contig merging and efficient error correction erasure decoding are combined, so that reliable data recovery is realized.
The method provided by the invention not only can realize reliable storage of data, but also realizes multi-element matching of an interleaved grouped error correction code encoding method, a large-fragment DNA logic structure, next-generation high-throughput sequencing and a de novo assembly method, has very high base logic density, and fully explains the advantages of large-fragment DNA storage. In addition, the method provided by the invention can construct a flexible intracellular DNA data structure and meet the requirement of designing a data file with the size of hundreds of Kb to Mb.
The invention provides a coding and recovery method adopting large-segment DNA storage, which aims at the field of DNA data storage and is basically characterized in that binary data are scrambled, subjected to grouped error correction code coding and symbol interleaving, mapped to obtain a data DNA unit, the data DNA unit and an ARS sequence are alternately embedded to obtain an artificial chromosome, and after second-generation high-throughput sequencing, read segment de novo assembly, ARS guided contig combination and efficient error correction erasure correction decoding are combined to realize reliable data recovery. The invention can ensure the reliable storage of data and has very high base logic density.
The specific embodiment of the invention is given below, the embodiment designs and simulates an 2,489,847bp circular artificial chromosome, a 589KB picture and a 2.65KB txt text are stored in the circular artificial chromosome by using a plurality of RS code symbol interleaving coding methods with extremely high code rates, and the total logic density reaches 1.947bit/bp, which is higher than the current main design method and is very close to the theoretical 2 bit/bp. Wherein the selected RS code is defined in the finite field GF (2)12) The RS code above (4095, 4040, t ═ 27), the length of the information bit portion is 4040 symbols, the code rate is 0.987, and each symbol comprises 12 bits.
The coding method for interleaving the RS codes with a plurality of extremely high code rates specifically comprises the following steps:
(1.1) superposing binary data with a known pseudo-random sequence to realize scrambling, and dividing the scrambled binary data into 100 subblocks with the same size;
(1.2) encoding each subblock by adopting RS code encoding, wherein the information bit length is 4040 symbols, and the length of each subblock after encoding is 4095 symbols;
(1.3) sorting the coded sub-blocks in a column mode, then decomposing the coded sub-blocks into a plurality of groups of 100 x 100 units, and interweaving each unit in a diagonal cycle mode;
(1.4) 4095 symbols per data segment, converted to 49,140 bits, as a basic packet;
(1.5) adjacent two bits in each packet are transcoded to 1 base according to the mapping rule of 00 → A, 01 → T, 10 → G, 11 → C. Each data segment is converted into 24,570bp DNA data sequence;
(1.6) alternately inlaying 100 DNA data segments with the length of 24,570bp and 99 ARSs with shorter lengths, and further adding a vector skeleton sequence to obtain a circular artificial chromosome with the length of 2,489,847 bp.
In the earlier stage, the embodiment of the invention designs and assembles an 254,886bp yeast artificial chromosome, adopts second-generation high-throughput sequencing to obtain a group of real sequencing data, and takes the group of real sequencing data as reference to carry out sequencing simulation on the artificial chromosome, and the specific steps are as follows:
(2.1) sending the real sequencing data into second-generation sequencing simulation software ART for parameter training to obtain a configuration simulation file;
and (2.2) inputting the configuration simulation file, the artificial chromosome sequence and other parameters into second-generation sequencing simulation software for sequencing simulation. The coverage selected in this embodiment is 20x, 25x, 30x, and each coverage generates 10 sets of independent test data with ART to obtain the corresponding sequencing read of the double-end read PE 150.
The specific recovery method during data reading is as follows:
(3.1) assembling the sequencing reads by using second-generation sequence assembling software Velvet and ABySS based on a Debrelain map to obtain a group of contigs;
(3.2) identifying ARS sequences in each contig, determining the position of a data read according to the ARS sequences and intercepting a corresponding part of sequencing reads;
(3.3) carrying out large number combination on the part of the sequencing reads corresponding to each intercepted data read to obtain a combined sequence of 100 data reads, and further converting the combined sequence into a symbol sequence;
(3.4) de-interleaving the 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;
and (3.5) sending the 100 code words obtained by de-interleaving into a decoder for block error correction erasure decoding to obtain an information sequence.
The assembly of the second-generation sequencing reads is realized by using second-generation sequence assembly software such as Velvet or ABySS based on a Debrelain atlas under a plurality of k-mer values with different lengths, and a group of contigs is obtained specifically as follows:
(1) assembling the second-generation sequencing reads by Velvet or AbySS under the k-mer value with fixed length to obtain a group of contigs;
(2) assembling second-generation sequencing reads by Velvet or AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs;
(3) and assembling the second-generation sequencing reads by Velvet and AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs.
The specific steps of identifying the ARS sequence in each contig, determining the position of the data read according to the ARS sequence and intercepting the corresponding partial sequencing read are as follows:
(4.1) respectively comparing the ARS sequences subjected to forward and reverse complementation with the contig for an editing distance, wherein the editing distance is less than 20% of the length of the ARS sequence, judging that the ARS exists, and recording the state of the ARS at the moment and the position of the ARS on the contig;
(4.2) if the ARS is positive, placing sequencing reads on two sides of the ARS into a cache region corresponding to the data segment; if the ARS is reversed and complemented, the sequencing reads on the two sides of the ARS are reversed and complemented and then are placed into a cache region corresponding to the data segment;
(4.3) repeating steps (4.1) and (4.2) until all reads containing ARS sequences (or partial ARS sequences) are fully marked and assigned.
The method comprises the following specific steps of carrying out majority merging on partial sequencing reads corresponding to each intercepted data read to obtain a merged sequence of P data reads, and further converting the merged sequence into a symbol sequence:
and (5.1) the length of the data segment is n, and an x multiplied by n data block is constructed by assuming that the number of partial sequencing reads in the cache region corresponding to the data segment is x. Putting the intercepted sequencing read into a data block, and if the length of the intercepted data segment is less than n, completing by using 'X';
(5.2) performing majority judgment on the base at each position i of the filling block, namely taking one base with the largest number in { A, T, G, C } at each position in the filling block as the base at the position of the splicing sequence, marking the base as 'X' if the base is not filled in the position, obtaining a data segment with the length of n, and further converting the data segment into a symbol sequence.
The specific steps of sending the data segments into a decoder for decoding after deinterleaving to obtain an information sequence are as follows:
(6.1) de-interleaving 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;
and (6.2) performing error correction and erasure correction decoding on 100 code words of the RS code obtained by de-interleaving to obtain data segments.
First, as shown in FIG. 4, the present invention establishes a computer-based long DNA fragment data storage simulation platform. In the simulation experiment, ART software was chosen to generate sequencing reads to simulate the sequencing process. Although direct synthesis and sequencing experiments are not carried out, the method is calibrated and verified by utilizing early wet experimental data, so that a simulation result has better reliability, the fusion of the wet experiments and simulation design is realized to a certain extent, and the simulation process is more reasonable.
Further, the present invention analyzes the simulated sequencing data and the end-to-end storage recovery performance. FIGS. 5, 6, 7 and 8 show the comparison of the generated simulated sequencing data with the 254Kb artificial chromosome sequencing data obtained in the previous experiment. FIGS. 5 and 6 compare the number of sequencing errors in reads. As can be seen from this figure, it is,the number of erroneous reads is around 20%. The quality of the simulated reads is slightly worse than the measured data, which better explains the error correction capability of the method. FIGS. 7 and 8 compare sequencing in reads versus position in reads. As can be seen, the second generation sequencing reads contain insertion, deletion and substitution errors, and the error rate is 10-4~10-3Left and right. The error rate of insertions and deletions is significantly lower than the substitution error probability. It can be seen from the figure that the generated sequencing data features are very consistent with the actual second generation sequencing data features, which indicates that the method of simulation for generating sequencing data has better feasibility.
Then, the present embodiment performs simulation test and analysis on the encoding scheme and the data recovery scheme under different coverage degrees. In the test, coverage degrees of 20x, 25x and 30x are selected, each coverage degree generates 10 groups of independent test data by ART, and 10 independent parallel assembling and decoding experiments are carried out. The assembly software selected was Velvet and ABySS, several different k-mer values were selected for each assembly method, and the specific simulation results are given in Table 1. Fig. 9 and 10 give an illustration of the assembled error behavior at different sequencing coverage, confirming the rationality of the proposed multiple RS code interleaving and performing erasure correction schemes. As can be seen from FIGS. 9 and 10, as sequencing coverage increases, the performance of the multi-software multi-parameter assembly method improves, the number of residual errors and deletions decreases rapidly, and reliable recovery can be achieved. At a sequencing coverage of 20x, the error rates of the Velvet and ABySS assemblies alone are both at the edge of the error correction capability of the scheme (correcting 1.34% of symbol erasures, or correcting 0.66% of symbol errors), and there are cases where data recovery fails, as detailed in table 1. Given the large variability of the assembly and contig merging strategy, fig. 9 and 10 show the specific results of each experiment (e.g., "x" and "□" in the figure).
TABLE 1 specific data recovery analysis with interleaving of multiple RS codes
Figure BDA0003093176260000081
Figure BDA0003093176260000091
From the simulation results in table 1, it can be seen that when the coverage is 25x and 30x, 10 parallel experiments of all schemes are successfully decoded, and the robustness of the encoding method and the data recovery method is verified. Meanwhile, compared with the results of adopting a single k-mer, the multi-method and multi-k-mer have the advantages that the performance of the read assembly is obviously improved, and the fragment deletion and the errors are obviously reduced. Table 1 lists only the error conditions of large segments, and these errors after assembling and ARS identification are interleaved, so that the error correction capability of multiple high-rate RS codes can be fully utilized to obtain a very high success rate, and data recovery is successful under 25x and 30x in experimental tests. Table 1 also lists all cases where each data segment cannot be successfully decoded by independently encoding it with RS codes of the same parameters without using the interleaving scheme. If multiple RS code interleaving is not adopted, each data block is encoded by using a separate RS code, and since the RS code with a high code rate has limited error correction and erasure correction capability (correcting Nerasure 55 erasures, or Nerror 27 errors, or 2 Nerror + Nerasure 55), there may be a case where some segments cannot be decoded correctly, and data cannot be completely recovered.
Under 20 × sequencing data, contigs assembled with single k-mers had high error rates and could not be decoded correctly, and the data size in this section was large and not listed in table 1. However, the case of combining multiple k-mer values still achieves better performance. Firstly, Velvet software multi-k-mer assembly is adopted, 10 independent experiments are carried out, decoding failure after de-interleaving occurs only for the third time, and decoding is correct under other conditions. Furthermore, in the mixed assembly of Velvet and ABySS, all 10 groups of independent experiments obtain gains, and all decoding succeeds after de-interleaving. Namely, the scheme of the invention is verified to realize reliable data storage and have very high base logic density.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1.一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述方法包括:1. a coding and recovery method using large fragment DNA storage, is characterized in that, described method comprises: (1)将二进制数据加扰,P个子块作为一个处理单元进行统一处理,每个子块的大小是k比特;每个子块分组纠错码编码,不同的子块进行交织,并经过“符号/比特-碱基序列”的映射过程转码为P个数据DNA序列,并将其与P-1个复制起始位点序列交替镶嵌,构成人工染色体序列;(1) Scrambling the binary data, the P sub-blocks are processed uniformly as a processing unit, and the size of each sub-block is k bits; The mapping process of "bit-base sequence" is transcoded into P data DNA sequences, which are alternately mosaicked with P-1 origin of replication sequences to form artificial chromosome sequences; (2)将人工染色体转入细胞,经过细胞增殖、核酸提取过程后,进行二代高通量测序;(2) The artificial chromosome is transferred into the cell, and after the process of cell proliferation and nucleic acid extraction, second-generation high-throughput sequencing is performed; (3)将测序读段从头组装,得到一组重叠群,根据已知的复制起始位点确定数据段在重叠群上的位置,截取数据段对应的部分测序读段后进行大数合并得到P个数据段,将数据段解交织后送入译码器进行译码得到信息序列。(3) Assemble the sequencing reads from scratch to obtain a set of contigs, determine the position of the data segment on the contig according to the known origin of replication, cut off part of the sequencing reads corresponding to the data segment, and perform large number merging to obtain The P data segments are deinterleaved and then sent to a decoder for decoding to obtain an information sequence. 2.根据权利要求1所述的一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述步骤(1)具体为:2. a kind of coding and recovery method adopting large fragment DNA storage according to claim 1, is characterized in that, described step (1) is specially: (1.1)长度为N的二进制数据在与已知的伪随机序列叠加实现扰码,分为P个子块,每个子块长度为k,其中,k=N/P,P为正整数,是N的因子,N/P为整数;(1.1) The binary data of length N is superimposed with a known pseudo-random sequence to realize scrambling code, and divided into P sub-blocks, each sub-block length is k, where k=N/P, P is a positive integer, is N The factor of , N/P is an integer; (1.2)采用分组纠错码编码对每个子块进行编码,纠错码的信息位长度为k比特,编码后每个子块长度为n比特;(1.2) each sub-block is encoded by using block error correction code encoding, the information bit length of the error correction code is k bits, and the length of each sub-block after encoding is n bits; (1.3)编码后的子块按照P行的方式进行排序,每一行包含n个比特,将其分解为若干组P*P的比特或者符号单元,对每一个单元分别按照移位循环的方式进行交织;如果是比特单元,则比特单元的数量是n;如果是按照符号进行处理,则是n比特所对应符号数n/s,每个符号包括s个比特;(1.3) The encoded sub-blocks are sorted in the manner of P lines, each line contains n bits, which are decomposed into several groups of P*P bits or symbol units, and each unit is performed in a shift cycle manner. Interleaving; if it is a bit unit, the number of bit units is n; if it is processed according to symbols, it is the number of symbols corresponding to n bits n/s, and each symbol includes s bits; (1.4)交织后的子块得到P个比特分组,每个分组的大小也是n个比特或者n/s符号,符号分组也转化为比特分组长度为n;(1.4) The interleaved sub-blocks obtain P bit groups, the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with a length of n; (1.5)比特分组按照映射规则:00→A,01→T,10→G,11→C转码为P个长度为n/2的数据DNA序列。(1.5) The bit grouping is transcoded into P data DNA sequences of length n/2 according to the mapping rules: 00→A, 01→T, 10→G, 11→C. 3.根据权利要求1所述的一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述步骤(3)具体为:3. a kind of coding and recovery method adopting large fragment DNA storage according to claim 1, is characterized in that, described step (3) is specially: (2.1)利用基于德布莱英图的二代序列组装程序将测序读段进行组装,得到一组重叠群;(2.1) Assembling the sequenced reads using the second-generation sequence assembly program based on the DeBrain map to obtain a set of contigs; (2.2)识别每个重叠群中的复制起始位点序列,根据复制起始位点序列的位置确定数据读段位置并截取对应的测序数据读段;(2.2) Identify the replication origin sequence in each contig, determine the position of the data read segment according to the position of the replication origin sequence, and intercept the corresponding sequencing data read segment; (2.3)对截取后的每一个数据读段所对应的部分测序读段进行大数合并,得到P条数据读段的合并序列;(2.3) performing large number merging on the partial sequencing reads corresponding to each intercepted data read to obtain a combined sequence of P data reads; (2.4)根据分组交织顺序对P个数据段进行解交织,得到P个码字,每个码字存在比特或符号的错误与删除;(2.4) de-interleave P data segments according to the grouping interleaving sequence to obtain P code words, and each code word has errors and deletions of bits or symbols; (2.5)将解交织得到P个码字送入译码器进行分组纠错纠删译码,得到信息序列。(2.5) The P codewords obtained by deinterleaving are sent to the decoder for packet error correction and erasure correction decoding to obtain an information sequence. 4.根据权利要求3所述的一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述步骤(2.1)具体有三种组装方法,分别为:4. a kind of coding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.1) specifically has three kinds of assembly methods, respectively: (1)使用相同的二代序列组装软件Velvet在某个固定长度的k-mer值下实现二代测序读段的组装,得到一组重叠群;(1) Use the same second-generation sequence assembly software Velvet to assemble the second-generation sequencing reads at a fixed-length k-mer value to obtain a set of contigs; (2)使用相同的二代组装软件在多个不同长度的k-mer值下实现二代测序读段的组装,合并得到的多组重叠群;(2) Use the same second-generation assembly software to assemble the second-generation sequencing reads under multiple k-mer values of different lengths, and merge the obtained groups of contigs; (3)使用不同的二代组装软件在多个不同长度的k-mer值下实现二代测序读段的组装,合并得到的多组重叠群。(3) Use different second-generation assembly software to assemble the second-generation sequencing reads under multiple k-mer values of different lengths, and merge the obtained multiple groups of contigs. 5.根据权利要求3所述的一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述步骤(2.2)具体为:5. a kind of encoding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.2) is specifically: (3.1)将正向以及翻转互补后的复制起始位点分别与重叠群比对编辑距离,根据编辑距离的大小判断该复制起始位点是否存在;若存在,记录复制起始位点此时的状态以及在重叠群上的位置;(3.1) Compare the editing distance between the forward and reverse complementary origins of replication with the contigs respectively, and judge whether the origin of replication exists according to the size of the editing distance; if there is, record the origin of replication. state at time and position on the contig; (3.2)若复制起始位点为正向的,将复制起始位点两侧的测序读段放入该数据段对应的缓存区中;若复制起始位点为翻转互补后的,则将复制起始位点两侧的测序读段翻转互补后放入该数据段对应的缓存区;(3.2) If the origin of replication is forward, put the sequencing reads on both sides of the origin of replication into the buffer area corresponding to the data segment; if the origin of replication is inverted and complementary, then The sequencing reads on both sides of the origin of replication are flipped to complement each other and put into the buffer area corresponding to the data segment; (3.3)重复步骤(3.1)和(3.2),直到所有包含复制起始位点或部分复制起始位点的读段被全部标记与分配完毕。(3.3) Steps (3.1) and (3.2) are repeated until all the reads including the origin of replication or part of the origin of replication are all marked and assigned. 6.根据权利要求3所述的一种采用大片段DNA存储的编码与恢复方法,其特征在于,所述步骤(2.3)具体为:6. a kind of encoding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.3) is specifically: (4.1)DNA数据段长度为n/2,假设数据段对应的缓存区中部分测序读段共q条,构造一个qn的数据块,将截取的测序读段放入数据块中,若截取数据段长度不足n/2,则用‘X’表示,X表示删除,在后续过程中转化为比特删除;(4.1) The length of the DNA data segment is n/2. Assuming that there are q pieces of sequencing reads in the buffer corresponding to the data segment, a qn data block is constructed, and the intercepted sequencing reads are put into the data block. If the data is intercepted If the segment length is less than n/2, it is represented by 'X', and X represents deletion, which is converted into bit deletion in the subsequent process; (4.2)将填充块每个位置i上的碱基进行大数判决,即取填充块中每个位置上{A,T,G,C}中数目最多的一种碱基作为拼接序列该位置上的碱基,若该位置没有填入碱基则记为‘X’,得到长度为n/2的DNA数据序列,进一步将其转化为比特序列,‘X’将转化为比特删除。(4.2) Perform a large number judgment on the bases at each position i of the filling block, that is, take the largest number of bases in {A, T, G, C} at each position in the filling block as the position of the splicing sequence If there is no base in the position, it is marked as 'X', and a DNA data sequence of length n/2 is obtained, which is further converted into a bit sequence, and 'X' will be converted into bit deletion.
CN202110601792.8A 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage Pending CN113345521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110601792.8A CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110601792.8A CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Publications (1)

Publication Number Publication Date
CN113345521A true CN113345521A (en) 2021-09-03

Family

ID=77472965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110601792.8A Pending CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Country Status (1)

Country Link
CN (1) CN113345521A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254748A (en) * 2021-09-27 2022-03-29 清华大学 Extended coding method, system and related device for storage channel
CN114783530A (en) * 2022-05-16 2022-07-22 南京大学 An error correction scheme for protein storage coding
CN115459781A (en) * 2022-08-25 2022-12-09 东南大学 Long sequence DNA storage coding method based on static interleaving coding
CN115514375A (en) * 2022-11-18 2022-12-23 江苏网进科技股份有限公司 Cache data compression method
CN116257146A (en) * 2023-05-15 2023-06-13 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium
CN116418997A (en) * 2021-12-28 2023-07-11 中国电信股份有限公司 Characteristic data compression method, device and system, electronic equipment and storage medium
WO2023130676A1 (en) * 2022-01-10 2023-07-13 天津大学 Dna storage cascade encoding and decoding methods for type-1 and type-2 segmented error correction internal codes
CN119479733A (en) * 2024-11-08 2025-02-18 天津大学 A fast readout method for storing large DNA fragments with watermark superposition coding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060734A (en) * 2019-03-29 2019-07-26 天津大学 A kind of high robust DNA sequencing bar code generating at and read method
CN110442472A (en) * 2019-07-03 2019-11-12 天津大学 A kind of DNA data storage mixing error correcting and data reconstruction method
CN110569974A (en) * 2018-06-06 2019-12-13 天津大学 Hierarchical representation and interleaved encoding method for DNA storage that can contain artificial bases

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569974A (en) * 2018-06-06 2019-12-13 天津大学 Hierarchical representation and interleaved encoding method for DNA storage that can contain artificial bases
CN110060734A (en) * 2019-03-29 2019-07-26 天津大学 A kind of high robust DNA sequencing bar code generating at and read method
CN110442472A (en) * 2019-07-03 2019-11-12 天津大学 A kind of DNA data storage mixing error correcting and data reconstruction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈为刚 等: ""细胞内大片段DNA数据存储的多RS码交织编码"", 《合成生物学》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254748A (en) * 2021-09-27 2022-03-29 清华大学 Extended coding method, system and related device for storage channel
CN116418997A (en) * 2021-12-28 2023-07-11 中国电信股份有限公司 Characteristic data compression method, device and system, electronic equipment and storage medium
WO2023130676A1 (en) * 2022-01-10 2023-07-13 天津大学 Dna storage cascade encoding and decoding methods for type-1 and type-2 segmented error correction internal codes
CN114783530A (en) * 2022-05-16 2022-07-22 南京大学 An error correction scheme for protein storage coding
CN115459781A (en) * 2022-08-25 2022-12-09 东南大学 Long sequence DNA storage coding method based on static interleaving coding
CN115514375A (en) * 2022-11-18 2022-12-23 江苏网进科技股份有限公司 Cache data compression method
CN116257146A (en) * 2023-05-15 2023-06-13 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium
CN116257146B (en) * 2023-05-15 2023-07-14 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium
CN119479733A (en) * 2024-11-08 2025-02-18 天津大学 A fast readout method for storing large DNA fragments with watermark superposition coding

Similar Documents

Publication Publication Date Title
CN113345521A (en) Coding and recovering method using large fragment DNA storage
CN111600609B (en) A DNA Storage Coding Method for Optimizing Chinese Storage
CN110569974B (en) Hierarchical representation and interleaving encoding method for DNA storage that can contain artificial bases
CN116707541B (en) A DNA storage method with pseudo-noise sequence encoding
CN110706751A (en) A DNA storage encryption coding method
Wang et al. High capacity DNA data storage with variable-length Oligonucleotides using repeat accumulate code and hybrid mapping
CN107798219B (en) Methods of biologically storing and restoring data
WO2018148260A1 (en) Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)
CN105022935A (en) Encoding method and decoding method for performing information storage by means of DNA
CN110708076B (en) A Hybrid Model-Based Encoding and Decoding Method for DNA Storage
CN112802549B (en) Coding and decoding method for DNA sequence integrity check and error correction
CN110060734B (en) A Barcode Generation and Reading Method for Highly Robust DNA Sequencing
CN111858507A (en) DNA-based data storage method, decoding method, system and device
CN118841094B (en) Space layered DNA storage method of large-scale oligonucleotide pool
CN114974429A (en) DNA storage coding method and device based on decimal system and readable storage medium
Song et al. Super-robust data storage in DNA by de Bruijn graph-based decoding
CN111243670A (en) A DNA information storage and coding method that satisfies biological constraints
Conde-Canencia et al. Nanopore DNA sequencing channel modeling
Park et al. BIC codes: bit insertion-based constrained codes with error correction for DNA storage
JP2005522138A5 (en)
CN113687976B (en) Coding and decoding method and device for DNA information storage
CN1751443A (en) Error correction encoding device and method and error correction decoding device and method
CN110190858B (en) Polymer molecule information storage error correction coding and decoding system
CN116187435B (en) Method and system for storing information by utilizing DNA (deoxyribonucleic acid) based on large and small fountain codes and MRC (MRC) algorithm
CN115470035A (en) Soft-decision information decoding method and encoding method for DNA storage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210903

WD01 Invention patent application deemed withdrawn after publication