CN113345521A

CN113345521A - Coding and recovering method using large fragment DNA storage

Info

Publication number: CN113345521A
Application number: CN202110601792.8A
Authority: CN
Inventors: 陈为刚; 郭健; 葛奇; 韩昌彩
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2021-09-03

Abstract

The invention discloses a coding and recovery method using large-segment DNA storage, belonging to the field of DNA data storage. The invention divides binary data into several sub-blocks after scrambled, and adopts very high code rate grouping error correction for each sub-block. After coding, different sub-blocks are symbol-interleaved and transcoded into data DNA units through the “symbol/bit-base sequence” mapping process, which is alternately mosaicked with the origin of replication to form an artificial chromosome sequence; data read At the time of publication, second-generation high-throughput sequencing was used to combine read de novo assembly, contig merging guided by known biological sequences such as promoter sequences, and efficient error-correction and erasure-correction decoding to achieve reliable data recovery. The advantage of the present invention is that it realizes the multi-element matching of the interleaved packet error correction code coding method, the large segment DNA logic structure, the second-generation high-throughput sequencing, and the de novo assembly method, with very low overhead, and makes full use of the high code rate correction method. The error correction ability of the error code can realize a very high base logic density.

Description

Coding and recovering method using large fragment DNA storage

Technical Field

The invention relates to the field of DNA data storage, in particular to a coding and recovery method adopting large-fragment DNA storage.

Background

With the rapid development of contemporary information technology, the global data volume exhibits explosive growth. According to the forecast of international data corporation, the global data volume will increase from 44ZB in 2020 to 175ZB in 2025, and the existing storage methods have not been able to satisfy the human demand for storing massive digital information. Currently, storage media are used which mainly include: optical disc, magnetic tape, flash memory, etc., the storage mode mainly includes: computer-based stand-alone storage, network storage, distributed storage, cloud storage, and the like. However, there are many challenges with existing storage media and storage methods. First, the density of storage media is approaching a limit, making it difficult to meet the rapidly increasing demand for data. Second, the currently established data storage centers face many challenges of size, cost, power consumption, security, and the like. Third, the effective storage time of the conventional storage medium is short, generally within several decades, for example, the effective storage time of a magnetic tape is 30-50 years, which cannot meet the long-term storage and backup requirements of important data, and the storage medium such as a magnetic tape and an optical disc is easily damaged, resulting in the loss of stored data.

Artificially synthesized deoxyribonucleic acid (DNA) is used as a potential data storage medium, has high storage density, long service life and low storage energy consumption, and is expected to become one of important choices for storing mass offline data in the future. The american society for semiconductor industry and semiconductor research corporation promulgated "semiconductor decade program" in 2021, listed DNA data storage as one of the major storage modes for large amounts of data in parallel with hard disks, solid state disks, and magnetic tapes, and became an important direction for global storage industry competition in the future. The DNA storage mainly uses 4 different deoxyribonucleotides of oligonucleotide, namely A (adenine), T (thymine), G (guanine) and C (cytosine) to represent data, and the storage of mass data through artificially synthesized DNA sequences or oligonucleotide molecules is an emerging data storage mode. With the increasing data volume generated in human production and life, the importance of data storage is also highlighted, and DNA data storage becomes a large-scale mass data storage mode with great potential in the future.

DNA data storage data was stored using { A, T, G, C } of oligonucleotide molecules. The current modes of DNA data storage mainly include: short fragment oligonucleotide pool (Oligo pool) storage, intracellular DNA storage, and the like. The oligonucleotide pool storage of short fragments develops rapidly by means of high-throughput chip synthesis and sequencing technology of DNA, but still has great challenges in large-scale equilibrium amplification and replication cost. The method has the advantages that intracellular DNA data storage, particularly intracellular large-fragment DNA storage, short DNA fragments are assembled into long fragments by means of an in-vivo assembly method, efficient amplification is achieved by means of in-vivo replication, replication cost is low, and the method has potential application value in scenes such as large-scale data distribution and the like.

The logical structure design of large-fragment DNA data storage is different from the data storage based on oligonucleotide pool (Oligo pool), the ratio of index and primer (or similar units, such as skeleton in yeast artificial chromosome) is relatively low, and the logical structure design has certain advantages in base utilization rate. The data reading phase requires de novo (de novo) assembly of sequencing reads, similar to de novo (de novo) sequencing of the genome of new species. The reading of large-fragment DNA data storage has high requirement on algorithm real-time performance, and the traditional bioinformatics processing flow is not suitable. Aiming at the characteristic, when the data is recovered, a completely error-free DNA sequence is not required to be obtained in the step of reading assembly, and residual errors after assembly are corrected by using an error correcting code, so that a perfect DNA sequence without any base errors is obtained. Therefore, how to design a proper error correction coding scheme to ensure the reliability of large-fragment DNA data storage becomes an important issue.

The application of error correction coding technology in the field of DNA data storage is an important means for reducing writing cost and ensuring data reliability, and is a very potential technology. On one hand, due to the limitation of biochemical conditions, the error rate of oligonucleotide molecule synthesis is high, and generally, one synthesis error occurs in hundreds of bases; if the accuracy of oligonucleotide synthesis is further improved, the yield of oligonucleotide molecules is reduced, and the cost is greatly increased. Therefore, the adoption of an efficient error correction coding technique is an important method for reducing the writing cost and ensuring the data reliability.

However, according to the fundamental theory of channel capacity and channel coding of shannon information theory, the error correcting code needs to be matched with the error characteristics of writing/reading, and only reliable and efficient data storage can be realized. Several important error correcting codes in the field of digital communication have been tried in vitro DNA data storage. For example, digital fountain codes are used to correct deletion errors caused by oligonucleotide molecule loss, reed-solomon (RS) codes to correct base deletions and random errors, and product codes composed of Low Density Parity Check (LDPC) codes and RS codes to correct deletions and random errors, but these error correction coding methods do not fully match the error characteristics of writing/reading, because they do not use the characteristic that the error characteristics of a plurality of bits corresponding to each base symbol are correlated. The coding method for storing the in-vivo large-fragment DNA adopts the watermark code consisting of the LDPC code and the pseudorandom sequence, and aims at the high error rate of the third-generation nanopore sequencing, and the base insertion/deletion error which is difficult to process is mainly considered. The coding efficiency of the method is low, 1.19bit/bp, and the theoretical limit density of 2bit/bp of information represented by 4 bases { A, T, G, C } is still large in difference.

Disclosure of Invention

The invention provides a coding and recovery method by using large-segment DNA storage, which realizes the multi-element matching of interleaved grouped error correcting code coding, large-segment DNA logic structure, next-generation high-throughput sequencing and de novo assembly method, has very high base logic density, and is described in detail as follows:

a method of encoding and recovering using large fragment DNA storage, the method comprising:

(1) scrambling binary data, and uniformly processing P sub-blocks as a processing unit, wherein the size of each sub-block is k bits; each sub-block is grouped, error correcting code coding is carried out, different sub-blocks are interleaved, transcoding is carried out through the mapping process of 'symbol/bit-base sequence' to obtain P data DNA sequences, and the P data DNA sequences and P-1 replication starting site sequences are alternately embedded to form an artificial chromosome sequence;

(2) transferring the artificial chromosome into cells, and performing second-generation high-throughput sequencing after cell proliferation and nucleic acid extraction processes;

(3) assembling the sequencing read from the head to obtain a group of contigs, determining the positions of the data segments on the contigs according to the known replication start site, intercepting partial sequencing read corresponding to the data segments, then carrying out majority merging to obtain P data segments, deinterleaving the data segments, and sending the data segments to a decoder for decoding to obtain an information sequence.

The step (1) is specifically as follows:

(1.1) superimposing binary data with the length of N with a known pseudo-random sequence to realize scrambling, dividing the binary data into P sub-blocks, wherein the length of each sub-block is k, k is N/P, P is a positive integer and is a factor of N, and N/P is an integer;

(1.2) coding each sub-block by adopting grouped error correcting code coding, wherein the information bit length of the error correcting code is k bits, and the length of each sub-block after coding is n bits;

(1.3) sequencing the encoded subblocks according to a P row mode, wherein each row comprises n bits, decomposing the n bits into a plurality of groups of P-P bit or symbol units, and interweaving each unit respectively according to a shift cycle mode; if the bit cells are, the number of the bit cells is n; if the processing is carried out according to the symbols, the number of the symbols corresponding to n bits is n/s, and each symbol comprises s bits;

(1.4) obtaining P bit groups by the interleaved subblocks, wherein the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with the length of n;

(1.5) bit grouping according to the mapping rule: 00 → A, 01 → T, 10 → G, 11 → C to be transcoded into P data DNA sequences of length n/2.

The technical scheme provided by the invention has the beneficial effects that:

1. the coding method provided by the invention realizes that the data file is coded into a plurality of DNA data units, the DNA data units are further alternately embedded and combined with the replication starting site, a flexible intracellular DNA data storage structure is constructed, and the storage of data in large-fragment DNA is realized;

2. the invention is based on the second generation high-throughput sequencing, adopts the proposed data recovery method, and can realize reliable and high-efficiency DNA data reading;

3. the invention realizes the multi-element matching of the interleaved grouped error correcting code coding, the large-fragment DNA logic structure, the next generation high-throughput sequencing and the de novo assembly method, has very low overhead, fully utilizes the error correcting capability of the high-code-rate error correcting code and can realize very high base logic density.

Drawings

FIG. 1 is a block diagram of a system for implementing the solution of the present invention;

FIG. 2 is a flow chart of the encoding of a packet error correction code in the present invention;

FIG. 3 is a flowchart of recovery in data read-out according to the present invention;

FIG. 4 is a schematic diagram of a computer-based long DNA fragment data storage simulation platform according to the present invention;

FIG. 5 is a graph of the distribution of the number of base errors (edit distance) of simulated reads in accordance with the present invention;

FIG. 6 is a graph showing a distribution of the number of base errors (edit distance) of real reads according to the present invention;

FIG. 7 is a graph of the base error location distribution of simulated reads according to the present invention;

FIG. 8 is a base error location distribution diagram of real reads according to the present invention;

FIG. 9 is a schematic diagram showing the analysis of the base deletion result in each experiment in the present invention;

FIG. 10 is a schematic diagram showing analysis of a base error result in each experiment in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

Aiming at the error characteristics that the large-fragment DNA data storage in cells is adopted and the information coding method is suitable for sequencing and reading assembly, the embodiment of the invention provides a coding and recovery method adopting the large-fragment DNA storage. The binary data is divided into a plurality of sub-blocks after being scrambled, after each sub-block is coded by adopting a grouped error correcting code, different sub-blocks are subjected to symbol interleaving and are transcoded into data DNA units through the mapping process of 'symbol-bit-base sequence', and the data DNA units and the replication start sites are alternately inlaid to form an artificial chromosome sequence; when data is read out, the second-generation high-throughput sequencing is utilized, and read segment de novo (de novo) assembly, copy initial site guided contig merging and efficient error correction erasure decoding are combined, so that reliable data recovery is realized.

The method provided by the invention not only can realize reliable storage of data, but also realizes multi-element matching of an interleaved grouped error correction code encoding method, a large-fragment DNA logic structure, next-generation high-throughput sequencing and a de novo assembly method, has very high base logic density, and fully explains the advantages of large-fragment DNA storage. In addition, the method provided by the invention can construct a flexible intracellular DNA data structure and meet the requirement of designing a data file with the size of hundreds of Kb to Mb.

The invention provides a coding and recovery method adopting large-segment DNA storage, which aims at the field of DNA data storage and is basically characterized in that binary data are scrambled, subjected to grouped error correction code coding and symbol interleaving, mapped to obtain a data DNA unit, the data DNA unit and an ARS sequence are alternately embedded to obtain an artificial chromosome, and after second-generation high-throughput sequencing, read segment de novo assembly, ARS guided contig combination and efficient error correction erasure correction decoding are combined to realize reliable data recovery. The invention can ensure the reliable storage of data and has very high base logic density.

The specific embodiment of the invention is given below, the embodiment designs and simulates an 2,489,847bp circular artificial chromosome, a 589KB picture and a 2.65KB txt text are stored in the circular artificial chromosome by using a plurality of RS code symbol interleaving coding methods with extremely high code rates, and the total logic density reaches 1.947bit/bp, which is higher than the current main design method and is very close to the theoretical 2 bit/bp. Wherein the selected RS code is defined in the finite field GF (2)¹²) The RS code above (4095, 4040, t ═ 27), the length of the information bit portion is 4040 symbols, the code rate is 0.987, and each symbol comprises 12 bits.

The coding method for interleaving the RS codes with a plurality of extremely high code rates specifically comprises the following steps:

(1.1) superposing binary data with a known pseudo-random sequence to realize scrambling, and dividing the scrambled binary data into 100 subblocks with the same size;

(1.2) encoding each subblock by adopting RS code encoding, wherein the information bit length is 4040 symbols, and the length of each subblock after encoding is 4095 symbols;

(1.3) sorting the coded sub-blocks in a column mode, then decomposing the coded sub-blocks into a plurality of groups of 100 x 100 units, and interweaving each unit in a diagonal cycle mode;

(1.4) 4095 symbols per data segment, converted to 49,140 bits, as a basic packet;

(1.5) adjacent two bits in each packet are transcoded to 1 base according to the mapping rule of 00 → A, 01 → T, 10 → G, 11 → C. Each data segment is converted into 24,570bp DNA data sequence;

(1.6) alternately inlaying 100 DNA data segments with the length of 24,570bp and 99 ARSs with shorter lengths, and further adding a vector skeleton sequence to obtain a circular artificial chromosome with the length of 2,489,847 bp.

In the earlier stage, the embodiment of the invention designs and assembles an 254,886bp yeast artificial chromosome, adopts second-generation high-throughput sequencing to obtain a group of real sequencing data, and takes the group of real sequencing data as reference to carry out sequencing simulation on the artificial chromosome, and the specific steps are as follows:

(2.1) sending the real sequencing data into second-generation sequencing simulation software ART for parameter training to obtain a configuration simulation file;

and (2.2) inputting the configuration simulation file, the artificial chromosome sequence and other parameters into second-generation sequencing simulation software for sequencing simulation. The coverage selected in this embodiment is 20x, 25x, 30x, and each coverage generates 10 sets of independent test data with ART to obtain the corresponding sequencing read of the double-end read PE 150.

The specific recovery method during data reading is as follows:

(3.1) assembling the sequencing reads by using second-generation sequence assembling software Velvet and ABySS based on a Debrelain map to obtain a group of contigs;

(3.2) identifying ARS sequences in each contig, determining the position of a data read according to the ARS sequences and intercepting a corresponding part of sequencing reads;

(3.3) carrying out large number combination on the part of the sequencing reads corresponding to each intercepted data read to obtain a combined sequence of 100 data reads, and further converting the combined sequence into a symbol sequence;

(3.4) de-interleaving the 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;

and (3.5) sending the 100 code words obtained by de-interleaving into a decoder for block error correction erasure decoding to obtain an information sequence.

The assembly of the second-generation sequencing reads is realized by using second-generation sequence assembly software such as Velvet or ABySS based on a Debrelain atlas under a plurality of k-mer values with different lengths, and a group of contigs is obtained specifically as follows:

(1) assembling the second-generation sequencing reads by Velvet or AbySS under the k-mer value with fixed length to obtain a group of contigs;

(2) assembling second-generation sequencing reads by Velvet or AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs;

(3) and assembling the second-generation sequencing reads by Velvet and AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs.

The specific steps of identifying the ARS sequence in each contig, determining the position of the data read according to the ARS sequence and intercepting the corresponding partial sequencing read are as follows:

(4.1) respectively comparing the ARS sequences subjected to forward and reverse complementation with the contig for an editing distance, wherein the editing distance is less than 20% of the length of the ARS sequence, judging that the ARS exists, and recording the state of the ARS at the moment and the position of the ARS on the contig;

(4.2) if the ARS is positive, placing sequencing reads on two sides of the ARS into a cache region corresponding to the data segment; if the ARS is reversed and complemented, the sequencing reads on the two sides of the ARS are reversed and complemented and then are placed into a cache region corresponding to the data segment;

(4.3) repeating steps (4.1) and (4.2) until all reads containing ARS sequences (or partial ARS sequences) are fully marked and assigned.

The method comprises the following specific steps of carrying out majority merging on partial sequencing reads corresponding to each intercepted data read to obtain a merged sequence of P data reads, and further converting the merged sequence into a symbol sequence:

and (5.1) the length of the data segment is n, and an x multiplied by n data block is constructed by assuming that the number of partial sequencing reads in the cache region corresponding to the data segment is x. Putting the intercepted sequencing read into a data block, and if the length of the intercepted data segment is less than n, completing by using 'X';

(5.2) performing majority judgment on the base at each position i of the filling block, namely taking one base with the largest number in { A, T, G, C } at each position in the filling block as the base at the position of the splicing sequence, marking the base as 'X' if the base is not filled in the position, obtaining a data segment with the length of n, and further converting the data segment into a symbol sequence.

The specific steps of sending the data segments into a decoder for decoding after deinterleaving to obtain an information sequence are as follows:

(6.1) de-interleaving 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;

and (6.2) performing error correction and erasure correction decoding on 100 code words of the RS code obtained by de-interleaving to obtain data segments.

First, as shown in FIG. 4, the present invention establishes a computer-based long DNA fragment data storage simulation platform. In the simulation experiment, ART software was chosen to generate sequencing reads to simulate the sequencing process. Although direct synthesis and sequencing experiments are not carried out, the method is calibrated and verified by utilizing early wet experimental data, so that a simulation result has better reliability, the fusion of the wet experiments and simulation design is realized to a certain extent, and the simulation process is more reasonable.

Further, the present invention analyzes the simulated sequencing data and the end-to-end storage recovery performance. FIGS. 5, 6, 7 and 8 show the comparison of the generated simulated sequencing data with the 254Kb artificial chromosome sequencing data obtained in the previous experiment. FIGS. 5 and 6 compare the number of sequencing errors in reads. As can be seen from this figure, it is,the number of erroneous reads is around 20%. The quality of the simulated reads is slightly worse than the measured data, which better explains the error correction capability of the method. FIGS. 7 and 8 compare sequencing in reads versus position in reads. As can be seen, the second generation sequencing reads contain insertion, deletion and substitution errors, and the error rate is 10^-4～10^-3Left and right. The error rate of insertions and deletions is significantly lower than the substitution error probability. It can be seen from the figure that the generated sequencing data features are very consistent with the actual second generation sequencing data features, which indicates that the method of simulation for generating sequencing data has better feasibility.

Then, the present embodiment performs simulation test and analysis on the encoding scheme and the data recovery scheme under different coverage degrees. In the test, coverage degrees of 20x, 25x and 30x are selected, each coverage degree generates 10 groups of independent test data by ART, and 10 independent parallel assembling and decoding experiments are carried out. The assembly software selected was Velvet and ABySS, several different k-mer values were selected for each assembly method, and the specific simulation results are given in Table 1. Fig. 9 and 10 give an illustration of the assembled error behavior at different sequencing coverage, confirming the rationality of the proposed multiple RS code interleaving and performing erasure correction schemes. As can be seen from FIGS. 9 and 10, as sequencing coverage increases, the performance of the multi-software multi-parameter assembly method improves, the number of residual errors and deletions decreases rapidly, and reliable recovery can be achieved. At a sequencing coverage of 20x, the error rates of the Velvet and ABySS assemblies alone are both at the edge of the error correction capability of the scheme (correcting 1.34% of symbol erasures, or correcting 0.66% of symbol errors), and there are cases where data recovery fails, as detailed in table 1. Given the large variability of the assembly and contig merging strategy, fig. 9 and 10 show the specific results of each experiment (e.g., "x" and "□" in the figure).

TABLE 1 specific data recovery analysis with interleaving of multiple RS codes

From the simulation results in table 1, it can be seen that when the coverage is 25x and 30x, 10 parallel experiments of all schemes are successfully decoded, and the robustness of the encoding method and the data recovery method is verified. Meanwhile, compared with the results of adopting a single k-mer, the multi-method and multi-k-mer have the advantages that the performance of the read assembly is obviously improved, and the fragment deletion and the errors are obviously reduced. Table 1 lists only the error conditions of large segments, and these errors after assembling and ARS identification are interleaved, so that the error correction capability of multiple high-rate RS codes can be fully utilized to obtain a very high success rate, and data recovery is successful under 25x and 30x in experimental tests. Table 1 also lists all cases where each data segment cannot be successfully decoded by independently encoding it with RS codes of the same parameters without using the interleaving scheme. If multiple RS code interleaving is not adopted, each data block is encoded by using a separate RS code, and since the RS code with a high code rate has limited error correction and erasure correction capability (correcting Nerasure 55 erasures, or Nerror 27 errors, or 2 Nerror + Nerasure 55), there may be a case where some segments cannot be decoded correctly, and data cannot be completely recovered.

Under 20 × sequencing data, contigs assembled with single k-mers had high error rates and could not be decoded correctly, and the data size in this section was large and not listed in table 1. However, the case of combining multiple k-mer values still achieves better performance. Firstly, Velvet software multi-k-mer assembly is adopted, 10 independent experiments are carried out, decoding failure after de-interleaving occurs only for the third time, and decoding is correct under other conditions. Furthermore, in the mixed assembly of Velvet and ABySS, all 10 groups of independent experiments obtain gains, and all decoding succeeds after de-interleaving. Namely, the scheme of the invention is verified to realize reliable data storage and have very high base logic density.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. a coding and recovery method using large fragment DNA storage, is characterized in that, described method comprises:

(1) Scrambling the binary data, the P sub-blocks are processed uniformly as a processing unit, and the size of each sub-block is k bits; The mapping process of "bit-base sequence" is transcoded into P data DNA sequences, which are alternately mosaicked with P-1 origin of replication sequences to form artificial chromosome sequences;

(2) The artificial chromosome is transferred into the cell, and after the process of cell proliferation and nucleic acid extraction, second-generation high-throughput sequencing is performed;

(3) Assemble the sequencing reads from scratch to obtain a set of contigs, determine the position of the data segment on the contig according to the known origin of replication, cut off part of the sequencing reads corresponding to the data segment, and perform large number merging to obtain The P data segments are deinterleaved and then sent to a decoder for decoding to obtain an information sequence.

2. a kind of coding and recovery method adopting large fragment DNA storage according to claim 1, is characterized in that, described step (1) is specially:

(1.1) The binary data of length N is superimposed with a known pseudo-random sequence to realize scrambling code, and divided into P sub-blocks, each sub-block length is k, where k=N/P, P is a positive integer, is N The factor of , N/P is an integer;

(1.2) each sub-block is encoded by using block error correction code encoding, the information bit length of the error correction code is k bits, and the length of each sub-block after encoding is n bits;

(1.3) The encoded sub-blocks are sorted in the manner of P lines, each line contains n bits, which are decomposed into several groups of P*P bits or symbol units, and each unit is performed in a shift cycle manner. Interleaving; if it is a bit unit, the number of bit units is n; if it is processed according to symbols, it is the number of symbols corresponding to n bits n/s, and each symbol includes s bits;

(1.4) The interleaved sub-blocks obtain P bit groups, the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with a length of n;

(1.5) The bit grouping is transcoded into P data DNA sequences of length n/2 according to the mapping rules: 00→A, 01→T, 10→G, 11→C.

3. a kind of coding and recovery method adopting large fragment DNA storage according to claim 1, is characterized in that, described step (3) is specially:

(2.1) Assembling the sequenced reads using the second-generation sequence assembly program based on the DeBrain map to obtain a set of contigs;

(2.2) Identify the replication origin sequence in each contig, determine the position of the data read segment according to the position of the replication origin sequence, and intercept the corresponding sequencing data read segment;

(2.3) performing large number merging on the partial sequencing reads corresponding to each intercepted data read to obtain a combined sequence of P data reads;

(2.4) de-interleave P data segments according to the grouping interleaving sequence to obtain P code words, and each code word has errors and deletions of bits or symbols;

(2.5) The P codewords obtained by deinterleaving are sent to the decoder for packet error correction and erasure correction decoding to obtain an information sequence.

4. a kind of coding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.1) specifically has three kinds of assembly methods, respectively:

(1) Use the same second-generation sequence assembly software Velvet to assemble the second-generation sequencing reads at a fixed-length k-mer value to obtain a set of contigs;

(2) Use the same second-generation assembly software to assemble the second-generation sequencing reads under multiple k-mer values of different lengths, and merge the obtained groups of contigs;

(3) Use different second-generation assembly software to assemble the second-generation sequencing reads under multiple k-mer values of different lengths, and merge the obtained multiple groups of contigs.

5. a kind of encoding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.2) is specifically:

(3.1) Compare the editing distance between the forward and reverse complementary origins of replication with the contigs respectively, and judge whether the origin of replication exists according to the size of the editing distance; if there is, record the origin of replication. state at time and position on the contig;

(3.2) If the origin of replication is forward, put the sequencing reads on both sides of the origin of replication into the buffer area corresponding to the data segment; if the origin of replication is inverted and complementary, then The sequencing reads on both sides of the origin of replication are flipped to complement each other and put into the buffer area corresponding to the data segment;

(3.3) Steps (3.1) and (3.2) are repeated until all the reads including the origin of replication or part of the origin of replication are all marked and assigned.

6. a kind of encoding and recovery method adopting large fragment DNA storage according to claim 3, is characterized in that, described step (2.3) is specifically:

(4.1) The length of the DNA data segment is n/2. Assuming that there are q pieces of sequencing reads in the buffer corresponding to the data segment, a qn data block is constructed, and the intercepted sequencing reads are put into the data block. If the data is intercepted If the segment length is less than n/2, it is represented by 'X', and X represents deletion, which is converted into bit deletion in the subsequent process;

(4.2) Perform a large number judgment on the bases at each position i of the filling block, that is, take the largest number of bases in {A, T, G, C} at each position in the filling block as the position of the splicing sequence If there is no base in the position, it is marked as 'X', and a DNA data sequence of length n/2 is obtained, which is further converted into a bit sequence, and 'X' will be converted into bit deletion.