CN112582030A

CN112582030A - Text storage method based on DNA storage medium

Info

Publication number: CN112582030A
Application number: CN202011508358.7A
Authority: CN
Inventors: 刘文斌; 昝乡镇; 姚祥宇; 许�鹏; 方刚; 陈智华; 石晓龙
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-30
Anticipated expiration: 2040-12-18
Also published as: CN112582030B

Abstract

The invention provides a text storage method based on a DNA storage medium, comprising the following steps: acquiring an original text and encoding it to obtain DNA storage sequences; synthesizing the DNA storage sequences to obtain DNA molecule sequences, amplifying the DNA molecule sequences, and storing the amplified DNA molecule sequences; and obtaining the stored DNA molecule sequences and transcoding them to recover the original text. The transcoding comprises: sequencing the stored DNA molecule sequences to obtain reads (read lengths) of the DNA molecule sequences; preprocessing the reads to remove noise data; and transcoding the preprocessed reads to obtain the original text. The method recovers the text directly from the sequencing reads, dispenses with much of the redundant coding, and improves storage efficiency. It makes full use of the semantic information in the original text during transcoding and decoding, has strong query processing capability, and can be widely applied in the technical field of system biology research.

Description

Text storage method based on DNA storage medium

Technical Field

The invention relates to the technical field of system biology research, in particular to a text storage method based on a DNA storage medium.

Background

With the development of distributed computing, cloud computing and Internet of Things technologies, the total amount of data generated every day is growing exponentially. Traditional magnetic, optical and electrical storage technologies cannot meet the future storage demand of this exponentially growing mass of data. In addition, semiconductor-based general-purpose processors (CPUs) and application-specific integrated circuits (ASICs) face persistent difficulties in power consumption, size, reliability and the like. The search for new information storage modes has therefore become a key fundamental problem for the sustainable development of information technology. As the carrier of genetic information, DNA molecules offer high storage density, small volume, good storage stability and low energy consumption, and they can potentially be combined with biological computing to realize a novel data processing mode that integrates storage and computation. The general procedure for DNA storage is as follows: a binary file in the computer is first encoded into base sequences; synthesis, amplification and sequencing are then carried out; and the original information is recovered from the sequenced base sequences. Because DNA strands are prone to base deletion, insertion and substitution errors during synthesis, storage and sequencing, most current research adds a large amount of redundant coding to the original input information: for example, an inner code handles base errors within a sequence, and an outer code handles the loss of whole sequences. While the prior art has its advantages, its disadvantages are also apparent: storage efficiency is low, the decoding process is complex, semantic information is not utilized, and information query processing capability is poor.

Disclosure of Invention

In view of the above, to at least partially solve one of the above technical problems, embodiments of the present invention provide a text storage method based on a DNA storage medium, which can achieve convenient and efficient text indifference storage.

In a first aspect, the present invention provides a text storage method based on a DNA storage medium, comprising the steps of:

acquiring an original text, and coding the original text to obtain a DNA storage sequence;

synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence;

obtaining a stored DNA molecule sequence, and transcoding to obtain the original text;

the transcoding to obtain the original text comprises the following steps:

sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;

and preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text.

In a possible embodiment of the present disclosure, the step of obtaining an original text, and encoding the original text to obtain a DNA storage sequence includes:

generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;

generating byte check codes according to characters in the original text;

and constructing the DNA storage sequence according to the index value, the byte check code and the text data consisting of the coding base sequence.

In a possible embodiment of the present disclosure, the step of generating a byte check code according to characters in the original text includes:

coding characters in the original text through the codes to obtain a binary character string;

and carrying out grouped base coding according to the binary character string to obtain the byte check code.

In a possible embodiment of the present disclosure, the step of preprocessing the read length, removing noise data from the read length, and transcoding the preprocessed read length to obtain the original text includes:

acquiring the read length, and performing reverse pushing according to a coding rule to obtain a decoded character line;

correcting the error of the decoded character line to obtain a decoded text character line;

and obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.

In a possible embodiment of the present application, the step of preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text further includes:

and, for an erroneous base unit in a read, determining the character with the minimum Hamming distance as the decoded character of that unit.

In a possible embodiment of the present disclosure, the step of obtaining a plurality of packets according to the decoded text character lines and the text content includes:

dividing according to the index value of the decoded text character line to obtain a plurality of groups, and determining the text similarity of group members;

performing secondary division on the group members according to the text similarity, wherein the secondary division comprises at least one of the following steps:

adding the members with the text similarity smaller than the first threshold value to other groups according to a preset first threshold value;

determining the average value of the text similarity, and deleting the group members according to the average value;

and clustering the members which do not belong to the group according to the text similarity to obtain a new group.

In a possible embodiment of the present disclosure, the step of decoding the packet to obtain an original text includes:

determining a weight value for a character in the decoded text character line in the packet;

determining a unique length value for the packet such that a length value of a decoded text character line in the packet is the same as the unique length value;

and determining characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining to obtain the original text.

Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:

The method encodes an original text into base sequences, synthesizes and amplifies the base sequences, and stores the amplified DNA molecule sequences; it then sequences the stored DNA molecule sequences to obtain reads, deletes the noisy reads, and recovers the original text from the remaining reads. The method recovers the text directly from the sequencing reads, dispenses with much of the redundant coding, improves storage efficiency, makes full use of the semantic information in the original text during transcoding and decoding, and has strong query processing capability.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart illustrating steps of a method for storing text based on a DNA storage medium according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a DNA storage sequence according to an embodiment;

FIG. 3 is a flowchart illustrating the grouping steps according to the decoded text character lines and the text content in the embodiment;

FIG. 4 is a histogram showing the accuracy of restoring English text under different error rates and sequencing depths.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

In a first aspect, as shown in FIG. 1, the present application provides a text storage method based on a DNA storage medium, comprising steps S01-S03:

S01, acquiring the original text, and encoding the original text to obtain a DNA storage sequence.

Taking English text as an example, the present embodiment encodes the characters of the English text according to the coding rule to form the DNA storage sequences.

In this embodiment, the step of encoding the original text to obtain the DNA storage sequence specifically includes steps S011 to S013:

S011, generating a coding base sequence according to a coding rule and the characters in the original text, and generating an index value according to the coding base sequence;

S012, generating byte check codes according to the characters in the original text;

S013, constructing a DNA storage sequence from the index value, the byte check codes and the text data consisting of the coding base sequence.

Specifically, the characters appearing in the English text are encoded in sequence according to the character coding rule, and the coding base sequence of every M (M > 0) text characters forms one data storage unit. A Reed-Solomon (RS) code is applied to the M text characters to generate t check bytes. According to the order in which the data storage units are generated, a sequence number of n decimal digits is encoded into the corresponding base sequence and used as the index value (Index) of the data storage unit. A DNA storage sequence thus consists of an index part, an RS check code and a text data field.

Taking n = 5, t = 4 and M = 25 as an example, the structure of a DNA storage sequence is shown in FIG. 2.

In this example, the base sequences corresponding to the text characters are shown in table 1:

TABLE 1

In this example, the first part of the DNA storage sequence is the Index, which is also a base sequence and marks the position of the DNA storage line in the original encoded text file. Every 6 bases of the Index base sequence form one unit, and the units correspond to the n decimal digits. The numerical code table for each digit of the Index is shown in Table 2:

TABLE 2

In this embodiment, step S012 of generating the byte check codes according to the characters in the original text can be further subdivided into steps S012a and S012b:

S012a, encoding the characters in the original text with the RS code to obtain a binary string;

S012b, performing grouped base coding on the binary string to obtain the byte check codes.

Specifically, in this embodiment, after the characters of the original English text are converted into a binary string by the RS check, the binary string is divided into groups of 4 bits, and each 4-bit group is encoded into bases according to Table 3.

TABLE 3

RS grouping	Encoding	RS grouping	Encoding	RS grouping	Encoding	RS grouping	Encoding
0000	GTGT	0100	CACA	1000	TCAC	1100	ACTC
0001	GATG	0101	GTTC	1001	TACC	1101	AGCT
0010	AGAC	0110	TGGT	1010	GAGA	1110	TCGA
0011	CTTG	0111	CAGT	1011	GAAC	1111	TGCA
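The grouped base coding of Table 3 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names `encode_check_bytes` and `decode_check_bases` are hypothetical, the check bytes are assumed to have been produced already by an RS encoder that is not shown, and taking the high 4 bits of each byte before the low 4 bits is an assumption.

```python
# Table 3: each 4-bit RS group maps to a 4-base codeword.
TABLE3 = {
    "0000": "GTGT", "0001": "GATG", "0010": "AGAC", "0011": "CTTG",
    "0100": "CACA", "0101": "GTTC", "0110": "TGGT", "0111": "CAGT",
    "1000": "TCAC", "1001": "TACC", "1010": "GAGA", "1011": "GAAC",
    "1100": "ACTC", "1101": "AGCT", "1110": "TCGA", "1111": "TGCA",
}
TABLE3_INV = {bases: bits for bits, bases in TABLE3.items()}


def encode_check_bytes(check: bytes) -> str:
    """Encode check bytes as bases, one 4-bit Table 3 group at a time."""
    bits = "".join(f"{b:08b}" for b in check)  # assumption: high nibble first
    return "".join(TABLE3[bits[i:i + 4]] for i in range(0, len(bits), 4))


def decode_check_bases(bases: str) -> bytes:
    """Invert encode_check_bytes for an error-free base string."""
    bits = "".join(TABLE3_INV[bases[i:i + 4]] for i in range(0, len(bases), 4))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


encoded = encode_check_bytes(bytes([0x0F, 0xA5]))
print(encoded)
print(decode_check_bases(encoded) == bytes([0x0F, 0xA5]))
```

Each byte becomes 8 bases under this table, so t = 4 check bytes occupy 32 bases.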

Under the coding rule of this embodiment, that is, coding with the relationships given in Tables 1, 2 and 3, the length of a DNA storage sequence is fixed at L = n × 5 + 8 × t + M × 4. If the number of characters l (l > 0) of English text encoded in a storage sequence is smaller than M, the remaining base sequence units of the storage sequence may be filled with the base sequences corresponding to (M - l) space characters.
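The division of the text into fixed-size data storage units can be sketched as follows. This is an illustration under stated assumptions: the function name `split_into_units` is hypothetical, numbering the units from 0 is an assumption, and the base-level encoding is omitted because Tables 1 and 2 are not reproduced in this text.

```python
def split_into_units(text: str, m: int = 25, n: int = 5) -> list[tuple[str, str]]:
    """Split text into (index, payload) units of exactly m characters.

    The index is the unit's sequence number written with n decimal digits;
    the last payload is padded with space characters up to m characters.
    """
    units = []
    for seq_no, start in enumerate(range(0, len(text), m)):
        if seq_no >= 10 ** n:
            raise ValueError("text needs more units than an n-digit index can address")
        payload = text[start:start + m].ljust(m)  # pad the final unit with spaces
        units.append((f"{seq_no:0{n}d}", payload))
    return units


text = "DNA storage keeps text compact and durable."
units = split_into_units(text)
for index, payload in units:
    print(index, repr(payload))
```

With M = 25 and n = 5, the 43-character example yields two units, the second padded with spaces to 25 characters.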

S02, synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence.

Specifically, the DNA storage sequences obtained in step S01 are synthesized, amplified and stored. In the synthesis process, deoxynucleotides are chemically joined one by one in the preset nucleotide order of the DNA storage sequence to synthesize a DNA strand, namely the DNA molecule sequence. In the amplification process, multiple copies of the DNA molecule sequence are generated; in this embodiment, the DNA molecule sequence is amplified by PCR (polymerase chain reaction). PCR is a molecular biology technique for amplifying a specific DNA fragment and can be regarded as a special form of DNA replication in vitro; its most notable feature is that a trace amount of DNA can be greatly increased. In this embodiment, the PCR process is divided into three steps: 1) denaturation (90 ℃-96 ℃): the hydrogen bonds of the double-stranded DNA template are broken by heat to form single-stranded DNA; 2) annealing (60 ℃-65 ℃): the temperature of the system is lowered, and the primers bind to the DNA template to form local double strands; 3) extension (70 ℃-75 ℃): under the action of Taq polymerase (whose activity is optimal at about 72 ℃), with dNTPs as raw material, a DNA strand complementary to the template is synthesized by extending from the 3'-end of the primer in the 5' → 3' direction. After each cycle of denaturation, annealing and extension, the DNA content is doubled.

Further, several DNA molecule sequences obtained after amplification are stored, for example, in a DNA molecule database.

S03, acquiring the stored DNA molecule sequence, and transcoding to obtain the original text. The transcoding to obtain the original text comprises steps S031-S032:

S031, sequencing the stored DNA molecule sequence to obtain the reads (read lengths) of the DNA molecule sequence;

S032, preprocessing the reads, removing noise data from the reads, and transcoding the preprocessed reads to obtain the original text.

Specifically, the stored DNA molecule sequences are first sequenced. DNA sequencing means analyzing the base sequence of a specific DNA fragment, that is, the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). For example, a second-generation or third-generation sequencer is used for sequencing, and the result file output by the sequencer consists of reads, where a read is the sequencer's determination of the base composition of a DNA sequence molecule, namely the read length.

In step S032, before the reads obtained by sequencing are decoded to restore the English text characters, data preprocessing is required. The data preprocessing mainly consists of deleting low-quality reads, that is, processing the noise data in the reads, and includes: deleting reads whose insertion or deletion errors cannot be corrected, and correcting reads whose insertion or deletion errors can be corrected. After data preprocessing, the reads are decoded and subjected to RS error correction and multi-sequence error correction, and the originally encoded English text is then restored by using a word error correction technique.

More specifically, in an embodiment, the process of preprocessing reads includes at least one of the following steps:

1) Each 'N' base in the reads is replaced with the base 'A'. An 'N' character means that the sequencer could not accurately determine the specific base at that position and output 'N' in its place.

2) Low-quality reads, i.e., reads in which four consecutive bases have a Phred quality score of less than 20, are deleted. The quality value of each base in a read reflects the accuracy of the base call. When the length of one coding unit determined by the coding rule in step S01 is 6 and the Index value occupies 5 coding units, a read in which 6 consecutive bases have low Phred values is judged in the preprocessing to be of poor quality and is deleted.

3) Reads with too few bases, i.e., reads with a length less than (L - 5), are deleted.

4) Reads with too many bases, i.e., reads with a length greater than (L + 5), are deleted.
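Steps 1) to 4) can be sketched as a simple filter. This is a minimal illustration, not the implementation of the embodiment: the function names, the representation of a read as a (sequence, quality list) pair, and the parameter names are assumptions. For orientation, a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means roughly a 1% chance that the base call is wrong.

```python
def has_low_quality_run(quals, min_q=20, run=4):
    """Return True if `run` consecutive bases all have Phred quality below min_q."""
    streak = 0
    for q in quals:
        streak = streak + 1 if q < min_q else 0
        if streak >= run:
            return True
    return False


def preprocess(reads, expected_len, min_q=20, run=4, tolerance=5):
    """Apply steps 1)-4) to (sequence, qualities) pairs; return the kept sequences."""
    kept = []
    for seq, quals in reads:
        seq = seq.replace("N", "A")                  # step 1: undetermined base -> 'A'
        if has_low_quality_run(quals, min_q, run):   # step 2: low-quality read
            continue
        if len(seq) < expected_len - tolerance:      # step 3: too few bases
            continue
        if len(seq) > expected_len + tolerance:      # step 4: too many bases
            continue
        kept.append(seq)
    return kept


# Toy example with an expected length of 8 bases (real sequences are much longer).
reads = [
    ("ACGTNCGT", [30] * 8),                          # kept, 'N' replaced by 'A'
    ("ACGTACGT", [30, 30, 10, 10, 10, 10, 30, 30]),  # dropped: 4 low-quality bases in a row
    ("AC", [30, 30]),                                # dropped: shorter than 8 - 5
    ("ACGT" * 4, [30] * 16),                         # dropped: longer than 8 + 5
]
kept = preprocess(reads, expected_len=8)
print(kept)
```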

In some optional embodiments, the preprocessing of the reads may further include step 5):

5) Reads with a length between (L - 2) and (L + 2) that contain insertion/deletion errors are corrected. For an erroneous base unit, the character with the minimum Hamming distance is taken as its decoded character.

For example, the complete process for correcting the insertion/deletion errors of a read is:

a) Set a sliding window whose size is a given number of decoding units, for example 2.

b) Take windows of decoding units from the read in sequence from left to right, until the base sequence of a complete decoding unit can no longer be taken out. For each coding unit in the window, calculate its minimum Hamming distance to the decoding units of the coding table, which gives the minimum Hamming distance list of the window. If all elements of the list are greater than or equal to 2, an insertion or deletion has occurred in the first decoding unit, and step c) is executed; otherwise, step b) is repeated.

c) Insert a suitable character at, or delete a character from, each base position of the first coding unit in turn. The insertion or deletion must satisfy the following conditions: condition 1) the resulting unit decodes to the character whose coding unit has the smallest Hamming distance according to the coding table; condition 2) after the insertion or deletion, every element of the minimum Hamming distance list of the sliding window is less than 2. Otherwise, step b) is executed.

d) After insertion/deletion correction, delete the reads whose length is not equal to L.
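The minimum Hamming distance decoding and the window test of step b) can be sketched as follows. This is an illustration under stated assumptions: because the character coding table (Table 1) is not reproduced in this text, the 4-base codewords of Table 3 are used as a stand-in code table, all function names are hypothetical, and ties between equally close codewords are broken by table order.

```python
# Stand-in code table: the 4-base codewords of Table 3.
CODEWORDS = {
    "GTGT": "0000", "GATG": "0001", "AGAC": "0010", "CTTG": "0011",
    "CACA": "0100", "GTTC": "0101", "TGGT": "0110", "CAGT": "0111",
    "TCAC": "1000", "TACC": "1001", "GAGA": "1010", "GAAC": "1011",
    "ACTC": "1100", "AGCT": "1101", "TCGA": "1110", "TGCA": "1111",
}
UNIT = 4  # bases per unit in this stand-in table


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(unit):
    """Return (decoded symbol, distance) for the closest codeword."""
    best = min(CODEWORDS, key=lambda cw: hamming(cw, unit))
    return CODEWORDS[best], hamming(best, unit)


def window_min_distances(read, start, window=2):
    """Minimum Hamming distance of each complete unit in the window at `start`."""
    units = [read[i:i + UNIT] for i in range(start, start + window * UNIT, UNIT)]
    return [nearest(u)[1] for u in units if len(u) == UNIT]


def indel_suspected(read, start, window=2):
    """Step b): every unit in the window is at distance >= 2 from the table."""
    dists = window_min_distances(read, start, window)
    return len(dists) == window and all(d >= 2 for d in dists)


clean = "GTGTCACATGCA"   # three valid codewords
shifted = clean[1:]      # first base deleted, so every later unit is shifted
print(nearest("CACG"))
print(window_min_distances(clean, 0), indel_suspected(clean, 0))
print(window_min_distances(shifted, 0), indel_suspected(shifted, 0))
```

A single substitution usually leaves the neighbouring units at distance 0, whereas an insertion or deletion shifts the frame and raises the distance of every unit in the window, which is what the test in step b) exploits.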

In this embodiment, the process of transcoding in step S032 to obtain the original text can be further subdivided into steps S032a-S032c:

S032a, obtaining the preprocessed reads, and working backwards according to the coding rule to obtain decoded character lines;

S032b, correcting errors in the decoded character lines to obtain decoded text character lines;

S032c, obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.

Specifically, for the reads obtained through preprocessing, every 6 consecutive bases are first taken as one coding unit, and the character corresponding to each coding unit is obtained according to the character coding table, the index coding table and the RS grouping coding table adopted in step S01, which yields the decoded character line corresponding to each read. Then, RS error correction is performed on the decoded character line of each read, generating a decoded text character line that consists only of the index information spliced with the error-corrected text string. The decoded text character lines are grouped according to their index values and text content. The real text line corresponding to each group, that is, the original text line, is decoded according to the multiplicity principle and put into a set T, and the decoded text character lines in T are sorted by index value. Finally, the index values of the decoded text character lines in T are removed in turn, and the character strings of the text data field are output to a decoded character file.

In this embodiment, in step S032c, the process of obtaining several groups according to the decoded text character lines and the text content can be further subdivided into steps S032c1-S032c2:

S032c1, dividing the decoded text character lines according to their index values to obtain a plurality of groups, and determining the text similarity of the members in each group;

S032c2, performing secondary division on the members in the groups according to the text similarity, wherein the secondary division comprises at least one of steps A-C:

A. According to a preset first threshold, adding the members whose text similarity is smaller than the first threshold to other groups.

B. Determining the average value of the text similarity, and deleting group members according to the average value.

C. Clustering the members that do not belong to any group according to the text similarity to obtain new groups.

Specifically, as shown in FIG. 3, grouping is performed according to the index value of the decoded text character line and the text content. The preliminary grouping uses the index value.

After the preliminary grouping, for each decoded text character line in a group with fewer than 3 members, the text similarity between its text data field and the central member of each other group (with more than 3 members) is calculated. (The central member of a group is the member with the highest average text similarity to the other members of its group; it approximately represents the actual stored line corresponding to the group.) If the similarity is greater than a certain threshold (for example, 0.8), the decoded character line is deleted from its current group and delivered to the group with which its text similarity is highest. In this embodiment, the text similarity of two character strings s1 and s2 is calculated as follows: the two strings are aligned with a sequence alignment algorithm such as the Needleman-Wunsch algorithm, the number of positions at which the aligned strings have the same character is counted, and the count is divided by the maximum of the lengths of s1 and s2; the quotient is the text similarity of the two strings.
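The text similarity described above can be sketched with a small Needleman-Wunsch implementation. This is a minimal illustration: the function names are hypothetical, and the scoring values (match 1, mismatch -1, gap -1) are assumptions, since the description does not specify them.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Globally align s1 and s2; return two aligned lists (None marks a gap)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append(None); i -= 1
        else:
            a1.append(None); a2.append(s2[j - 1]); j -= 1
    return a1[::-1], a2[::-1]


def text_similarity(s1, s2):
    """Identical aligned positions divided by the longer string's length."""
    if not s1 and not s2:
        return 1.0
    a1, a2 = needleman_wunsch(s1, s2)
    same = sum(x is not None and x == y for x, y in zip(a1, a2))
    return same / max(len(s1), len(s2))


print(text_similarity("abcd", "abd"))
print(text_similarity("hello", "hello"))
```

For "abcd" and "abd" the alignment places a gap opposite "c", leaving 3 identical positions out of a maximum length of 4.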

Within each group, members whose text similarity to the other members of the group differs greatly are deleted according to the text similarity principle. The text similarity between a member and the other members of its group is specifically the mean of the text similarities between that member and each of the other members.

Groups whose representative unique decoded character line has an illegal Index value are deleted. An Index value is judged legal as follows: the Index value is compared with the number of DNA storage sequences generated when the text file was encoded; if the Index value is smaller, it is legal, and otherwise it is illegal.

The decoded text character lines that have not been assigned to a group are clustered according to text similarity, and clusters whose corresponding decoded text character line has an illegal Index value are deleted. Each remaining unassigned decoded text character line is delivered to the group with which its text similarity is highest.
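The index-based grouping, the Index legality check and the selection of a group's central member can be sketched as follows. This is an illustration under stated assumptions: all function names are hypothetical, decoded lines are represented as (index, text) pairs, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in similarity measure in place of the alignment-based similarity of the embodiment.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in text similarity in [0, 1]; not the alignment-based measure."""
    return SequenceMatcher(None, a, b).ratio()


def group_by_index(lines, total_sequences):
    """Group (index, text) lines by index, dropping illegal Index values.

    An Index value is legal only if it is smaller than the number of DNA
    storage sequences produced at encoding time.
    """
    groups = defaultdict(list)
    for index, text in lines:
        if index < total_sequences:
            groups[index].append(text)
    return dict(groups)


def mean_similarity(pos, texts):
    """Mean similarity between texts[pos] and every other member of the group."""
    others = [t for k, t in enumerate(texts) if k != pos]
    if not others:
        return 1.0
    return sum(similarity(texts[pos], t) for t in others) / len(others)


def central_member(texts):
    """Member with the highest mean similarity to the rest of its group."""
    best = max(range(len(texts)), key=lambda k: mean_similarity(k, texts))
    return texts[best]


lines = [
    (0, "hello world"), (0, "hello world"), (0, "hellp world"),
    (1, "second line"), (7, "garbage"),
]
groups = group_by_index(lines, total_sequences=2)
print(groups)
print(central_member(groups[0]))
```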

In this embodiment, in step S032c, the process of decoding the groups to obtain the original text can also be subdivided into steps S032c3-S032c5:

S032c3, determining the weight values of the characters in the decoded text character lines in the group;

S032c4, determining a unique length value of the group, so that the length value of each decoded text character line in the group is the same as the unique length value;

S032c5, determining the characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining them to obtain the original text.

Specifically, according to the multiplicity principle, the real text lines corresponding to the obtained groups are decoded and put into a set T, and the decoded text character lines in T are sorted by index value. The unique decoded character line represented by a group is determined by the multiplicity principle in the following steps:

First, an initial weight value is calculated for each decoded text character line of the group. The weight rule in this embodiment is: (number of English letters decoded correctly in the decoded text character line) / (number of all English letters decoded in the decoded text character line).

Next, the unique length value of the group is determined: it is the most frequently occurring length among the decoded text character lines in the group.

If the number of members of the group is less than τ (e.g., τ is 3), the spelling of the words in each decoded text character line of the group is checked and corrected individually. If the number of members of the group is less than τ, each decoded text character line whose length is not equal to the length θ of the unique decoded character line of the group is aligned with any decoded text character line of length θ in the group, and is then expanded or contracted as appropriate.

Finally, the character of each column of the unique decoded line of the group is calculated in turn from the data of the decoded text character lines in the group. The rule in this embodiment is: for each column, determine the characters appearing in that column across all lines, calculate for each such character the sum of the weight values of the lines in which it appears, and select the character with the largest weight sum.
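The column-wise weighted vote can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the line weights are taken as given inputs rather than computed from a dictionary check, and lines whose length differs from the unique length value are simply left out of the vote here, whereas the embodiment first aligns and adjusts them.

```python
from collections import Counter, defaultdict


def unique_length(lines):
    """Most frequently occurring line length in the group."""
    return Counter(len(line) for line in lines).most_common(1)[0][0]


def weighted_consensus(lines, weights):
    """Pick, for each column, the character with the largest summed weight."""
    theta = unique_length(lines)
    voters = [(line, w) for line, w in zip(lines, weights) if len(line) == theta]
    result = []
    for col in range(theta):
        totals = defaultdict(float)
        for line, w in voters:
            totals[line[col]] += w   # each line votes with its own weight
        result.append(max(totals, key=totals.get))
    return "".join(result)


group = ["the cat", "the cat", "tha cat", "the ca"]
weights = [1.0, 1.0, 0.5, 1.0]
print(weighted_consensus(group, weights))
```

In the example, the third line votes for "a" in column 3 with weight 0.5 and is outvoted by the two lines voting for "e" with a combined weight of 2.0.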

In summary, the implementation process of this embodiment is as follows: each character appearing in the English text is encoded in sequence according to the coding rule, and an index value is added to the base sequence of every M original text characters, yielding a series of DNA storage sequences. The DNA storage sequences are synthesized into base sequences, which are stored biologically, amplified and sequenced, and data cleaning is performed on each read in the sequencing file. After decoding, RS error correction and multi-sequence error correction of the reads, the originally encoded English text is recovered by using a word error correction technique.

As shown in FIG. 4, the data indicate that in this embodiment the English text can be completely restored at a sequencing depth of 25 when the error rate is 0.01, 0.02 or 0.05; when the error rate is 0.1, the original English text can be completely restored at a sequencing depth of 45.

From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages over the prior art:

the technical solution improves storage efficiency, makes full use of the semantic information in the original text during transcoding and decoding, and has strong query processing capability.

In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.

Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.

If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A text storage method based on a DNA storage medium, characterized by comprising the following steps:

acquiring an original text, and encoding the original text to obtain a DNA storage sequence;

synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence;

obtaining the stored DNA molecule sequence, and transcoding it to obtain the original text;

wherein the transcoding to obtain the original text comprises the following steps:

sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;

preprocessing the read length to remove noise data in the read length, and transcoding the preprocessed read length to obtain the original text.

2. The text storage method based on a DNA storage medium according to claim 1, characterized in that the step of acquiring the original text and encoding the original text to obtain the DNA storage sequence comprises:

generating an encoded base sequence according to an encoding rule and the characters in the original text, and generating an index value according to the encoded base sequence;

generating a byte check code according to the characters in the original text;

constructing the DNA storage sequence from the index value, the byte check code, and the text data composed of the encoded base sequence.

3. The text storage method based on a DNA storage medium according to claim 2, characterized in that the step of generating the byte check code according to the characters in the original text comprises:

encoding the characters in the original text to obtain a binary string;

performing grouped base encoding on the binary string to obtain the byte check code.
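As an illustration only, the two steps of claim 3 can be sketched as follows. The claims do not specify the bit-to-base table or how the check value is derived, so the 2-bit mapping `BASE_FOR_BITS`, the use of UTF-8, and the XOR checksum in `byte_check_code` are all assumptions made for this sketch, not the patented scheme.

```python
# Hypothetical 2-bits-per-base table; the claims do not fix a mapping.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}


def to_binary_string(data: bytes) -> str:
    """Encode each byte as 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bases(bits: str) -> str:
    """Grouped base encoding: split the bit string into 2-bit groups
    and map each group to one base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def byte_check_code(text: str) -> str:
    """Assumed check code: XOR of all UTF-8 bytes, written as 4 bases."""
    checksum = 0
    for byte in text.encode("utf-8"):
        checksum ^= byte
    return bits_to_bases(to_binary_string(bytes([checksum])))
```

With this assumed table, one byte always becomes four bases, so the check code has a fixed length regardless of the length of the text line.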

4. The text storage method based on a DNA storage medium according to claim 1, characterized in that the step of preprocessing the read length, removing the noise data in the read length, and transcoding the preprocessed read length to obtain the original text comprises:

obtaining the preprocessed read length, and inferring in reverse according to the encoding rule to obtain a decoded character line;

performing error correction on the decoded character line to obtain a decoded text character line;

obtaining several groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.

5. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of preprocessing the read length, removing the noise data in the read length, and transcoding the preprocessed read length to obtain the original text further comprises:

for erroneous bases in the read length, determining the character with the minimum Hamming distance as the decoded character of the erroneous bases.
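A minimal sketch of minimum-Hamming-distance decoding, assuming a fixed-length codebook that maps each character to a base codeword. The codebook contents and the function names are hypothetical and serve only to illustrate the selection rule described in claim 5.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def nearest_character(fragment: str, codebook: dict[str, str]) -> str:
    """Return the character whose codeword is closest to the fragment.
    Ties are resolved in favour of the first codebook entry."""
    return min(codebook, key=lambda ch: hamming_distance(fragment, codebook[ch]))


# Hypothetical codebook, for illustration only.
EXAMPLE_CODEBOOK = {"a": "ACGT", "b": "TGCA", "c": "GGGG"}
```

For example, the fragment `ACGA` differs from the codeword of `a` in one position and from the other two codewords in three positions, so it decodes to `a`.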

6. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of obtaining several groups according to the decoded text character lines and the text content comprises:

obtaining several groups according to the index values of the decoded text character lines, and determining the text similarity of the group members;

performing a secondary division on the group members according to the text similarity, the secondary division comprising at least one of the following steps:

adding, according to a preset first threshold, members whose text similarity is less than the first threshold to other groups;

determining the mean value of the text similarity, and deleting group members according to the mean value;

clustering, according to the text similarity, the members that do not belong to the group to obtain a new group.
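The index-based grouping and the threshold test of claim 6 might be sketched as below. The claims do not define the similarity measure, so this sketch assumes the ratio from Python's `difflib.SequenceMatcher`, scores each member by its mean similarity to the other members of its group, and simply returns the low-scoring members so that a caller could reassign or cluster them. All function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed text similarity in [0, 1]; the claims do not fix a measure."""
    return SequenceMatcher(None, a, b).ratio()


def group_by_index(lines: list[tuple[int, str]]) -> dict[int, list[str]]:
    """First division: collect decoded lines that share an index value."""
    groups: dict[int, list[str]] = defaultdict(list)
    for index, text in lines:
        groups[index].append(text)
    return dict(groups)


def split_by_threshold(members: list[str], threshold: float):
    """Secondary division: keep members whose mean similarity to the rest
    of the group reaches the threshold; return the others separately."""
    kept, outliers = [], []
    for i, member in enumerate(members):
        others = members[:i] + members[i + 1:]
        mean_sim = (
            sum(similarity(member, other) for other in others) / len(others)
            if others else 1.0  # a lone member is kept by convention
        )
        (kept if mean_sim >= threshold else outliers).append(member)
    return kept, outliers
```

The returned outliers correspond to the members that, under the claim, would be added to other groups or clustered into a new group.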

7. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of decoding the groups to obtain the original text comprises:

determining the weight values of the characters in the decoded text character lines in a group;

determining a unique length value of the group, so that the length value of each decoded text character line in the group is the same as the unique length value;

determining the characters of the original text according to the decoded text character lines having the same length value and the weight values of the characters, and combining them to obtain the original text.