CN112582030A - Text storage method based on DNA storage medium - Google Patents
- Publication number
- CN112582030A (application CN202011508358.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- sequence
- dna
- original text
- decoded
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a text storage method based on a DNA storage medium, comprising the following steps: acquiring an original text and encoding it to obtain a DNA storage sequence; synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence; and obtaining the stored DNA molecule sequence and transcoding it to recover the original text. The transcoding comprises: sequencing the stored DNA molecule sequence to obtain its reads (read lengths); and preprocessing the reads to remove noise data, then transcoding the preprocessed reads to obtain the original text. The method converts the stored DNA molecule sequence directly from its sequencing reads, dispenses with much of the redundant coding, and improves storage efficiency; it makes full use of the semantic information in the original text during transcoding and decoding, has strong query processing capability, and can be widely applied in the technical field of systems biology research.
Description
Technical Field
The invention relates to the technical field of systems biology research, and in particular to a text storage method based on a DNA storage medium.
Background
With the development of distributed computing, cloud computing and Internet of Things technologies, the total amount of data generated by human beings every day is growing exponentially. Traditional magnetic, optical, electrical and other storage technologies cannot meet the storage demands of this exponential growth in the future. In addition, semiconductor-based general-purpose processors (CPUs) and application-specific integrated circuits (ASICs) face persistent difficulties in power consumption, size, reliability and the like. The search for new ways of storing information has therefore become a key fundamental problem for the sustainable development of information technology. As the carrier of genetic information, DNA molecules offer high density, small volume, good storage stability and low energy consumption for storage, and may be integrated with biological computing to realize a novel data processing mode that combines storage and computation. The general procedure for DNA storage is as follows: a binary file in the computer is first encoded into a base sequence; synthesis, amplification and sequencing are then carried out; and the original information is recovered from the base sequence. However, because DNA strands are prone to base deletion, insertion and substitution errors during synthesis, storage and sequencing, most current research adds a large amount of redundant coding to the original input information: for example, an inner code addresses base errors within a sequence, and an outer code addresses the loss of entire sequences. While the prior art has its own advantages, its disadvantages are also apparent: storage efficiency is low, the decoding process is complex, semantic information is not utilized, and information query processing capability is poor.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, embodiments of the present invention provide a text storage method based on a DNA storage medium, which can achieve convenient and efficient text indifference storage.
In a first aspect, the present invention provides a text storage method based on a DNA storage medium, comprising the steps of:
acquiring an original text, and coding the original text to obtain a DNA storage sequence;
synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence;
obtaining a stored DNA molecule sequence, and transcoding to obtain the original text;
the transcoding to obtain the original text comprises the following steps:
sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;
and preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text.
In a possible embodiment of the present disclosure, the step of obtaining an original text, and encoding the original text to obtain a DNA storage sequence includes:
generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
generating byte check codes according to characters in the original text;
and constructing the DNA storage sequence according to the index value, the byte check code and the text data consisting of the coding base sequence.
In a possible embodiment of the present disclosure, the step of generating a byte check code according to characters in the original text includes:
coding characters in the original text through the Reed-Solomon code to obtain a binary character string;
and carrying out grouped base coding according to the binary character string to obtain the byte check code.
In a possible embodiment of the present disclosure, the step of preprocessing the read length, removing noise data from the read length, and transcoding the preprocessed read length to obtain the original text includes:
acquiring the read length, and decoding it in reverse according to the coding rule to obtain a decoded character line;
correcting the error of the decoded character line to obtain a decoded text character line;
and obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.
In a possible embodiment of the present application, the step of preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text further includes:
and, for an erroneous base unit in a read, determining the character with the minimum Hamming distance as the decoded character of that unit.
In a possible embodiment of the present disclosure, the step of obtaining a plurality of packets according to the decoded text character lines and the text content includes:
dividing according to the index value of the decoded text character line to obtain a plurality of groups, and determining the text similarity of group members;
performing secondary division on the group members according to the text similarity, wherein the secondary division comprises at least one of the following steps:
adding the members with the text similarity smaller than the first threshold value to other groups according to a preset first threshold value;
determining the average value of the text similarity, and deleting the group members according to the average value;
and clustering the members which do not belong to the group according to the text similarity to obtain a new group.
In a possible embodiment of the present disclosure, the step of decoding the packet to obtain an original text includes:
determining a weight value for a character in the decoded text character line in the packet;
determining a unique length value for the packet such that a length value of a decoded text character line in the packet is the same as the unique length value;
and determining characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining to obtain the original text.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
The method encodes an original text into a base sequence, synthesizes and amplifies the base sequence, stores the amplified DNA molecule sequence, sequences the stored DNA molecule sequence to obtain its reads (read lengths), deletes noisy reads, and recovers the original text from the remaining reads. The method converts the stored DNA molecule sequence directly from its sequencing reads, dispenses with much of the redundant coding, improves storage efficiency, makes full use of the semantic information in the original text during transcoding and decoding, and has strong query processing capability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a method for storing text based on a DNA storage medium according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a DNA storage sequence according to an embodiment;
FIG. 3 is a flowchart illustrating the grouping steps according to the decoded text character lines and the text content in the embodiment;
FIG. 4 is a histogram showing the accuracy of restoring English text under different error rates and sequencing depths.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In a first aspect, as shown in FIG. 1, the present application provides a text storage method based on a DNA storage medium, comprising steps S01-S03:
S01, acquiring the original text, and coding the original text to obtain a DNA storage sequence.
Taking English text as an example, this embodiment encodes the characters of the English text according to the encoding rule to form the DNA storage sequence.
In this embodiment, the step of encoding the original text to obtain the DNA storage sequence specifically includes steps S011 to S013:
S011, generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
S012, generating byte check codes according to characters in the original text;
S013, constructing a DNA storage sequence according to the index value, the byte check code and the text data consisting of the coding base sequence.
Specifically, the characters appearing in the English text are encoded in sequence according to a character coding rule, and the coding base sequence of every M (M > 0) text characters forms one storage data unit. A Reed-Solomon (RS) code is applied to the M text characters to generate a t-byte check code. The sequence number of each storage data unit, in order of generation, is written as n decimal digits and encoded as a base sequence that serves as the unit's index value (Index). A DNA storage sequence thus consists of an index value part, an RS check code and a text data field.
Taking n = 5, t = 4 and M = 25 as an example, the structure of a DNA storage sequence is shown in FIG. 2.
In this example, the base sequences corresponding to the text characters are shown in table 1:
TABLE 1
In this example, the first part of the DNA storage sequence is the Index, which is itself a base sequence and marks the position of the DNA storage line in the original encoded text file. Every 6 bases of the Index base sequence form one unit that corresponds to one decimal digit, for n digits in total. The code table for the digits of the Index is shown in Table 2:
TABLE 2
In this embodiment, step S012 of generating the byte check code according to the characters in the original text can be further subdivided into steps S012a and S012b:
S012a, encoding the characters in the original text with the Reed-Solomon code to obtain a binary character string;
S012b, performing grouped base coding on the binary character string to obtain the byte check code.
Specifically, in this embodiment, after the characters of the original English text are converted into a binary character string by RS check coding, the string is divided into groups of 4 bits, and each 4-bit group is base-coded according to Table 3.
TABLE 3
RS grouping | Encoding | RS grouping | Encoding | RS grouping | Encoding | RS grouping | Encoding |
0000 | GTGT | 0100 | CACA | 1000 | TCAC | 1100 | ACTC |
0001 | GATG | 0101 | GTTC | 1001 | TACC | 1101 | AGCT |
0010 | AGAC | 0110 | TGGT | 1010 | GAGA | 1110 | TCGA |
0011 | CTTG | 0111 | CAGT | 1011 | GAAC | 1111 | TGCA |
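The grouped base coding of Table 3 can be sketched in Python as follows. This is only an illustration: the mapping is copied from Table 3, while the function names and the assumption that the RS check bytes have already been computed elsewhere are hypothetical and not part of the described method.

```python
# Table 3 from the text: each 4-bit RS group maps to a 4-base codeword.
RS_GROUP_TO_BASES = {
    "0000": "GTGT", "0100": "CACA", "1000": "TCAC", "1100": "ACTC",
    "0001": "GATG", "0101": "GTTC", "1001": "TACC", "1101": "AGCT",
    "0010": "AGAC", "0110": "TGGT", "1010": "GAGA", "1110": "TCGA",
    "0011": "CTTG", "0111": "CAGT", "1011": "GAAC", "1111": "TGCA",
}
BASES_TO_RS_GROUP = {bases: bits for bits, bases in RS_GROUP_TO_BASES.items()}


def encode_check_bytes(check_bytes: bytes) -> str:
    """Split the check bytes into 4-bit groups and base-code each group.

    The check bytes are assumed to come from an RS encoder that is not
    shown here (hypothetical input).
    """
    bits = "".join(f"{byte:08b}" for byte in check_bytes)
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(RS_GROUP_TO_BASES[group] for group in groups)


def decode_check_bases(bases: str) -> bytes:
    """Inverse mapping for an error-free base string (8 bases per byte)."""
    if len(bases) % 8 != 0:
        raise ValueError("expected a multiple of 8 bases")
    bits = "".join(BASES_TO_RS_GROUP[bases[i:i + 4]]
                   for i in range(0, len(bases), 4))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

Each check byte yields two 4-bit groups and therefore 8 bases, which matches the 8 × t term in the sequence length.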
Under the coding rule of this embodiment, i.e. coding with the relationships given in Tables 1, 2 and 3, the length of the DNA storage sequence is fixed, with length L = n × 5 + 8 × t + M × 4. If the number of characters l (l > 0) of encoded English text in a storage sequence is smaller than M, the remaining base sequence units of the storage sequence may be filled with the base sequences corresponding to (M - l) space characters.
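As a worked illustration of the fixed length and the space padding, the following Python sketch evaluates the length expression exactly as it is written in the text and pads a short text unit to M characters. The function names are hypothetical.

```python
def storage_sequence_length(n: int, t: int, m: int) -> int:
    """Evaluate the length expression as stated in the text: n*5 + 8*t + M*4."""
    return n * 5 + 8 * t + m * 4


def pad_text_unit(chars: str, m: int) -> str:
    """Pad a text unit with space characters so that it holds exactly M characters."""
    if not 0 < len(chars) <= m:
        raise ValueError("a text unit must hold between 1 and M characters")
    return chars + " " * (m - len(chars))


# Example values from the embodiment: n = 5, t = 4, M = 25.
example_length = storage_sequence_length(5, 4, 25)  # 25 + 32 + 100 = 157
padded_unit = pad_text_unit("DNA storage", 25)
```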
S02, synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence.
Specifically, the DNA storage sequence obtained in step S01 is synthesized, amplified and stored. In synthesis, deoxynucleotides are joined one by one through chemical reactions in the preset nucleotide order given by the DNA storage sequence, producing a DNA strand, i.e. the DNA molecule sequence. In amplification, multiple copies of the DNA molecule sequence are generated; in this embodiment, amplification is performed by PCR (polymerase chain reaction). PCR is a molecular biology technique for amplifying a specific DNA fragment and can be regarded as a special form of DNA replication in vitro; its most notable feature is that it can greatly increase a trace amount of DNA. In this embodiment, the PCR process is divided into three steps: 1) denaturation (90 ℃-96 ℃): heat breaks the hydrogen bonds of the double-stranded DNA template, forming single-stranded DNA; 2) annealing (60 ℃-65 ℃): the temperature of the system is lowered and the primers bind to the DNA template to form local double strands; 3) extension (70 ℃-75 ℃): under the action of Taq enzyme (whose activity is optimal at about 72 ℃), with dNTPs as raw material, a DNA strand complementary to the template is synthesized by extending from the 3' end of the primer in the 5' → 3' direction. After each cycle of denaturation, annealing and extension, the DNA content doubles.
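The doubling per cycle can be illustrated with a short Python sketch. It assumes ideal amplification in which every cycle exactly doubles the copy number; real reactions fall short of this, and the function name is hypothetical.

```python
def copies_after_cycles(initial_copies: int, cycles: int) -> int:
    """Ideal PCR: the copy number doubles after every complete cycle."""
    if initial_copies < 0 or cycles < 0:
        raise ValueError("copy number and cycle count must be non-negative")
    return initial_copies * 2 ** cycles


# Under this idealization, one molecule becomes 2**30 copies after 30 cycles.
ideal_copies = copies_after_cycles(1, 30)
```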
Further, several DNA molecule sequences obtained after amplification are stored, for example, in a DNA molecule database.
S03, acquiring the stored DNA molecule sequence, and transcoding to obtain the original text. The transcoding to obtain the original text comprises steps S031-S032:
S031, sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;
S032, preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text.
Specifically, the stored DNA molecule sequence is first sequenced. DNA sequencing means determining the base sequence of a specific DNA fragment, i.e. the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). For example, a second-generation or third-generation sequencer is used, and the result file output by the sequencer consists of reads, where a read is the sequencer's determination of the base composition of a DNA sequence molecule, i.e. the read length referred to above.
In step S032, before the reads obtained by sequencing are decoded to restore the English text characters, data preprocessing is required. The preprocessing mainly deals with low-quality reads, i.e. noise data in the read lengths, and includes deleting reads whose insertion or deletion errors cannot be corrected and correcting reads whose insertion or deletion errors can be corrected. After preprocessing, decoding, RS error correction and multi-sequence error correction are performed on the remaining reads, and the original encoded English text is then restored using a word error correction technique.
More specifically, in an embodiment, the process of preprocessing reads includes at least one of the following steps:
1) Each 'N' base in the reads is replaced by the base 'A'. An 'N' character means that the sequencer could not determine the specific base at that position and output 'N' in its place.
2) Low-quality reads, i.e. reads in which four consecutive bases have a Phred quality score below 20, are deleted. The quality value of each base in a read reflects how accurately that base was identified. When the length of one coding unit determined by the coding rule in step S01 is 6 and the Index value occupies 5 coding units, a read in which 6 consecutive bases have low Phred values is judged in preprocessing to be of poor quality and is deleted.
3) Reads with too few bases, i.e. reads with a length less than (L-5), are deleted.
4) Reads with too many bases, i.e. reads with a length greater than (L+5), are deleted.
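Steps 1) to 4) can be sketched in Python as a single filter. This is a minimal illustration: it assumes Phred scores are available as a list of integers per read, uses the run length of four from step 2), and the function names are hypothetical.

```python
from typing import List, Optional


def has_low_quality_run(quals: List[int], threshold: int = 20, run: int = 4) -> bool:
    """Return True if `run` consecutive bases all have Phred scores below `threshold`."""
    streak = 0
    for q in quals:
        streak = streak + 1 if q < threshold else 0
        if streak >= run:
            return True
    return False


def preprocess_read(seq: str, quals: List[int], expected_len: int) -> Optional[str]:
    """Apply steps 1) to 4); return the cleaned read, or None if it is deleted."""
    if len(seq) != len(quals):
        raise ValueError("one quality score is required per base")
    if has_low_quality_run(quals):          # step 2): low-quality read
        return None
    if len(seq) < expected_len - 5:         # step 3): too few bases
        return None
    if len(seq) > expected_len + 5:         # step 4): too many bases
        return None
    return seq.replace("N", "A")            # step 1): replace 'N' with 'A'
```

Reads that survive this filter but whose length still differs from L would then go on to the optional insertion/deletion correction.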
In some optional embodiments, preprocessing the reads and deleting low-quality reads may further include step 5):
5) Reads whose length is between (L-2) and (L+2) and that contain insertion/deletion errors are corrected. For an erroneous base unit, the character with the minimum Hamming distance is taken as the decoded character of that unit.
For example, the complete process for correcting the insertion/deletion errors of reads is:
b) Coding units are taken from the reads from left to right in sequence until the base sequence corresponding to a decoding unit can no longer be taken out completely. According to the coding table, the minimum Hamming distance from each coding unit to the decoding units of the coding table is calculated, giving a minimum Hamming distance list. If all elements of the list are greater than or equal to 2, the first decoding unit contains an insertion or deletion, and step c) is executed; otherwise, step b) is repeated.
c) A suitable character is inserted or a character is deleted at each base of the first coding unit in turn, where the inserted or deleted character must satisfy the following conditions: condition 1) the resulting coding unit corresponds to the character with the minimum Hamming distance according to the coding table; condition 2) every element of the minimum Hamming distance list of the sliding window after the insertion or deletion is less than 2. Otherwise, step b) is executed.
d) Reads whose length is still not equal to L after insertion/deletion correction are deleted.
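The minimum Hamming distance check used in steps b) and c) can be sketched in Python as follows. The 6-base code table here is purely hypothetical, since Table 1 is not reproduced in the text, and the sketch covers only the distance list and the "all elements ≥ 2" test, not the full insert-and-retry loop.

```python
from typing import Dict, List, Tuple

# Hypothetical 6-base code table for illustration only (not the patent's Table 1).
TOY_CODE_TABLE: Dict[str, str] = {
    "ACGTAC": "a",
    "TGCATG": "b",
    "GATCGA": "c",
    "CTAGCT": " ",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def nearest_character(unit: str, table: Dict[str, str]) -> Tuple[str, int]:
    """Return the character whose codeword is closest to `unit`, with its distance."""
    best = min(table, key=lambda codeword: hamming(unit, codeword))
    return table[best], hamming(unit, best)


def min_distance_list(read: str, table: Dict[str, str], unit_len: int = 6) -> List[int]:
    """Minimum Hamming distance of every complete coding unit, left to right."""
    usable = len(read) - len(read) % unit_len
    return [nearest_character(read[i:i + unit_len], table)[1]
            for i in range(0, usable, unit_len)]


def suspect_indel(read: str, table: Dict[str, str], unit_len: int = 6) -> bool:
    """Flag a read when every unit is at distance >= 2 from the code table."""
    distances = min_distance_list(read, table, unit_len)
    return bool(distances) and all(d >= 2 for d in distances)
```

A single substitution leaves a unit at distance 1 from its codeword, whereas a dropped base shifts the frame and pushes the following units far from every codeword, which is what the threshold of 2 is detecting.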
In this embodiment, the process of transcoding in step S032 to obtain the original text can be further subdivided into steps S032a-S032c:
S032a, obtaining the preprocessed read length, and decoding it in reverse according to the coding rule to obtain a decoded character line;
S032b, correcting errors in the decoded character line to obtain a decoded text character line;
S032c, obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding according to the groups to obtain the original text.
Specifically, for the reads obtained through preprocessing, first, according to the character encoding table, the index encoding table and the RS grouping encoding table adopted in step S01, every 6 consecutive bases are taken as one encoding unit and the character corresponding to each unit is obtained, giving the decoded character line for each read. RS error correction is then performed on the decoded character line of each read, producing a decoded text character line that is the concatenation of the index information and the error-corrected text string only. The decoded text character lines are grouped by index value and text content. The real text line of each group, i.e. the original text line, is decoded according to the multiplicity principle and put into a set T, and the decoded text character lines in T are sorted by index value. Finally, the index values are removed from the decoded text character lines in T in order, and the text data field strings are output to a decoded character file.
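The layout-dependent parts of this pipeline, splitting a read into its fields and reassembling the text by index, can be sketched in Python. The sketch assumes the field order index, RS check code, text data, with 6 bases per index digit and 8 bases per check byte; the table lookups and RS correction are omitted, and all names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Fields(NamedTuple):
    index_bases: str
    check_bases: str
    text_bases: str


def split_read(read: str, n: int = 5, t: int = 4,
               index_unit: int = 6, bases_per_check_byte: int = 8) -> Fields:
    """Cut a read into index, RS check and text fields (assumed field order)."""
    index_len = n * index_unit
    check_len = t * bases_per_check_byte
    if len(read) < index_len + check_len:
        raise ValueError("read is too short to contain index and check fields")
    return Fields(read[:index_len],
                  read[index_len:index_len + check_len],
                  read[index_len + check_len:])


def assemble_text(decoded_lines: List[Tuple[int, str]]) -> str:
    """Sort the decoded lines of set T by index, drop the index, join the text."""
    ordered = sorted(decoded_lines, key=lambda line: line[0])
    return "".join(text for _, text in ordered)
```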
In this embodiment, the process in step S032c of obtaining several groups according to the decoded text character lines and the text content can be further subdivided into steps S032c1-S032c2:
S032c1, dividing the decoded text character lines according to their index values to obtain a plurality of groups, and determining the text similarity of the members of each group;
S032c2, performing secondary division on the group members according to the text similarity, wherein the secondary division comprises at least one of steps A-C:
A. Adding members whose text similarity is smaller than a preset first threshold to other groups.
B. Determining the average value of the text similarity, and deleting group members according to the average value.
C. Clustering the members that do not belong to any group according to the text similarity to obtain new groups.
Specifically, as shown in FIG. 3, grouping is performed according to the index value of the decoded text character line and the text content. The preliminary grouping uses the index value.
After the preliminary grouping, for each decoded text character line in a group with fewer than 3 members, the text similarity between its text data area and the central member of each other group (groups with more than 3 members) is calculated. The central member of a group is the member with the highest average text similarity to the other members of its group; it can approximately represent the actual storage line corresponding to the group. If the similarity exceeds a certain threshold (for example, 0.8), the decoded character line is deleted from its current group and delivered to the group with which its text similarity is highest. In this embodiment, the text similarity of two character strings is calculated as follows: the two character strings s1 and s2 are aligned using a sequence alignment algorithm such as the Needleman-Wunsch algorithm; the number of positions at which the aligned strings have the same character is counted; and this count is divided by the maximum of the lengths of s1 and s2. The quotient is the text similarity of the two character strings.
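The similarity calculation can be sketched in Python with a small Needleman-Wunsch alignment. The scoring values (match +1, mismatch -1, gap -1) are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(s1: str, s2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Global alignment; returns aligned character pairs, with None marking a gap."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if s1[i - 1] == s2[j - 1] else mismatch):
            pairs.append((s1[i - 1], s2[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((s1[i - 1], None))
            i -= 1
        else:
            pairs.append((None, s2[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def text_similarity(s1: str, s2: str) -> float:
    """Identical aligned positions divided by the longer string's length."""
    if not s1 and not s2:
        return 1.0
    same = sum(1 for a, b in needleman_wunsch(s1, s2) if a is not None and a == b)
    return same / max(len(s1), len(s2))
```

For example, aligning "HELLO" with "HELO" gives four identical positions and one gap, so the similarity is 4 / 5 = 0.8.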
For each group member, according to the principle of text similarityAnd deleting the members with larger text similarity difference with other members of the group. The text similarity between a member in the group and other members in the group is specifically the mean value of the text similarities between the member and other members in the group to which the member belongs.
Groups whose unique decoded character line carries an illegal Index value are deleted. Whether an Index value is legal is judged by comparing it with the number of DNA storage sequences generated when the text file was encoded: if the Index value is smaller than that number, it is legal; otherwise it is illegal.
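The legality check can be illustrated with a few lines of Python. The source only states that a legal Index value is smaller than the number of encoded DNA storage sequences. The lower bound of 0 and the representation of groups as a dictionary keyed by Index value are assumptions of this sketch.

```python
def is_legal_index(index, num_sequences):
    """Legal if smaller than the number of encoded DNA storage sequences.
    The lower bound of 0 is an assumption (indexes taken as non-negative)."""
    return 0 <= index < num_sequences


def drop_illegal_groups(groups, num_sequences):
    """groups maps an Index value to its list of decoded text lines
    (a hypothetical layout chosen for illustration)."""
    return {idx: lines for idx, lines in groups.items()
            if is_legal_index(idx, num_sequences)}
```

With four encoded sequences, a group labelled 7 would be discarded because no stored line can carry that Index value, so the label must result from a sequencing or synthesis error.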
The decoded text character lines of undetermined groups are clustered according to text similarity, and clusters whose corresponding decoded text character line has an illegal Index value are deleted. Each decoded text character line of an undetermined group is then delivered to the group with which its text similarity is highest.
In this embodiment, the process in step S032c of decoding according to the groups to obtain the original text can be further subdivided into steps S032c3-S032c5:
S032c3: determine the weight value of the characters in each decoded text character line in the group;
S032c4: determine a unique length value for the group, so that the length of every decoded text character line in the group equals the unique length value;
S032c5: determine the characters of the original text from the decoded text character lines of consistent length and the weight values of the characters, and combine them to obtain the original text.
Specifically, according to the multiplicity principle, the real text line corresponding to each group is decoded and put into a set T, and the decoded text character lines in T are sorted by index value. The specific steps for determining, by the multiplicity principle, the unique decoded character line represented by a group are as follows:
First, an initial weight value is calculated for each decoded text character line of the group. The weight calculation rule in this embodiment is: (number of English letters decoded correctly in the decoded text character line) / (total number of English letters decoded in that line).
Next, the unique length value of the group is determined. The unique length value of a group is the length that occurs most frequently among the decoded text character lines in the group.
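Choosing the unique length value amounts to taking the mode of the line lengths in a group. A short sketch follows. The tie-breaking rule (the length seen first wins) is an assumption, since the source does not say how ties are resolved.

```python
from collections import Counter


def unique_length(lines):
    """Most frequent line length in the group.
    Ties go to the length encountered first (assumption)."""
    if not lines:
        raise ValueError("group has no decoded lines")
    counts = Counter(len(line) for line in lines)
    return max(counts, key=counts.get)
```

For a group such as `["abcde", "abcd", "abcde", "abcdef"]` the unique length value is 5, and the lines of length 4 and 6 are the ones that later need to be brought to that length.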
If the number of members of the group is less than τ (e.g., τ = 3), the spelling of the words in each decoded text character line of the group is checked and corrected individually. If the number of members of the group is less than τ, each decoded text character line whose length is not equal to the length θ of the group's unique decoded character line is aligned with any decoded text character line in the group whose length equals θ, and is then expanded or stretched accordingly.
The character of each column in the unique decoded line of the group is then calculated in turn from the data of all decoded text character lines in the group. The calculation rule in this embodiment is: for each column, determine the characters that appear in that column, sum for each character the weight values of the lines in which it appears in that column, and select the character with the largest weight sum.
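The column-wise weighted vote can be sketched as below, assuming the lines have already been brought to the group's unique length and that one weight per line is available. The function name and the choice to resolve ties in favor of the character seen first are illustrative assumptions.

```python
from collections import defaultdict


def weighted_consensus(rows, weights):
    """Column by column, pick the character whose rows carry the largest
    total weight. rows must share one length; weights has one value per row."""
    if len(rows) != len(weights):
        raise ValueError("need exactly one weight per row")
    if not rows:
        return ""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("rows must already have the unique length")
    consensus = []
    for col in range(length):
        totals = defaultdict(float)
        for row, w in zip(rows, weights):
            totals[row[col]] += w  # accumulate weight per candidate character
        # Ties resolve to the character seen first (assumption).
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)
```

With rows `["hello", "hallo", "hellp"]` and weights `[0.9, 0.5, 0.6]`, the second column favors `e` (total 1.5 against 0.5 for `a`) and the last column favors `o` (1.4 against 0.6 for `p`), so the consensus is `hello`.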
In summary, the implementation of this embodiment proceeds as follows. Each character appearing in the English text is encoded in order according to the coding rule, and an index value is added to the base sequence of every N original text characters, yielding a series of DNA storage sequences. The DNA storage sequences are combined into a base sequence, which is biologically stored, amplified and sequenced, and each read in the sequencing file is cleaned. The resulting reads then undergo decoding, RS error correction and multi-sequence error correction, and the originally encoded English text is recovered using word error correction.
As the data in fig. 4 show, in this embodiment the English text can be completely restored at a sequencing depth of 25 for error rates of 0.01, 0.02 and 0.05; for an error rate of 0.1, the original English text can be completely restored at a sequencing depth of 45.
From the above implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or benefits compared with the prior art:
The technical solution improves storage efficiency, makes full use of the semantic information in the original text during transcoding and decoding, and offers strong query processing capability.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011508358.7A CN112582030B (en) | 2020-12-18 | 2020-12-18 | A Text Storage Method Based on DNA Storage Medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011508358.7A CN112582030B (en) | 2020-12-18 | 2020-12-18 | A Text Storage Method Based on DNA Storage Medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112582030A true CN112582030A (en) | 2021-03-30 |
CN112582030B CN112582030B (en) | 2023-08-15 |
Family
ID=75136171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011508358.7A Active CN112582030B (en) | 2020-12-18 | 2020-12-18 | A Text Storage Method Based on DNA Storage Medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112582030B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113299347A (en) * | 2021-05-21 | 2021-08-24 | 广州大学 | DNA storage method based on modulation coding |
CN113314187A (en) * | 2021-05-27 | 2021-08-27 | 广州大学 | Data storage method, decoding method, system, device and storage medium |
CN113315623A (en) * | 2021-05-21 | 2021-08-27 | 广州大学 | Symmetric encryption method for DNA storage |
CN114218937A (en) * | 2021-11-24 | 2022-03-22 | 中国科学院深圳先进技术研究院 | Data error correction method and device and electronic equipment |
CN114356220A (en) * | 2021-12-10 | 2022-04-15 | 深圳先进技术研究院 | Encoding method based on DNA storage, electronic device and readable storage medium |
WO2023272499A1 (en) * | 2021-06-29 | 2023-01-05 | 中国科学院深圳先进技术研究院 | Encoding method, decoding method, apparatus, terminal device, and readable storage medium |
CN117254819A (en) * | 2023-11-20 | 2023-12-19 | 深圳市瑞健医信科技有限公司 | Medical waste intelligent supervision system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104850760A (en) * | 2015-03-27 | 2015-08-19 | 苏州泓迅生物科技有限公司 | Artificially synthesized DNA storage medium with coding information, storage reading method for information, and applications |
CN106845158A (en) * | 2017-02-17 | 2017-06-13 | 苏州泓迅生物科技股份有限公司 | A kind of method that information Store is carried out using DNA |
CN109416928A (en) * | 2016-06-07 | 2019-03-01 | 伊路米纳有限公司 | For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment |
CN110427786A (en) * | 2019-05-31 | 2019-11-08 | 西藏自治区人民政府驻成都办事处医院 | A method of use DNA as text information efficient storage medium |
CN110706751A (en) * | 2019-09-25 | 2020-01-17 | 东南大学 | A DNA storage encryption coding method |
CN111183233A (en) * | 2017-10-02 | 2020-05-19 | 皇家飞利浦有限公司 | Assessment of Notch cell signaling pathway activity using mathematical modeling of target gene expression |
CN111368132A (en) * | 2020-02-28 | 2020-07-03 | 元码基因科技(北京)股份有限公司 | Method for storing audio or video files based on DNA sequences and storage medium |
CN111600609A (en) * | 2020-05-19 | 2020-08-28 | 东南大学 | DNA storage coding method for optimizing Chinese storage |
Non-Patent Citations (2)
Title |
---|
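许鹏;方刚;石晓龙;刘文斌;: "DNA storage and its research progress", Journal of Electronics & Information Technology, no. 06, pages 1 - 5 * |
陈为刚;黄刚;李炳志;尹烨;元英进;: "DNA information storage of audio and video files", Scientia Sinica Vitae, no. 01, pages 1 - 4 * |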
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113299347A (en) * | 2021-05-21 | 2021-08-24 | 广州大学 | DNA storage method based on modulation coding |
CN113315623A (en) * | 2021-05-21 | 2021-08-27 | 广州大学 | Symmetric encryption method for DNA storage |
CN113299347B (en) * | 2021-05-21 | 2023-09-26 | 广州大学 | DNA storage method based on modulation coding |
CN113314187A (en) * | 2021-05-27 | 2021-08-27 | 广州大学 | Data storage method, decoding method, system, device and storage medium |
CN113314187B (en) * | 2021-05-27 | 2022-05-10 | 广州大学 | Data storage method, decoding method, system, device and storage medium |
WO2023272499A1 (en) * | 2021-06-29 | 2023-01-05 | 中国科学院深圳先进技术研究院 | Encoding method, decoding method, apparatus, terminal device, and readable storage medium |
CN114218937A (en) * | 2021-11-24 | 2022-03-22 | 中国科学院深圳先进技术研究院 | Data error correction method and device and electronic equipment |
WO2023092723A1 (en) * | 2021-11-24 | 2023-06-01 | 中国科学院深圳先进技术研究院 | Data error correction method and apparatus, and electronic device |
CN114356220A (en) * | 2021-12-10 | 2022-04-15 | 深圳先进技术研究院 | Encoding method based on DNA storage, electronic device and readable storage medium |
CN117254819A (en) * | 2023-11-20 | 2023-12-19 | 深圳市瑞健医信科技有限公司 | Medical waste intelligent supervision system |
CN117254819B (en) * | 2023-11-20 | 2024-02-27 | 深圳市瑞健医信科技有限公司 | Medical waste intelligent supervision system |
Also Published As
Publication number | Publication date |
---|---|
CN112582030B (en) | 2023-08-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112582030B (en) | A Text Storage Method Based on DNA Storage Medium | |
Organick et al. | Random access in large-scale DNA data storage | |
US10370246B1 (en) | Portable and low-error DNA-based data storage | |
Shomorony et al. | Information-theoretic foundations of DNA data storage | |
CN107403075B (en) | Comparison method, device and system | |
CN111600609B (en) | A DNA Storage Coding Method for Optimizing Chinese Storage | |
EP2983297A1 (en) | Code generation method, code generating apparatus and computer readable storage medium | |
CN111858507B (en) | DNA-based data storage method, decoding method, system and device | |
CN110569974B (en) | Hierarchical representation and interleaving encoding method for DNA storage that can contain artificial bases | |
US11600360B2 (en) | Trace reconstruction from reads with indeterminant errors | |
CN113314187B (en) | Data storage method, decoding method, system, device and storage medium | |
CN112100982B (en) | DNA storage method, system and storage medium | |
EP3160049A1 (en) | Data processing method and device for recovering valid code words from a corrupted code word sequence | |
CN112749247B (en) | Method and device for storing and reading text information | |
CN113870949A (en) | Deep learning-based nanopore sequencing data base identification method | |
Conde-Canencia et al. | Nanopore DNA sequencing channel modeling | |
Sabary et al. | Survey for a Decade of Coding for DNA Storage | |
JP4912646B2 (en) | Gene transcript mapping method and system | |
Bi et al. | Extended XOR algorithm with biotechnology constraints for data security in DNA storage | |
CN116564424A (en) | DNA data storage method, reading method and terminal based on erasure codes and assembly technology | |
Shafir et al. | Sequence design and reconstruction under the repeat channel in enzymatic dna synthesis | |
WO2019023978A1 (en) | Alignment method, device and system | |
Luo | Clustering for DNA Storage | |
Qin et al. | Robust multi-read reconstruction from contaminated clusters using deep neural network for DNA storage | |
CN118335197B (en) | DNA data storage method based on nanopore sequencing chip |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||