
CN112582030A - Text storage method based on DNA storage medium - Google Patents

Text storage method based on DNA storage medium

Info

Publication number
CN112582030A
Authority
CN
China
Prior art keywords
text
sequence
dna
original text
decoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011508358.7A
Other languages
Chinese (zh)
Other versions
CN112582030B (en)
Inventor
刘文斌
昝乡镇
姚祥宇
许�鹏
方刚
陈智华
石晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202011508358.7A
Publication of CN112582030A
Application granted
Publication of CN112582030B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text storage method based on a DNA storage medium, which comprises the following steps: acquiring an original text, and coding the original text to obtain a DNA storage sequence; synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence; obtaining a stored DNA molecule sequence, and transcoding to obtain an original text; the transcoding to obtain the original text comprises the following steps: sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence; and preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain an original text. The method directly converts the stored DNA molecule sequence through the reading length of the sequence, removes more redundant codes, improves the storage efficiency, fully utilizes the semantic information in the original text in the transcoding and decoding processes of the method, has strong query processing capacity, and can be widely applied to the technical field of system biology research.

Description

Text storage method based on DNA storage medium
Technical Field
The invention relates to the technical field of system biology research, in particular to a text storage method based on a DNA storage medium.
Background
With the development of distributed computing, cloud computing and Internet of Things technologies, the total amount of data generated every day is growing exponentially. Traditional magnetic, optical and electrical storage technologies cannot meet the future storage demand created by this growth. In addition, semiconductor-based general-purpose processors (CPUs) and application-specific integrated circuits (ASICs) face persistent difficulties in power consumption, size and reliability. The search for new information storage modes has therefore become a key fundamental problem for the sustainable development of information technology. As the carrier of genetic information, DNA molecules offer high storage density, small volume, good storage stability and low energy consumption, and they may be combined with biological computing to realize a new data processing mode that integrates storage and computation. The general procedure for DNA storage is as follows: a binary file in the computer is first encoded into base sequences, which are then synthesized, amplified and sequenced, and the original information is recovered from the sequenced base sequences. However, because DNA strands are prone to base deletion, insertion and substitution errors during synthesis, storage and sequencing, most current research adds a large amount of redundant coding to the original input information; for example, an inner code handles base errors within a sequence, and an outer code handles the loss of whole sequences. While the prior art has its own advantages, its disadvantages are also apparent: low storage efficiency, a complex decoding process, no use of semantic information, and poor information query processing capability.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, embodiments of the present invention provide a text storage method based on a DNA storage medium, which can achieve convenient and efficient text indifference storage.
In a first aspect, the present invention provides a text storage method based on a DNA storage medium, comprising the steps of:
acquiring an original text, and coding the original text to obtain a DNA storage sequence;
synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence;
obtaining a stored DNA molecule sequence, and transcoding to obtain the original text;
the transcoding to obtain the original text comprises the following steps:
sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;
and preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text.
In a possible embodiment of the present disclosure, the step of obtaining an original text, and encoding the original text to obtain a DNA storage sequence includes:
generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
generating byte check codes according to characters in the original text;
and constructing the DNA storage sequence according to the index value, the byte check code and the text data consisting of the coding base sequence.
In a possible embodiment of the present disclosure, the step of generating a byte check code according to characters in the original text includes:
coding characters in the original text with a Reed-Solomon (RS) code to obtain a binary character string;
and carrying out grouped base coding according to the binary character string to obtain the byte check code.
In a possible embodiment of the present disclosure, the step of preprocessing the read length, removing noise data from the read length, and transcoding the preprocessed read length to obtain the original text includes:
acquiring the read length, and performing reverse pushing according to a coding rule to obtain a decoded character line;
correcting the error of the decoded character line to obtain a decoded text character line;
and obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.
In a possible embodiment of the present application, the step of preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text further includes:
and determining the character with the minimum Hamming distance as a decoding character of the error base according to the read error base.
In a possible embodiment of the present disclosure, the step of obtaining a plurality of packets according to the decoded text character lines and the text content includes:
dividing according to the index value of the decoded text character line to obtain a plurality of groups, and determining the text similarity of group members;
performing secondary division on the group members according to the text similarity, wherein the secondary division comprises at least one of the following steps:
adding the members with the text similarity smaller than the first threshold value to other groups according to a preset first threshold value;
determining the average value of the text similarity, and deleting the group members according to the average value;
and clustering the members which do not belong to the group according to the text similarity to obtain a new group.
In a possible embodiment of the present disclosure, the step of decoding the packet to obtain an original text includes:
determining a weight value for a character in the decoded text character line in the packet;
determining a unique length value for the packet such that a length value of a decoded text character line in the packet is the same as the unique length value;
and determining characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining to obtain the original text.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
the method comprises the steps of coding an original text into a base sequence, synthesizing and amplifying the base sequence, storing the amplified DNA molecular sequence, sequencing the stored DNA molecular sequence to obtain the read length of the sequence, deleting the read length of noise in the sequence, and recovering according to the read length to obtain the original text; the method directly converts the stored DNA molecule sequence through the reading length of the sequence, removes more redundant codes, improves the storage efficiency, fully utilizes the semantic information in the original text in the transcoding and decoding processes, and has strong query processing capability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a method for storing text based on a DNA storage medium according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a DNA storage sequence according to an embodiment;
FIG. 3 is a flowchart illustrating the grouping steps according to the decoded text character lines and the text content in the embodiment;
FIG. 4 is a histogram showing the accuracy of restoring English text under different error rates and sequencing depths.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In a first aspect, as shown in FIG. 1, the present application provides a text storage method based on a DNA storage medium, comprising steps S01-S03:
S01, acquiring the original text, and coding the original text to obtain a DNA storage sequence.
Taking English text as an example, the present embodiment encodes the characters of the English text according to the coding rule to form the DNA storage sequence.
In this embodiment, the step of encoding the original text to obtain the DNA storage sequence specifically includes steps S011 to S013:
S011, generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
S012, generating byte check codes according to characters in the original text;
S013, constructing a DNA storage sequence according to the index value, the byte check code and the text data consisting of the coding base sequence.
Specifically, the characters appearing in the English text are encoded in order according to the character coding rule, and the coding base sequences of every M (M > 0) text characters form one data storage unit. Reed-Solomon (RS) coding is applied to the M text characters to generate a t-byte check code, and the sequence number of each data storage unit, in order of generation, is written as n decimal digits whose corresponding base sequence serves as the index value (Index) of that unit. A DNA storage sequence thus consists of an index field, an RS check code and a text data field.
Taking n = 5, t = 4 and M = 25 as an example, FIG. 2 shows the structure of a DNA storage sequence.
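The division of the text into M-character storage units, each labelled with an n-digit decimal index, can be sketched as follows. This is a minimal illustration rather than the patented encoder: the function name `split_into_units`, the choice to start numbering at 0, and the error raised on index overflow are assumptions. The base coding of characters and index digits (Tables 1 and 2) and the RS check code are left out.

```python
def split_into_units(text, m=25, n=5):
    """Split text into M-character storage units, each with an n-digit decimal index.

    Assumptions (not stated in the source): indexes start at 0 and are
    zero-padded on the left to n digits.
    """
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    units = []
    for seq_no, start in enumerate(range(0, len(text), m)):
        # A short final unit is filled with space characters up to M characters.
        chunk = text[start:start + m].ljust(m)
        index = str(seq_no).zfill(n)
        if len(index) > n:
            raise ValueError("text needs more units than an n-digit index can label")
        units.append((index, chunk))
    return units


if __name__ == "__main__":
    for index, chunk in split_into_units("DNA is a dense and durable storage medium.", m=25, n=5):
        print(index, repr(chunk))
```

Padding the last unit with spaces follows the rule given for storage sequences that hold fewer than M characters.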
In this example, the base sequences corresponding to the text characters are shown in table 1:
TABLE 1
[Table 1 is provided as an image in the original: Figure BDA0002845581200000041]
In this example, the first part of the DNA storage sequence is the Index, which is also a base sequence and marks the position of the DNA storage line in the originally encoded text file. Every 6 bases of the Index base sequence form one unit corresponding to one decimal digit, for n decimal digits in total. The numerical code table for each digit of the Index is shown in Table 2:
TABLE 2
[Table 2 is provided as an image in the original: Figure BDA0002845581200000051]
In this embodiment, step S012 of generating the byte check code according to the characters in the original text can be further subdivided into steps S012a and S012b:
S012a, coding the characters in the original text with the Reed-Solomon (RS) code to obtain a binary character string;
S012b, performing grouped base coding according to the binary character string to obtain the byte check code.
Specifically, in the embodiment, after the characters in the original English text are converted into a binary character string by RS check coding, the string is divided into groups of 4 bits, and each 4-bit group is base-coded according to Table 3.
TABLE 3
RS grouping Encoding RS grouping Encoding RS grouping Encoding RS grouping Encoding
0000 GTGT 0100 CACA 1000 TCAC 1100 ACTC
0001 GATG 0101 GTTC 1001 TACC 1101 AGCT
0010 AGAC 0110 TGGT 1010 GAGA 1110 TCGA
0011 CTTG 0111 CAGT 1011 GAAC 1111 TGCA
Under the coding rules of the embodiment, that is, coding with the relationships provided in Tables 1, 2 and 3, the length of the DNA storage sequence is fixed, with a length L of n × 5 + 8 × t + M × 4. If the number of characters l (l > 0) of the encoded English text in a storage sequence is smaller than M, the remaining base sequence units of the storage sequence may be filled with the base sequences corresponding to (M - l) space characters.
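The grouped base coding of Table 3 can be sketched as below. The mapping is copied from Table 3; the function names and the convention that the high 4 bits of each byte are coded first are assumptions. The input bytes stand in for an RS check code computed elsewhere, since no RS encoder is included here.

```python
# 4-bit group -> 4-base code, copied from Table 3.
RS_GROUP_TO_BASES = {
    "0000": "GTGT", "0100": "CACA", "1000": "TCAC", "1100": "ACTC",
    "0001": "GATG", "0101": "GTTC", "1001": "TACC", "1101": "AGCT",
    "0010": "AGAC", "0110": "TGGT", "1010": "GAGA", "1110": "TCGA",
    "0011": "CTTG", "0111": "CAGT", "1011": "GAAC", "1111": "TGCA",
}
BASES_TO_RS_GROUP = {bases: bits for bits, bases in RS_GROUP_TO_BASES.items()}


def encode_check_bytes(check_bytes):
    """Code each byte as two 4-bit groups (high bits first, an assumption)."""
    bits = "".join(format(b, "08b") for b in check_bytes)
    return "".join(RS_GROUP_TO_BASES[bits[i:i + 4]] for i in range(0, len(bits), 4))


def decode_check_bases(bases):
    """Inverse of encode_check_bytes; raises KeyError on an unknown 4-base code."""
    bits = "".join(BASES_TO_RS_GROUP[bases[i:i + 4]] for i in range(0, len(bases), 4))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    placeholder_check = bytes([0x1F, 0xA0, 0x00, 0xFF])  # stand-in for t = 4 check bytes
    encoded = encode_check_bytes(placeholder_check)
    print(encoded, len(encoded))
```

Each byte yields two 4-base codes, so t check bytes occupy 8 × t bases, which matches the 8 × t term in the length L.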
S02, synthesizing the DNA storage sequence to obtain a DNA molecule sequence, amplifying the DNA molecule sequence, and storing the amplified DNA molecule sequence.
Specifically, the DNA storage sequence obtained in step S01 is synthesized, amplified and stored. In the synthesis process, deoxynucleotides are chemically linked one by one in the preset nucleotide order of the DNA storage sequence to synthesize a DNA strand, namely the DNA molecule sequence. In the amplification process, multiple copies of the DNA molecule sequence are generated; in the embodiment, the DNA molecule sequence is amplified by PCR (polymerase chain reaction). PCR amplification is a molecular biology technique for amplifying a specific DNA fragment and can be regarded as a special form of DNA replication in vitro; its most notable feature is that a trace amount of DNA can be greatly increased. In the embodiment, the PCR process is divided into three steps: 1) DNA denaturation (90 ℃-96 ℃): under the action of heat, the hydrogen bonds of the double-stranded DNA template are broken to form single-stranded DNA; 2) annealing (60 ℃-65 ℃): the temperature of the system is lowered, and the primers bind to the DNA template to form local double strands; 3) extension (70 ℃-75 ℃): under the action of Taq polymerase (whose activity is optimal at about 72 ℃), with dNTPs as raw material, a DNA strand complementary to the template is synthesized by extending from the 3' end of the primer in the 5'→3' direction. After each cycle of denaturation, annealing and extension, the DNA content is doubled.
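The doubling per cycle stated above gives a simple idealized copy count. The function below is only an arithmetic illustration; its name is hypothetical, and it assumes perfect doubling in every cycle, which real reactions do not reach.

```python
def copies_after_pcr(initial_copies, cycles):
    """Idealized copy count, assuming the DNA content doubles in every cycle."""
    if initial_copies < 0 or cycles < 0:
        raise ValueError("initial_copies and cycles must be non-negative")
    return initial_copies * 2 ** cycles


if __name__ == "__main__":
    for cycles in (10, 20, 30):
        print(cycles, copies_after_pcr(1, cycles))
```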
Further, several DNA molecule sequences obtained after amplification are stored, for example, in a DNA molecule database.
S03, acquiring the stored DNA molecule sequence, and transcoding to obtain the original text. The transcoding to obtain the original text comprises steps S031-S032:
S031, sequencing the stored DNA molecule sequence to obtain the read length of the DNA molecule sequence;
S032, preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text.
Specifically, the stored DNA molecule sequence is first sequenced. DNA sequencing means analyzing the base sequence of a specific DNA fragment, that is, the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). For example, a second-generation or third-generation sequencer is used, and the result file output by the sequencer consists of reads, where a read is the sequencer's determination of the base composition of a DNA sequence molecule, referred to herein as the read length.
In step S032, before the reads obtained by sequencing are decoded to restore the English text characters, data preprocessing is required. The preprocessing mainly deletes low-quality reads, that is, it handles the noise data in the read length, and includes deleting reads whose insertion or deletion errors cannot be corrected and correcting reads whose insertion or deletion errors can be corrected. After preprocessing, decoding, RS error correction and multi-sequence error correction are performed on the resulting reads, and the original encoded English text is then restored using word error correction.
More specifically, in an embodiment, the process of preprocessing reads includes at least one of the following steps:
1) Each 'N' base in the reads is replaced by the base 'A'. An 'N' character means that the sequencer could not accurately determine the specific base at that position and output 'N' in its place.
2) Low-quality reads, i.e., reads in which four consecutive bases have a Phred quality score below 20, are deleted. The quality value of each base in a read reflects how accurately that base was called. When the length of one coding unit determined by the coding rule in step S01 is 6 and the Index value occupies 5 coding units, a read in which 6 consecutive bases have low Phred values is judged in preprocessing to be of poor quality and is deleted.
3) Reads with too few bases, i.e., reads with a length less than (L-5), are deleted.
4) Reads with too many bases, i.e., reads with a length greater than (L+5), are deleted.
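Preprocessing steps 1) to 4) can be sketched as a single filter. This is an illustrative sketch under stated assumptions: reads arrive as (sequence, Phred score list) pairs that have already been parsed from the sequencer output, the function names are hypothetical, and the length of the low-quality run is a parameter because the text mentions both four and six consecutive bases.

```python
def has_low_quality_run(quals, threshold=20, run=4):
    """True if `run` consecutive Phred scores are all below `threshold`."""
    streak = 0
    for q in quals:
        streak = streak + 1 if q < threshold else 0
        if streak >= run:
            return True
    return False


def preprocess_reads(reads, expected_len, threshold=20, run=4, tolerance=5):
    """Apply steps 1) to 4) to an iterable of (sequence, phred_scores) pairs.

    expected_len plays the role of L; tolerance is the 5 in (L-5) and (L+5).
    """
    kept = []
    for seq, quals in reads:
        if has_low_quality_run(quals, threshold, run):
            continue  # step 2: low-quality read
        if len(seq) < expected_len - tolerance:
            continue  # step 3: too few bases
        if len(seq) > expected_len + tolerance:
            continue  # step 4: too many bases
        kept.append(seq.replace("N", "A"))  # step 1: replace uncalled bases
    return kept


if __name__ == "__main__":
    sample = [
        ("ACGTNACGTA", [30] * 10),
        ("ACGTACGTAC", [30, 30, 10, 10, 10, 10, 30, 30, 30, 30]),
    ]
    print(preprocess_reads(sample, expected_len=10))
```

Replacing 'N' after filtering gives the same result as replacing it first, because none of the filters inspects the base letters.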
In some optional embodiments, the preprocessing of the read length, in which low-quality reads are deleted, may further include step 5):
5) Reads with insertion/deletion errors whose length is between (L-2) and (L+2) are corrected. For an erroneous base unit, the character with the minimum Hamming distance is taken as its decoded character.
For example, the complete process for correcting the insertion/deletion errors of reads is:
a) Set a sliding window whose size is a given number of decoding units, for example 2.
b) Take one window of decoding units from the read at a time, from left to right, until the base sequences corresponding to a full window of decoding units can no longer be taken out. For each window, calculate from the coding table the list of minimum Hamming distances between each decoding unit in the window and the coding table. If the values of all elements of the list are greater than or equal to 2, an insertion or deletion has occurred in the first decoding unit, and step c) is executed; otherwise, step b) is repeated.
c) A suitable character is inserted or a character is deleted at each base position of the first coding unit in turn. The inserted or deleted character must satisfy the following conditions: condition 1) it is the character corresponding to the coding unit with the smallest Hamming distance according to the coding table; condition 2) after the insertion or deletion, every element of the minimum Hamming distance list of the sliding window is less than 2. Otherwise, step b) is executed.
d) After insertion/deletion correction, reads whose length is not equal to L are deleted.
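Steps a) and b) can be sketched as follows. Because Tables 1 and 2 are only available as images, the codebook here is a hypothetical four-word example with 4-base units, and all function names are illustrative. The sketch covers detection only; the trial insertions and deletions of step c) are not implemented.

```python
EXAMPLE_CODEBOOK = ["GTGT", "CACA", "TCAC", "ACTC"]  # hypothetical coding table


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_codeword(unit, codebook):
    """Return (codeword, distance) for the codeword closest to `unit`."""
    best = min(codebook, key=lambda cw: hamming(unit, cw))
    return best, hamming(unit, best)


def window_min_distances(read, start, unit_len, window, codebook):
    """Minimum Hamming distance of each unit in the window, or None if it does not fit."""
    end = start + unit_len * window
    if end > len(read):
        return None
    return [nearest_codeword(read[i:i + unit_len], codebook)[1]
            for i in range(start, end, unit_len)]


def first_suspect_unit(read, unit_len, window, codebook):
    """Index of the first unit whose window has all distances >= 2, else None."""
    for k, start in enumerate(range(0, len(read), unit_len)):
        distances = window_min_distances(read, start, unit_len, window, codebook)
        if distances is None:
            return None  # a full window can no longer be taken out
        if all(d >= 2 for d in distances):
            return k  # an insertion or deletion is suspected in this unit
    return None


if __name__ == "__main__":
    print(nearest_codeword("GTGA", EXAMPLE_CODEBOOK))
    print(first_suspect_unit("GTGTAAAAGGGGCACA", 4, 2, EXAMPLE_CODEBOOK))
```

A single substituted base leaves the affected unit at distance 1 from its codeword, so only a run of poorly matching units triggers the insertion/deletion branch.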
In this embodiment, the transcoding process in step S032 that obtains the original text can be further subdivided into steps S032a-S032c:
S032a, obtaining the read length after preprocessing, and performing reverse derivation according to the coding rule to obtain a decoded character line;
S032b, correcting errors of the decoded character line to obtain a decoded text character line;
S032c, obtaining a plurality of groups according to the decoded text character lines and the text content, and decoding according to the groups to obtain the original text.
Specifically, for the reads obtained through preprocessing, every 6 consecutive bases are first taken as one coding unit, and the character corresponding to each coding unit is obtained according to the character coding table, the index coding table and the RS grouping coding table adopted in step S01, giving the decoded character line corresponding to each read. RS error correction is then performed on the decoded character line of each read, generating a decoded text character line that contains only the index information spliced with the error-corrected text character string. The decoded text character lines are grouped according to their index values and text content. The real text line corresponding to each group, namely the original text line, is decoded according to the multiplicity principle and put into a set T, and the decoded text character lines in T are sorted by index value. Finally, the index values of the decoded text character lines in T are removed in order, and the character strings of the text data field are output to a decoded character file.
In this embodiment, the process in step S032c of obtaining several groups according to the decoded text character lines and the text content can be further subdivided into steps S032c1-S032c2:
S032c1, dividing the decoded text character lines according to their index values to obtain a plurality of groups, and determining the text similarity of the members in each group;
S032c2, performing secondary division on the members in the groups according to the text similarity, wherein the secondary division comprises at least one of steps A-C:
A. Adding the members whose text similarity is smaller than a preset first threshold to other groups.
B. Determining the average value of the text similarity, and deleting group members according to the average value.
C. Clustering the members that do not belong to any group according to the text similarity to obtain a new group.
Specifically, as shown in FIG. 3, grouping is performed according to the index value and the text content of the decoded text character lines. The preliminary grouping uses the index value only.
After the preliminary grouping, for each decoded text character line in a group with fewer than 3 members, the text similarity between its text data field and the central member of each other group that has more than 3 members is calculated. The central member of a group is the member with the highest average text similarity to the other members of that group, and it approximately represents the actual storage line corresponding to the group. If the similarity exceeds a given threshold (for example, 0.8), the decoded character line is deleted from its current group and delivered to the group with the highest text similarity. In an embodiment, the text similarity of two character strings is calculated as follows: the two character strings s1 and s2 are aligned using a sequence alignment algorithm such as the Needleman-Wunsch algorithm, the number of positions at which the aligned strings have the same character is counted, and this count is divided by the larger of the lengths of s1 and s2; the quotient is the text similarity of the two character strings.
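The text similarity described above can be sketched with a plain Needleman-Wunsch alignment. The scoring scheme (match +1, mismatch -1, gap -1) and the function name are assumptions, since the source does not give alignment parameters; two empty strings are treated as fully similar by convention.

```python
def text_similarity(s1, s2, match=1, mismatch=-1, gap=-1):
    """Identical aligned positions divided by the longer string length."""
    n, m = len(s1), len(s2)
    if n == 0 and m == 0:
        return 1.0

    def sub(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Fill the global alignment score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment and count identical columns.
    same, i, j = 0, n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            same += s1[i - 1] == s2[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in s2
        else:
            j -= 1  # gap in s1
    return same / max(n, m)


if __name__ == "__main__":
    print(text_similarity("storage medium", "storge medium"))
```

When several optimal alignments exist, the traceback prefers the diagonal move, so the count refers to one particular optimal alignment.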
For each group, members whose text similarity differs greatly from that of the other members are deleted according to the text similarity and a preset parameter. The text similarity between a member and the other members of its group is the mean of the text similarities between that member and each of the other members of the group.
Groups whose unique decoded character line has an illegal Index value are deleted. Whether an Index value is legal is judged as follows: the Index value is compared with the number of DNA storage sequences produced when the text file was encoded; if the Index value is smaller than that number it is legal, otherwise it is illegal.
The decoded text character lines of undetermined groups are clustered according to the text similarity and a preset parameter, and clusters whose corresponding decoded text character line has an illegal Index value are deleted. Each decoded text character line of an undetermined group is then delivered, according to the text similarity, to the group with which it has the greatest text similarity.
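The preliminary grouping by index value, the choice of a central member and the reassignment of lines from small groups can be sketched as below. This is a simplified illustration: a position-wise similarity stands in for the alignment-based text similarity, the function names are hypothetical, and the deletion and clustering steps are omitted.

```python
from collections import defaultdict


def positional_similarity(a, b):
    """Stand-in similarity: equal characters at equal positions / longer length."""
    if not a and not b:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def group_by_index(lines):
    """Preliminary grouping of (index_value, text) pairs by index value."""
    groups = defaultdict(list)
    for index, text in lines:
        groups[index].append(text)
    return dict(groups)


def central_member(members, similarity=positional_similarity):
    """Member with the highest mean similarity to the other members."""
    if len(members) == 1:
        return members[0]

    def mean_similarity(i):
        others = [m for j, m in enumerate(members) if j != i]
        return sum(similarity(members[i], o) for o in others) / len(others)

    return members[max(range(len(members)), key=mean_similarity)]


def reassign_small_groups(groups, small=3, threshold=0.8,
                          similarity=positional_similarity):
    """Move lines of groups with fewer than `small` members to the most similar
    group with more than `small` members, if the similarity exceeds `threshold`."""
    centers = {idx: central_member(ms, similarity)
               for idx, ms in groups.items() if len(ms) > small}
    result = {idx: list(ms) for idx, ms in groups.items()}
    for idx, members in groups.items():
        if len(members) >= small or not centers:
            continue
        for text in members:
            best_sim, best_idx = max((similarity(text, c), other)
                                     for other, c in centers.items())
            if best_sim > threshold:
                result[idx].remove(text)
                result[best_idx].append(text)
    return {idx: ms for idx, ms in result.items() if ms}


if __name__ == "__main__":
    lines = [(1, "hello world")] * 4 + [(7, "hello worle")]
    print(reassign_small_groups(group_by_index(lines)))
```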
In this embodiment, the process in step S032c of decoding according to the groups to obtain the original text can be further subdivided into steps S032c3-S032c5:
S032c3, determining the weight value of the characters in the decoded text character lines in the group;
S032c4, determining a unique length value for the group, so that the length value of each decoded text character line in the group is the same as the unique length value;
S032c5, determining the characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining them to obtain the original text.
Specifically, according to the multiplicity principle, the real text line corresponding to each group is decoded and put into a set T, and the decoded text character lines in T are sorted by index value. The specific steps for determining, by the multiplicity principle, the unique decoded character line represented by a group are as follows:
First, an initial weight value is calculated for each decoded text character line of the group. The weight calculation rule in the embodiment is: weight = (number of English letters correctly decoded in the decoded text character line) / (total number of English letters decoded in the decoded text character line).
Next, the unique length value of the group is determined; it is the most frequently occurring length among the decoded text character lines in the group.
If the number of members of the group is less than τ (for example, τ = 3), the spelling of the words in each decoded text character line of the group is checked and corrected. If the number of members of the group is less than τ, each decoded text character line whose length is not equal to the length θ of the group's unique decoded character line is aligned with any decoded text character line of length θ in the group and is then extended or stretched accordingly.
The character of each column of the group's unique decoded line is then calculated in turn from the decoded text character lines in the group. The calculation rule in the embodiment is: for each column, the sum of the weight values of the lines in which each character appears in that column is calculated, and the character with the largest weight sum is selected.
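The unique length value and the weighted column vote can be sketched as follows. The function names are hypothetical, and one simplification is an assumption: lines whose length differs from the unique length are skipped here, whereas the description aligns and adjusts them first.

```python
from collections import Counter, defaultdict


def unique_length(lines):
    """Most frequently occurring line length in the group."""
    return Counter(len(s) for s in lines).most_common(1)[0][0]


def weighted_consensus(lines, weights):
    """Per column, pick the character with the largest sum of line weights."""
    length = unique_length(lines)
    # Simplification: only lines that already have the unique length vote.
    voters = [(s, w) for s, w in zip(lines, weights) if len(s) == length]
    consensus = []
    for col in range(length):
        totals = defaultdict(float)
        for s, w in voters:
            totals[s[col]] += w
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)


if __name__ == "__main__":
    group = ["hello", "hellp", "jello", "hell"]
    line_weights = [1.0, 0.8, 0.8, 0.5]
    print(weighted_consensus(group, line_weights))
```

Weighting the vote lets a line with many correctly decoded letters outweigh a noisier line, while an error confined to one read is still outvoted by the others.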
In summary, the implementation process of this embodiment is as follows. According to the coding rule, each character appearing in sequence in the English text is encoded, and an index value is added to the base sequence of every N original text characters to obtain a series of DNA storage sequences. The DNA storage sequences are synthesized into base sequences, stored biologically, amplified and sequenced, and data cleaning is performed on each read in the sequencing file. After the resulting reads undergo decoding, RS (Reed-Solomon) error correction and multi-sequence error correction, the originally encoded English text is recovered using word error correction.
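The indexing idea in this summary can be sketched as follows. The coding rule itself is not given in this passage, so the sketch substitutes a plain assumption: two bits per base (A, C, G, T as the base-4 digits 0 to 3), four bases per 8-bit character, and a fixed-width base-4 index in front of every N characters. The check code, synthesis, sequencing and error correction are all omitted, and every name here is hypothetical.

```python
BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3 (base-4 digits)


def to_bases(value: int, width: int) -> str:
    """Write a non-negative integer as a fixed-width base-4 string of bases."""
    digits = []
    for _ in range(width):
        digits.append(BASES[value % 4])
        value //= 4
    return "".join(reversed(digits))


def from_bases(seq: str) -> int:
    """Inverse of to_bases."""
    value = 0
    for base in seq:
        value = value * 4 + BASES.index(base)
    return value


def encode(text: str, n: int = 8, index_width: int = 4) -> list[str]:
    """Split text into chunks of n characters and prefix each with its index."""
    sequences = []
    for idx, start in enumerate(range(0, len(text), n)):
        payload = "".join(to_bases(ord(c), 4) for c in text[start:start + n])
        sequences.append(to_bases(idx, index_width) + payload)
    return sequences


def decode(sequences: list[str], index_width: int = 4) -> str:
    """Recover the text from error-free sequences given in any order."""
    rows = {}
    for seq in sequences:
        payload = seq[index_width:]
        rows[from_bases(seq[:index_width])] = "".join(
            chr(from_bases(payload[i:i + 4])) for i in range(0, len(payload), 4)
        )
    return "".join(rows[i] for i in sorted(rows))


text = "DNA storage keeps text for a long time"
stored = encode(text)
recovered = decode(list(reversed(stored)))  # order is restored from the index
```

The index is what allows reads that come back from sequencing in arbitrary order to be put back in sequence and grouped before the per-group decoding described above.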
As shown in Fig. 4, the data show that in this embodiment the English text can be completely restored at a sequencing depth of 25 when the error rate is 0.01, 0.02 or 0.05; when the error rate is 0.1, the original English text can be completely restored at a sequencing depth of 45.
From the above specific implementation process, it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages:
The technical solution improves storage efficiency, makes full use of the semantic information in the original text during transcoding and decoding, and offers high query processing capability.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A text storage method based on a DNA storage medium, characterized by comprising the following steps: obtaining an original text, and encoding the original text to obtain a DNA storage sequence; synthesizing the DNA storage sequence to obtain a DNA molecular sequence, amplifying the DNA molecular sequence, and storing the amplified DNA molecular sequence; obtaining the stored DNA molecular sequence, and performing transcoding to obtain the original text; wherein the transcoding to obtain the original text comprises the following steps: sequencing the stored DNA molecular sequence to obtain reads of the DNA molecular sequence; preprocessing the reads, removing noise data from the reads, and transcoding the preprocessed reads to obtain the original text.

2. The text storage method based on a DNA storage medium according to claim 1, characterized in that the step of obtaining the original text and encoding the original text to obtain the DNA storage sequence comprises: generating an encoded base sequence according to an encoding rule and the characters in the original text, and generating an index value according to the encoded base sequence; generating a byte check code according to the characters in the original text; constructing the DNA storage sequence according to the index value, the byte check code, and text data composed of the encoded base sequence.

3. The text storage method based on a DNA storage medium according to claim 2, characterized in that the step of generating the byte check code according to the characters in the original text comprises: encoding the characters in the original text with a Reed-Solomon code to obtain a binary string; performing grouped base encoding on the binary string to obtain the byte check code.

4. The text storage method based on a DNA storage medium according to claim 1, characterized in that the step of preprocessing the reads, removing noise data from the reads, and transcoding the preprocessed reads to obtain the original text comprises: obtaining the preprocessed reads, and inverting the encoding rule to obtain decoded character lines; performing error correction on the decoded character lines to obtain decoded text character lines; obtaining several groups according to the decoded text character lines and the text content, and decoding the groups to obtain the original text.

5. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of preprocessing the reads, removing noise data from the reads, and transcoding the preprocessed reads to obtain the original text further comprises: for an erroneous base in the reads, determining the character with the minimum Hamming distance as the decoded character for the erroneous base.

6. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of obtaining several groups according to the decoded text character lines and the text content comprises: dividing the decoded text character lines into several groups according to their index values, and determining the text similarity of the group members; performing a secondary division of the group members according to the text similarity, the secondary division comprising at least one of the following steps: according to a preset first threshold, adding members whose text similarity is less than the first threshold to other groups; determining the mean of the text similarity, and deleting group members according to the mean; clustering, according to the text similarity, the members that do not belong to any group to obtain new groups.

7. The text storage method based on a DNA storage medium according to claim 4, characterized in that the step of decoding the groups to obtain the original text comprises: determining the weight values of the characters in the decoded text character lines of a group; determining the unique length value of the group, such that the length values of the decoded text character lines in the group are the same as the unique length value; determining the characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining them to obtain the original text.
CN202011508358.7A 2020-12-18 2020-12-18 A Text Storage Method Based on DNA Storage Medium Active CN112582030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011508358.7A CN112582030B (en) 2020-12-18 2020-12-18 A Text Storage Method Based on DNA Storage Medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011508358.7A CN112582030B (en) 2020-12-18 2020-12-18 A Text Storage Method Based on DNA Storage Medium

Publications (2)

Publication Number Publication Date
CN112582030A true CN112582030A (en) 2021-03-30
CN112582030B CN112582030B (en) 2023-08-15

Family

ID=75136171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011508358.7A Active CN112582030B (en) 2020-12-18 2020-12-18 A Text Storage Method Based on DNA Storage Medium

Country Status (1)

Country Link
CN (1) CN112582030B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299347A (en) * 2021-05-21 2021-08-24 广州大学 DNA storage method based on modulation coding
CN113315623A (en) * 2021-05-21 2021-08-27 广州大学 Symmetric encryption method for DNA storage
CN113299347B (en) * 2021-05-21 2023-09-26 广州大学 DNA storage method based on modulation coding
CN113314187A (en) * 2021-05-27 2021-08-27 广州大学 Data storage method, decoding method, system, device and storage medium
CN113314187B (en) * 2021-05-27 2022-05-10 广州大学 Data storage method, decoding method, system, device and storage medium
WO2023272499A1 (en) * 2021-06-29 2023-01-05 中国科学院深圳先进技术研究院 Encoding method, decoding method, apparatus, terminal device, and readable storage medium
CN114218937A (en) * 2021-11-24 2022-03-22 中国科学院深圳先进技术研究院 Data error correction method and device and electronic equipment
WO2023092723A1 (en) * 2021-11-24 2023-06-01 中国科学院深圳先进技术研究院 Data error correction method and apparatus, and electronic device
CN114356220A (en) * 2021-12-10 2022-04-15 深圳先进技术研究院 Encoding method based on DNA storage, electronic device and readable storage medium
CN117254819A (en) * 2023-11-20 2023-12-19 深圳市瑞健医信科技有限公司 Medical waste intelligent supervision system
CN117254819B (en) * 2023-11-20 2024-02-27 深圳市瑞健医信科技有限公司 Medical waste intelligent supervision system

Also Published As

Publication number Publication date
CN112582030B (en) 2023-08-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant