CN115423096A - DNA-based dynamic equalization system, data storage method and decoding method - Google Patents
DNA-based dynamic equalization system, data storage method and decoding method Download PDFInfo
- Publication number
- CN115423096A CN115423096A CN202210953717.2A CN202210953717A CN115423096A CN 115423096 A CN115423096 A CN 115423096A CN 202210953717 A CN202210953717 A CN 202210953717A CN 115423096 A CN115423096 A CN 115423096A
- Authority
- CN
- China
- Prior art keywords
- dna
- data
- characters
- molecular chain
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 64
- 238000013500 data storage Methods 0.000 title claims abstract description 23
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 claims abstract description 28
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 claims abstract description 28
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 claims abstract description 18
- 229940104302 cytosine Drugs 0.000 claims abstract description 14
- 230000008569 process Effects 0.000 claims abstract description 10
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 claims abstract description 9
- 229930024421 Adenine Natural products 0.000 claims abstract description 9
- 229960000643 adenine Drugs 0.000 claims abstract description 9
- 229940113082 thymine Drugs 0.000 claims abstract description 9
- 238000012163 sequencing technique Methods 0.000 claims abstract description 7
- 108020004414 DNA Proteins 0.000 claims description 98
- 239000011159 matrix material Substances 0.000 claims description 21
- 238000012216 screening Methods 0.000 claims description 13
- 230000006835 compression Effects 0.000 claims description 7
- 238000007906 compression Methods 0.000 claims description 7
- 102000053602 DNA Human genes 0.000 claims description 6
- 238000012217 deletion Methods 0.000 claims description 5
- 230000037430 deletion Effects 0.000 claims description 5
- 238000003780 insertion Methods 0.000 claims description 5
- 230000037431 insertion Effects 0.000 claims description 5
- 238000004806 packaging method and process Methods 0.000 claims description 5
- 238000006467 substitution reaction Methods 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 3
- 230000000295 complement effect Effects 0.000 claims description 3
- 238000012937 correction Methods 0.000 claims description 2
- 238000006073 displacement reaction Methods 0.000 claims description 2
- 238000012795 verification Methods 0.000 claims description 2
- 238000013507 mapping Methods 0.000 claims 1
- 239000013598 vector Substances 0.000 description 6
- 230000015572 biosynthetic process Effects 0.000 description 4
- 229920001519 homopolymer Polymers 0.000 description 4
- 238000011084 recovery Methods 0.000 description 4
- 230000005540 biological transmission Effects 0.000 description 3
- 230000005055 memory storage Effects 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- 230000003190 augmentative effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000008030 elimination Effects 0.000 description 2
- 238000003379 elimination reaction Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000003252 repetitive effect Effects 0.000 description 2
- 102000004190 Enzymes Human genes 0.000 description 1
- 108090000790 Enzymes Proteins 0.000 description 1
- 206010035148 Plague Diseases 0.000 description 1
- XUIMIQQOPSSXEZ-UHFFFAOYSA-N Silicon Chemical compound [Si] XUIMIQQOPSSXEZ-UHFFFAOYSA-N 0.000 description 1
- 108020004682 Single-Stranded DNA Proteins 0.000 description 1
- 241000607479 Yersinia pestis Species 0.000 description 1
- 230000003321 amplification Effects 0.000 description 1
- 238000006555 catalytic reaction Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 239000013078 crystal Substances 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 239000012634 fragment Substances 0.000 description 1
- 229910052739 hydrogen Inorganic materials 0.000 description 1
- 239000001257 hydrogen Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000003199 nucleic acid amplification method Methods 0.000 description 1
- 238000012856 packing Methods 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000010076 replication Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 229910052710 silicon Inorganic materials 0.000 description 1
- 239000010703 silicon Substances 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Genetics & Genomics (AREA)
- Bioethics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
技术领域technical field
本发明涉及数据存储领域,具体涉及一种基于DNA的可动态均衡系统、数据存储方法以及解码方法。The invention relates to the field of data storage, in particular to a DNA-based dynamic equalization system, a data storage method and a decoding method.
背景技术Background technique
随着全球进入信息时代,大数据背景下,指数级别上升的数据量,对于数据存储设备有着更高的要求,DNA作为高密度存储介质,是亿万年来,自然进化的选择,同时由于其优良的结构性能,这里面主要是磷酸二脂键以及碱基基于氢键的互补配对规则,这种组合结构,具有稳定性的同时,亦可经由酶的催化,完成信息的复制、检索、转录等工作,单个DNA碱基尺寸在1纳米以下,同时DNA是柔性的,可在三维空间蜷缩、延展,其信息存储密度约为1019bit/cm3,对比电子计算机的最小单元三极管来说,5纳米几乎是目前硅晶材料的极限,其中硬盘存储密度大约是1013Bit/cm3,闪存约为1016Bit/cm3。在存储成本方面,DNA可在低温至室温环境下可稳定且长期存储,不需要能量维系,至少可以存储10万年以上。目前困扰DNA存储的主要问题一方面来源于其过高的合成、测序成本,另一方面也缺乏有效的编码手段,可以最大化的利用DNA存储的存储空间,为此我们提出了基于DNA的可动态均衡系统、数据存储方法以及解码方法。As the world enters the information age, under the background of big data, the exponentially increasing amount of data has higher requirements for data storage devices. As a high-density storage medium, DNA has been a natural evolution choice for hundreds of millions of years. At the same time, due to its excellent The structure and properties of the structure, which mainly include the phosphodiester bond and the complementary pairing rules of bases based on hydrogen bonds. This combination structure is stable and can also complete the replication, retrieval, transcription, etc. of information through the catalysis of enzymes. Work, the size of a single DNA base is less than 1 nanometer, and DNA is flexible, and can be curled up and extended in three-dimensional space, and its information storage density is about 1019bit/cm3. Compared with the smallest unit triode of an electronic computer, 5 nanometers is almost At present, the limit of silicon crystal materials, the storage density of hard disk is about 1013Bit/cm3, and the storage density of flash memory is about 1016Bit/cm3. In terms of storage cost, DNA can be stored stably and for a long time at low temperature to room temperature without energy maintenance, and can be stored for at least 100,000 years. At present, the main problems that plague DNA storage come from the high cost of synthesis and sequencing on the one hand, and the lack of effective encoding methods that can maximize the use of the storage space of DNA storage. Therefore, we propose a DNA-based A dynamic equalization system, a data storage method and a decoding method.
发明内容Contents of the invention
(一)解决的技术问题(1) Solved technical problems
针对现有技术的不足,本发明提供一种基于DNA的可动态均衡系统、数据存储方法以及解码方法,以解决上述的问题。Aiming at the deficiencies of the prior art, the present invention provides a DNA-based dynamic equalization system, data storage method and decoding method to solve the above-mentioned problems.
(二)技术方案(2) Technical solution
为实现上述所述目的,本发明提供如下技术方案:In order to achieve the above-mentioned purpose, the present invention provides the following technical solutions:
一种基于DNA的可动态均衡系统,包括:A dynamically balanced system based on DNA, including:
用于获取第一数据的获取模块、用于对所述第一数据进行编码的编码模块、用于压缩存储空间和加密的汉字字符压缩模块以及用于对编码得到的DNA分子链进行解码的解码模块。An acquisition module for acquiring the first data, an encoding module for encoding the first data, a Chinese character compression module for compressing storage space and encryption, and a decoding unit for decoding the encoded DNA molecular chain module.
一种基于DNA的可动态均衡系统的数据存储方法,包括以下步骤:A data storage method based on DNA that can dynamically balance the system, comprising the following steps:
第一步:通过获取模块获取第一数据;Step 1: Obtain the first data through the acquisition module;
第二步:通过编码模块对第一数据进行编码,得到包括伪随机存储器存储次数、编码数据、均衡、再均衡、校验以及正向和反向引物的DNA分子链,正向引物位于所述DNA分子链的3’端,所述反向引物位于所述DNA分子链的另一端,即是5’端。Step 2: Encode the first data through the encoding module to obtain the DNA molecular chain including the number of pseudo-random memory storage times, encoded data, equalization, re-equilibrium, verification, and forward and reverse primers. The forward primer is located in the The 3' end of the DNA molecular chain, and the reverse primer is located at the other end of the DNA molecular chain, that is, the 5' end.
优选的,所述第一数据的来源可以是任何现有的计算机存储的电子文件。Preferably, the source of the first data may be any existing computer-stored electronic file.
优选的,所述所述正向引物经过DNA碱基互补配对规则,反向置换后,即是反向引物,长度方面正向引物和反向引物长度一致;Preferably, the forward primer is subjected to DNA base complementary pairing rules, and after reverse displacement, it is a reverse primer, and the length of the forward primer and the reverse primer are the same;
每一所述引物中鸟嘌呤和胞嘧啶的含量占所述引物所含有的鸟嘌呤、胞嘧啶、腺嘌呤和胸腺嘧啶总含量的预设比值。The content of guanine and cytosine in each primer accounts for a preset ratio of the total content of guanine, cytosine, adenine and thymine contained in the primer.
优选的,所述第二步中当第一数据为汉字文本信息时,需对其进行汉字字符压缩,包括以下步骤:Preferably, in the second step, when the first data is Chinese character text information, it needs to perform Chinese character compression, including the following steps:
S1:先将目标文档进行分析,得到文档中的总字符数K,剔除掉重复字符之后,统计文档中出现的字符个数S,并按照字符出现频率从高到低排序,得到字符表;S1: First analyze the target document to obtain the total number of characters K in the document. After eliminating repeated characters, count the number S of characters that appear in the document, and sort them according to the frequency of occurrence of characters from high to low to obtain a character table;
S2:计算汉字字符对应的碱基长度,计算规则是:根据字符个数S,计算一个字符应该用n位碱基来编码,n需要保证>=log4S;S2: Calculate the length of bases corresponding to Chinese characters, the calculation rule is: according to the number of characters S, calculate a character should use n bases to encode, n needs to ensure >=log4S;
S3:建立碱基序列对文本字符进行标注,碱基序列的生成方式是:S3: Create a base sequence to mark text characters, the generation method of the base sequence is:
(a)以m的长度作为种子注入伪随机生成器,随机生成n位的不重复的碱基序列;(a) inject the pseudo-random generator with the length of m as a seed, and randomly generate n-bit non-repetitive base sequences;
(b)对生成的碱基序列进行筛选,如果存在2位以上连续碱基,则从字符表由下至上进行填充对应,否则从上到下进行填充对应,直至完成全部字符的碱基编码对应填充,得到填充之后的字典表,作为DNA文档统计的字典及密钥。(b) Screen the generated base sequence. If there are more than 2 consecutive bases, fill and correspond from the character table from bottom to top, otherwise fill and correspond from top to bottom until the base code correspondence of all characters is completed Filling, get the dictionary table after filling, as the dictionary and key of DNA document statistics.
(4)根据字典表中字符对应的碱基序列,将原始文档中的汉字字符全部转化为DNA碱基序列,得到压缩和加密后的DNA碱基序列。(4) According to the base sequences corresponding to the characters in the dictionary table, all the Chinese characters in the original document are converted into DNA base sequences, and the compressed and encrypted DNA base sequences are obtained.
优选的,所述DNA分子链的获取步骤如下:Preferably, the steps for obtaining the DNA molecular chain are as follows:
第一步:将第一数据划分为若干个数据子包;Step 1: dividing the first data into several data sub-packets;
第二步:通过电子计算机的伪随机数生成器生成0/1随机矩阵;The second step: generate a 0/1 random matrix through the pseudo-random number generator of the electronic computer;
第三步:根据随机矩阵中的元素值及元素位置,指定相应的数据子包进行异或编码;将编码后的数据进行动态均衡;Step 3: According to the element value and element position in the random matrix, specify the corresponding data sub-packet for XOR encoding; dynamically balance the encoded data;
第四步:记录下过程中的数据,根据预设字母表,将所述运行数字和映射为鸟嘌呤、胞嘧啶、腺嘌呤和胸腺嘧啶,并进行编码,得到并组成DNA分子链数据。Step 4: Record the data in the process, map the running numbers into guanine, cytosine, adenine and thymine according to the preset alphabet, and encode them to obtain and form DNA molecular chain data.
优选的,对生成的DNA链进行查错,当生成的DNA链发生错误时,可以识别到;Preferably, the generated DNA chain is checked for errors, and when an error occurs in the generated DNA chain, it can be identified;
DNA链可能发生的错误包括:替换、插入、删除错误;Errors that may occur in the DNA chain include: substitution, insertion, and deletion errors;
针对插入和删除错误,可以通过链的长度进行判断,针对替换错误;For insertion and deletion errors, it can be judged by the length of the chain, for replacement errors;
采用XOR异或校验的纠错方式,通过按位进行异或,最终得到所有碱基数据异或后的结果,可判断相关DNA分子链是否发生替换错误。The error correction method of XOR exclusive OR check is adopted, and the XOR result of all base data is finally obtained by performing exclusive OR bit by bit, and it can be judged whether there is a substitution error in the relevant DNA molecular chain.
优选的,对DNA分子链进行筛选处理,筛选出所述DNA分子链中折叠错乱的结构和/或无界运行数字和码,并就未通过筛选的链,进行动态均衡。Preferably, screening is performed on the DNA molecular chains to screen out disorderly folded structures and/or unbounded running numbers and codes in the DNA molecular chains, and dynamically balance the chains that fail the screening.
一种基于DNA的可动态均衡系统的解码方法,包括以下内容:根据打包结果,利用解码模块进行解码处理。A decoding method based on a DNA-based dynamic equalization system includes the following content: according to the packaging result, a decoding module is used to perform decoding processing.
(三)有益效果(3) Beneficial effects
与现有技术相比,本发明提供的基于DNA的可动态均衡系统、数据存储方法以及解码方法,具备以下有益效果:Compared with the prior art, the DNA-based dynamic equalization system, data storage method and decoding method provided by the present invention have the following beneficial effects:
1、该基于DNA的可动态均衡系统、数据存储方法以及解码方法,对所述第一数据进行编码得到DNA分子链的过程中,对所述第一地址和所述第二地址添加了多种约束,使得能够高效率且准确地的对编码数据进行读取,如使所述第一地址与所述第二地址的汉明距离大于或等于所述第一地址长度的一半,降低读取时的地址选择错误的可能性;所述第一地址的前缀与所述第二地址的前缀以及所述第二地址的后缀均不相同,避免了读取过程中出现匹配错误的可能性;每一所述引物的前缀中鸟嘌呤和胞嘧啶的含量占所述引物所含有的鸟嘌呤、胞嘧啶、腺嘌呤和胸腺嘧啶总含量的预设比值,使得在需要读取编码数据事先进行测序时,准确率高。1. In the DNA-based dynamic balancing system, data storage method, and decoding method, in the process of encoding the first data to obtain DNA molecular chains, various types of data are added to the first address and the second address Constraints, so that the encoded data can be read efficiently and accurately, such as making the Hamming distance between the first address and the second address greater than or equal to half the length of the first address, reducing the reading time The possibility of wrong address selection; the prefix of the first address is different from the prefix of the second address and the suffix of the second address, which avoids the possibility of matching errors during the reading process; each The content of guanine and cytosine in the prefix of the primer accounts for the preset ratio of the total content of guanine, cytosine, adenine and thymine contained in the primer, so that when it is necessary to read the encoded data and perform sequencing in advance, High accuracy.
2、该基于DNA的可动态均衡系统、数据存储方法以及解码方法,建立了基于DNA的存储架构,可以高效的完成第一数据信息至DNA分子链之间的转换,转换效率高。2. The DNA-based dynamic balance system, data storage method, and decoding method establish a DNA-based storage architecture, which can efficiently complete the conversion between the first data information and the DNA molecular chain, and the conversion efficiency is high.
3、该基于DNA的可动态均衡系统、数据存储方法以及解码方法,在编码DNA双链上的数据时,通过设置动态均衡的方式,实现DNA双链的绝对稳定,避免了均聚物、GC含量不均衡等情况。3. The DNA-based dynamic balance system, data storage method and decoding method, when encoding the data on the DNA double strand, realize the absolute stability of the DNA double strand by setting a dynamic balance method, avoiding the homopolymer, GC Unbalanced content etc.
4、该基于DNA的可动态均衡系统、数据存储方法以及解码方法,设计了一个基于随机矩阵的编码系统,并使用多比特打包方式,减少了降低编译码的复杂度,缩短编码时间,提高译码的速率;减少存储冗余的数量级,而且提高了编码和数据中心数据恢复率的效率,实现高度敏感的随机访问和准确重写寻址。4. The DNA-based dynamic equalization system, data storage method and decoding method design a random matrix-based coding system, and use a multi-bit packaging method to reduce the complexity of coding and decoding, shorten coding time, and improve translation. Code rate; reduce the order of magnitude of storage redundancy, but also improve the efficiency of encoding and data center data recovery rate, realize highly sensitive random access and accurate rewrite addressing.
5、该基于DNA的可动态均衡系统、数据存储方法以及解码方法,针对汉字文本信息,提供了一套全新的压缩、加密方法,实现汉字文本的高效存储。5. The DNA-based dynamic equalization system, data storage method and decoding method provide a new set of compression and encryption methods for Chinese character text information to realize efficient storage of Chinese character text.
附图说明Description of drawings
图1为本发明具体实施例数据存储方法的步骤流程示意图;Fig. 1 is a schematic flow chart of the steps of a data storage method according to a specific embodiment of the present invention;
图2为本发明具体实施例DNA分子链的示意图;Fig. 2 is the schematic diagram of the DNA molecular chain of the specific embodiment of the present invention;
图3为本发明具体实施例矩阵的生成示意图。Fig. 3 is a schematic diagram of generating a matrix according to a specific embodiment of the present invention.
具体实施方式Detailed ways
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
实施例Example
参见图1-3,本实施例提供的基于DNA的可动态均衡系统的数据存储方法,包括以下步骤:Referring to Figures 1-3, the data storage method of the DNA-based dynamically balanced system provided in this embodiment includes the following steps:
获取第一数据;Get the first data;
对第一数据进行编码,得到DNA分子链,当第一数据为汉字文本信息时,需对其进行汉字字符压缩,汉字是目前是世界上使用人数最多的文字,据统计,截止2020年,全球使用汉字的人数超过16亿人,约占世界总人口的20%以上,中国是目前的世上互联网使用人数最多的国家,(截止2019年9月,中国共有8.3亿用户接入互联网),其创造的汉字文档资源量巨大,目前汉字的通用电子存储方案,是UTF-8,单个汉字字符需要占据3-4个字节,参照2bit/nt的DNA理想存储密度,则每个汉字字符需要12-16个碱基的信息空间。然而实际上汉字的常用字符大约5000个,这5000个字符,完全可以用4-7个碱基进行存储,这无疑存在巨大的存储空间改进空间。我们提出了一种汉字动态规划存储的方法,可根据拟存储的文档信息,进行动态最优规划。同时具有很好的压缩能力和加密作用,实现做法是:Encode the first data to obtain DNA molecular chains. When the first data is Chinese character text information, it needs to be compressed. Chinese characters are currently the most used text in the world. According to statistics, by 2020, the global The number of people who use Chinese characters exceeds 1.6 billion, accounting for more than 20% of the world's total population. China is currently the country with the largest number of Internet users in the world. (As of September 2019, China has a total of 830 million users connected to the Internet). The amount of Chinese character document resources is huge. The current general electronic storage solution for Chinese characters is UTF-8. A single Chinese character needs to occupy 3-4 bytes. Referring to the ideal storage density of 2bit/nt DNA, each Chinese character needs 12- 16 bases of information space. However, in fact, there are about 5,000 commonly used characters in Chinese characters, and these 5,000 characters can be stored in 4-7 bases. There is undoubtedly a huge room for improvement in storage space. We propose a method for dynamic planning and storage of Chinese characters, which can perform dynamic optimal planning according to the document information to be stored. At the same time, it has very good compression ability and encryption function. The implementation method is:
(1)先将目标文档进行分析,得到文档中的总字符数K,剔除掉重复字符之后,统计文档中出现的字符个数S,并按照字符出现频率从高到低排序,得到字符表;(1) First analyze the target document to obtain the total number of characters K in the document. After removing repeated characters, count the number of characters S that appear in the document, and sort them according to the frequency of occurrence of characters from high to low to obtain a character table;
(2)计算汉字字符对应的碱基长度。计算规则是:根据字符个数S,计算一个字符应该用n位碱基来编码,n需要保证>=log4S。(2) Calculate the base length corresponding to the Chinese character. The calculation rule is: according to the number S of characters, a character should be encoded with n bases, and n needs to be guaranteed to be >=log4S.
(3)建立碱基序列对文本字符进行标注。碱基序列的生成方式是:(a)以m的长度作为种子注入伪随机生成器,随机生成n位的不重复的碱基序列;(b)对生成的碱基序列进行筛选,如果存在2位以上连续碱基,则从字符表由下至上进行填充对应,否则从上到下进行填充对应,直至完成全部字符的碱基编码对应填充,得到填充之后的字典表,作为DNA文档统计的字典及密钥;(3) Establish the base sequence to mark the text characters. The base sequence generation method is: (a) injecting the pseudo-random generator with the length of m as a seed, and randomly generating n-bit non-repetitive base sequences; (b) screening the generated base sequences, if there are 2 If there are more than 1 consecutive bases, then fill and correspond from the character table from bottom to top, otherwise fill and correspond from top to bottom until the base code corresponding to all characters is filled, and the dictionary table after filling is obtained as a dictionary for DNA document statistics and key;
(4)根据字典表中字符对应的碱基序列,将原始文档中的汉字字符全部转化为DNA碱基序列,得到压缩和加密后的DNA碱基序列。(4) According to the base sequences corresponding to the characters in the dictionary table, all the Chinese characters in the original document are converted into DNA base sequences, and the compressed and encrypted DNA base sequences are obtained.
获取到第一数据之后的编码方式,具体见下步骤:For the encoding method after obtaining the first data, see the following steps for details:
1)确定编码空间,编码空间主要是根据DNA合成商的合成技术以及预算决定的。1) Determine the coding space, which is mainly determined according to the synthesis technology and budget of the DNA synthesizer.
2)数据分割,将L划分为n份chunks(n=S//P+1),构成L1、L2、L3……Ln,每一份的长度为p nt;2) Data segmentation, divide L into n chunks (n=S//P+1) to form L1, L2, L3...Ln, and the length of each chunk is p nt;
3)生成随机向量,利用Python中的random函数,生成范围在(0,2n)的整数,同时记录随机生成器生成次数,将生成的整数转变为n位的二进制数据向量,其从低位(0)到高位(n-1)的元素是否为1,表示第2)步中将原始文件L分割的L1,L2……Ln的chunks是否参与编码。3) Generate random vectors, use the random function in Python to generate integers in the range of (0, 2n), record the number of times the random generator is generated, and convert the generated integers into n-bit binary data vectors, which start from the low bit (0 ) to high (n-1) elements are 1, indicating whether the chunks of L1, L2...Ln that divide the original file L in step 2) participate in encoding.
4)将生成的随机向量中元素为1位置对应的chunks进行异或,得到对应的Droplet;4) XOR the chunks corresponding to the position where the element in the generated random vector is 1, and obtain the corresponding Droplet;
5)每生成一次随机向量,记录下伪随机数生成器生成次数,存储在Times位置,通过步骤4得到的Droplet存储在Data payload位置,构成一份Packs,不断重复步骤4,直至用尽Times的存储空间,共生成4t的Packs;5) Every time a random vector is generated, record the number of times the pseudo-random number generator generates and store it in the Times location. The Droplet obtained through
6)对得到的Packs进行均衡,首先进行条件筛选,要求均聚物(连续碱基不超过3个),GC含量(45%-55%),引物在信息编码空间中是否重复,如果DNA链通过筛选,则将XE位标记为0(10个A),将此packs作为可用链进行存储;若packs没有通过筛查,则对Packs进行均衡,将Adapter序列作为随机生成器的种子,生成k次长度为(t+p)nt的随机碱基,将此随机碱基与未通过筛选的packs的碱基进行异或,得到新的碱基序列,不断重复均衡直至碱基序列通过筛选或当k>410时,表示XE均衡位存储空间用尽。对于经过k次均衡后通过筛选的链,将k存储在XE位,同时记录下通过均衡的packs碱基序列;对于XE均衡位空间用尽依然无法均衡通过筛选的链,则选择丢弃。6) Balance the obtained Packs, first perform condition screening, require homopolymer (no more than 3 consecutive bases), GC content (45%-55%), whether the primer is repeated in the information coding space, if the DNA chain If the screening is passed, the XE bit is marked as 0 (10 A), and the packs are stored as an available chain; if the packs do not pass the screening, the Packs are balanced, and the Adapter sequence is used as the seed of the random generator to generate k A random base with a secondary length of (t+p)nt is XORed with the bases of the packs that have not passed the screening to obtain a new base sequence, and the balance is repeated until the base sequence passes the screening or when When k>410, it means that the XE equalization bit storage space is exhausted. For the chain that passed the screening after k times of equalization, store k in the XE position, and record the base sequence of the packs that passed the balance; for the chain that still cannot be balanced and pass the screening when the XE balance bit space is exhausted, choose to discard.
7)再均衡,由于加入了10nt的Xe均衡位,Xe位本身还会造成均聚物、GC含量过高或过低等问题,我们引入了2nt的再均衡位,通过对XE位进行再均衡,均衡方法同第6)中对Packs的均衡方法一致,再均衡后,均衡次数存储在XEE位,Packs及XE位存储为均衡后的碱基序列,未通过再均衡的链则直接丢弃。值得注意的是再均衡不再改变Packs的碱基序列,同时再均衡的判断条件,考虑到了XE与Packs之间的连接部分以及XE与XEE之间的链接部分的均聚物情况;7) Rebalance. Since the 10nt Xe balance bit is added, the Xe position itself will cause problems such as homopolymers, too high or too low GC content, etc. We have introduced a 2nt rebalance bit to rebalance the XE position , the equalization method is the same as the equalization method for Packs in Section 6). After rebalancing, the number of equalization times is stored in the XEE bit, and the Packs and XE bits are stored as the balanced base sequence, and the chains that do not pass the rebalancing are directly discarded. It is worth noting that rebalancing will no longer change the base sequence of Packs, and at the same time, the judging conditions for rebalancing take into account the homopolymer conditions of the link between XE and Packs and the link between XE and XEE;
8)假定通过步骤7后的DNA链为G个,存储冗余为m个,为确保最终存储内容的鲁棒性,先再G中,随机选取(n+m)条链,再在(n+m)中,随机选择1000次n,根据Times生成相应的度分布序列(此序列向量长度为n),构造成为(n*n)维矩阵,利用异或高斯消元法求解得到的三角矩阵,需保证至少550次最终得到的三角矩阵主对角线上元素全部为1。8) Assume that there are G DNA strands after
9)合成,将步骤8)最终选取的(n+m)条链,自均衡后得到的packs开始,经XE、XEE为止,按3nt为单位,进行异或,最终异或得到3nt的碱基序列,存储为XC位附在XEE后。9) Synthesis, the (n+m) chains finally selected in step 8), start from the packs obtained after equalization, go through XE and XEE, perform XOR in units of 3nt, and finally get 3nt bases by XOR Sequence, stored as XC bits appended to XEE.
10)将得到的链添加正向、反向引物,进行生物合成,完成最终DNA存储。10) Add forward and reverse primers to the obtained strand to carry out biosynthesis and complete the final DNA storage.
如图2所示,在本实施例中,目的是实现高度敏感的随机访问和准确重写寻址。所提出的方法的原理是每个块在随机存取中系统必须装备有允许通过DNA进行独特选择和扩增的地址序列引物。由于信息编码的原因,数据信息长度不确定,数据链有可能是一条完整的数据信息,也有可能是一段或极端分散的链信息,所以在DNA分子链前、后端加上引物,用以识别不同的链信息。本实施例中以DNA分子链长度为700bps为例进行说明,在其他实施例中可以为其他长度,将正向引物和反向引物合成在的块序列两端,将正向引物和反向引物分别存放在到长度为20bps的短块中,块序列用于存储数据。编码过程为将块序列的第一数据进行编码,得到编码数据,以得到编码后的DNA分子链。As shown in Figure 2, in this embodiment, the aim is to achieve highly sensitive random access and accurate rewrite addressing. The principle of the proposed method is that each block in random access must be systematically equipped with address sequence primers that allow unique selection and amplification by DNA. Due to information encoding, the length of data information is uncertain. The data chain may be a complete data information, or it may be a segment or extremely scattered chain information. Therefore, primers are added to the front and back ends of the DNA molecular chain for identification. Different chain information. In this embodiment, the length of the DNA molecular chain is 700bps as an example for illustration. In other embodiments, it can be other lengths. The forward primer and the reverse primer are synthesized at both ends of the block sequence, and the forward primer and the reverse primer are They are respectively stored in short blocks with a length of 20bps, and the block sequence is used to store data. The encoding process is to encode the first data of the block sequence to obtain the encoded data, so as to obtain the encoded DNA molecular chain.
在本实施例中,将第一数据划分为若干个第二数据,每一第二数据为包含引物、伪随机存储器存储次数位、编码数据位、均衡位、再均衡位、校验位,引物是描述DNA前缀同步的代码,选择的数量共同编码字由目标来控制,以使重写尽可能简单和避免由于可变代码长度的错误传播;“字编码”操作如下:首先,不同文本数据信息中的单词被计数并在字典中制表,字典中的每个词都被转换变换为足够长的二进制序列,以允许对字典进行编码。其次,使用四进制模型(例如可将00,01,10,11与DNA中的碱基A(腺嘌呤),T(胸腺嘧啶),C(胞嘧啶),G(鸟嘌呤)一一对应进行编码。In this embodiment, the first data is divided into several second data, and each second data includes a primer, a pseudo-random memory storage number bit, an encoded data bit, an equalization bit, a re-equilibrium bit, a check bit, and a primer is a code describing the synchronization of DNA prefixes, the number of co-coded words selected is controlled by the goal, to make rewriting as simple as possible and to avoid error propagation due to variable code length; "word encoding" operates as follows: first, different textual data information The words in are counted and tabulated in a dictionary, and each word in the dictionary is transformed into a binary sequence long enough to allow encoding of the dictionary. Secondly, use the quaternary model (for example, 00, 01, 10, 11 can be one-to-one correspondence with the bases A (adenine), T (thymine), C (cytosine), and G (guanine) in DNA to encode.
每一引物的前缀中G(鸟嘌呤)和C(胞嘧啶)的含量占引物所含有的G(鸟嘌呤)、C(胞嘧啶)、A(腺嘌呤)和T(胸腺嘧啶)总含量的预设比值,预设比值包括但不限于45%~55%。The content of G (guanine) and C (cytosine) in the prefix of each primer accounts for the total content of G (guanine), C (cytosine), A (adenine) and T (thymine) contained in the primer The preset ratio includes but not limited to 45%-55%.
原因在于:由于DNA是通过G(鸟嘌呤)、A(腺嘌呤)、C(胞嘧啶)、T(胸腺嘧啶)和来存储信息,A与T,C与G之间两两配对能够形成稳定的双链接结构,无论是单链DNA还是双链DNA,均可以二进制编码的形式存储信息,其中双链DNA中,需要有一定GC含量的前缀(使GC含量约占总量的45%~55%)。因为具有50%GC含量的DNA双链比具有更低或者更高GC含量的DNA双链更加稳定,并且在测序期间能有更好的覆盖。由于编码中的用户信息是通过前缀同步来实现的,因此对地址以及它们的前缀来施加GC内容约束很重要,因为后一个要求还需要确保编码数据块的所有片段能具有被去除的GC内容;The reason is that because DNA stores information through G (guanine), A (adenine), C (cytosine), T (thymine) and the pairing of A and T, C and G can form a stable Whether it is single-stranded DNA or double-stranded DNA, information can be stored in the form of binary codes. Among them, double-stranded DNA needs to have a prefix with a certain GC content (so that the GC content accounts for about 45% to 55% of the total. %). Because DNA duplexes with 50% GC content are more stable than DNA duplexes with lower or higher GC content, and can have better coverage during sequencing. Since the user information in the encoding is implemented by prefix synchronization, it is important to impose GC content constraints on addresses and their prefixes, because the latter requirement also needs to ensure that all fragments of the encoded data block can have GC content removed;
译码方面的步骤如下:The decoding steps are as follows:
1)将测序得到的DNA链,根据正反向Adapter,进行提取,对提出的DNA链长度进行判断分析,如果长度低于目标长度或高于目标长度,则说明发生了删除/插入错误,丢弃此链;1) Extract the DNA strand obtained by sequencing according to the forward and reverse Adapter, and judge and analyze the length of the proposed DNA strand. If the length is lower than the target length or higher than the target length, it means that a deletion/insertion error has occurred and discarded this chain;
2)对数据按XC位规定的长度,进行异或还原,还原后的数据,如果和XC中存储的数据不一致,否则说明发生了替换错误,丢弃此条链;2) Perform XOR restoration on the data according to the length specified by the XC bit. If the restored data is inconsistent with the data stored in XC, otherwise it means that a replacement error has occurred, and the chain is discarded;
3)若通过测序完成的链数量K>=n,则可以启动正式译码流程,首先是信息恢复,将引物信息,作为随机生成器的种子,注入伪随机生成器,根据XEE位置记录的伪随机生成器生成次数,对XEE进行信息恢复,然后再根据恢复后得到的XEE的伪随机生成器生成次数信息,对Times及data payload位进行数据恢复,得到原始的数据数据信息。3) If the number of chains K>=n completed by sequencing, the formal decoding process can be started. First, the information is recovered, and the primer information is used as the seed of the random generator, and injected into the pseudo-random generator. Random generator generation times, XEE information recovery, and then according to the XEE pseudo-random generator generation number information obtained after recovery, data recovery of Times and data payload bits, to obtain the original data data information.
4)根据Times对应的伪随机数生成器生成次数,生成相应的0-2**(n)的随机整数,并将其转变为n位的二进制向量,组成K*n维二进制分布序列。4) Generate a corresponding random integer of 0-2**(n) according to the generation times of the pseudo-random number generator corresponding to Times, and convert it into an n-bit binary vector to form a K*n-dimensional binary distribution sequence.
5)将Data payload中的数据,由碱基转换为十进制数据,同度分布序列矩阵构成增广矩阵,利用异或高斯消元法,进行矩阵求解。(求解规则如下:首先将K阶矩阵D,与K行1列的Data矩阵组合,构建增广矩阵,接下来沿着矩阵对角线进行判断(i从0-k),若D[i][i]=1,则沿着列判断其下所有序列,若D[j][i]=1,则将第i行所有数据与第j行所有数据进行异或。若D[i][i]=0,则沿着列向下寻找,找到D[j][i]=1时,互换两行,然后再向下寻找,若还有D[j][i]=1,则用第i行同第j行进行异或,确保构建出一个上三角矩阵,矩阵对角线下方区域全部为0.再依照上一步,反向操作,将对角线上方为1的全部消为0,即可得到唯一的S1……Sk,以及Data1……Datak.完成译码过程。)得到S1、S2……Sn的数据解,将此部分解转换为二进制序列,进而转变成碱基序列,即可得到ChunksL1、L2……Ln。5) The data in the Data payload is converted from base to decimal data, and the sequence matrix of the same degree distribution forms an augmented matrix, and the matrix is solved by using the XOR Gaussian elimination method. (The solution rules are as follows: first combine the K-order matrix D with the Data matrix of K rows and 1 column to construct an augmented matrix, and then judge along the diagonal of the matrix (i from 0-k), if D[i] [i]=1, then judge all the sequences below it along the column, if D[j][i]=1, then XOR all the data in row i and all data in row j. If D[i][ i]=0, then search down along the column, when D[j][i]=1 is found, exchange the two rows, and then search down, if there is still D[j][i]=1, then XOR the i-th row with the j-th row to ensure that an upper triangular matrix is constructed, and the area below the diagonal of the matrix is all 0. Then follow the previous step and reverse the operation to eliminate all the 1 above the diagonal as 0, you can get the only S1...Sk, and Data1...Datak. Complete the decoding process.) Get the data solutions of S1, S2...Sn, convert this part into a binary sequence, and then convert it into a base sequence , you can get ChunksL1, L2...Ln.
6)根据表2,完成存储文本数据的译码。6) According to Table 2, complete the decoding of the stored text data.
基于DNA的可动态均衡系统,包括:DNA-based dynamic balancing system, including:
获取模块,用于获取第一数据;an acquisition module, configured to acquire the first data;
编码模块,用于对所述第一数据进行编码,得到DNA分子链,所述DNA分子链包括编码数据以及正向引物和反向引物,所述正向引物位于所述编码数据的一端,所述反向引物位于所述编码数据的另一端,所述编码数据包括伪随机存储器存储次数位、编码数据位、均衡位、再均衡位、校验位;An encoding module, configured to encode the first data to obtain a DNA molecular chain, the DNA molecular chain includes the encoded data, a forward primer and a reverse primer, the forward primer is located at one end of the encoded data, and the The reverse primer is located at the other end of the coded data, and the coded data includes pseudo-random memory storage number bits, coded data bits, equalization bits, re-balanced bits, and check bits;
汉字字符压缩模块:用于当第一数据为汉字文本信息时,对汉字字符进行特殊编码,起到压缩存储空间和加密的作用。Chinese character compression module: used to perform special encoding on Chinese characters when the first data is Chinese character text information, and play the role of compressing storage space and encrypting.
解码模块,用于对编码得到的DNA分子链进行解码,解码还原至完整的第一数据。The decoding module is used to decode the encoded DNA molecular chain, and restore the decoding to the complete first data.
本发明实施例还提供了一种存储介质,存储介质存储有程序,程序被处理器执行完成基于DNA的数据存储方法。The embodiment of the present invention also provides a storage medium, the storage medium stores a program, and the program is executed by a processor to complete the DNA-based data storage method.
本实施例还提供基于DNA的可动态均衡系统的解码方法,包括以下步骤:根据打包结果,进行解码处理。This embodiment also provides a decoding method based on a DNA-based dynamic equalization system, which includes the following steps: performing decoding processing according to the packing result.
当DNA为采用上述的打包方式得到时,当需要读取DNA而进行解码时,只需要利用打包结果,即相当于仅需要在生成的矩阵G中发送1的行位置和列位置,而不发送整个生成矩阵G,然后仅需要根据接收到的行位置和列位置进行解码,恢复生成的矩阵来翻译原始数据。在这个阶段LT码的编解码过程和应用是单位原始数据进行封装传输,而单位数据包传输,在大量数据的情况下会出现占用更多的内存和带宽问题当较大和出现的有效性和可靠性的现象出现下降。经过上述的处理方式,即封装编码传输后的一些位,替换原来的单位数传输,使数据量大大减少存储空间,减少存储冗余量,提高解码成功率。When the DNA is obtained by the above packaging method, when the DNA needs to be read and decoded, only the packaging result needs to be used, which is equivalent to only sending the row position and column position of 1 in the generated matrix G, instead of sending The entire generator matrix G, then only needs to be decoded according to the received row and column positions, and the generated matrix is restored to translate the original data. At this stage, the encoding and decoding process and application of LT codes are packaged and transmitted per unit of raw data, while the transmission of unit data packets will occupy more memory and bandwidth problems when large and appear effective and reliable in the case of large amounts of data. Sexuality has declined. After the above-mentioned processing method, that is, some bits after encoding and transmission are encapsulated, and the original unit number transmission is replaced, the data volume is greatly reduced, the storage space is reduced, the storage redundancy is reduced, and the decoding success rate is improved.
尽管已经示出和描述了本发明的实施例,对于本领域的普通技术人员而言,可以理解在不脱离本发明的原理和精神的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由所附权利要求及其等同物限定。Although the embodiments of the present invention have been shown and described, those skilled in the art can understand that various changes, modifications and substitutions can be made to these embodiments without departing from the principle and spirit of the present invention. and modifications, the scope of the invention is defined by the appended claims and their equivalents.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210953717.2A CN115423096A (en) | 2022-08-10 | 2022-08-10 | DNA-based dynamic equalization system, data storage method and decoding method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210953717.2A CN115423096A (en) | 2022-08-10 | 2022-08-10 | DNA-based dynamic equalization system, data storage method and decoding method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115423096A true CN115423096A (en) | 2022-12-02 |
Family
ID=84195575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210953717.2A Pending CN115423096A (en) | 2022-08-10 | 2022-08-10 | DNA-based dynamic equalization system, data storage method and decoding method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115423096A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115514375A (en) * | 2022-11-18 | 2022-12-23 | 江苏网进科技股份有限公司 | Cache data compression method |
-
2022
- 2022-08-10 CN CN202210953717.2A patent/CN115423096A/en active Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115514375A (en) * | 2022-11-18 | 2022-12-23 | 江苏网进科技股份有限公司 | Cache data compression method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111600609B (en) | A DNA Storage Coding Method for Optimizing Chinese Storage | |
CN111858507B (en) | DNA-based data storage method, decoding method, system and device | |
Lu et al. | Extractors: Optimal up to constant factors | |
CN101222295B (en) | System for distributing data by dividing the same into plural pieces of partial data | |
CN101610088A (en) | System and method for encoding data based on compression techniques with security features | |
CN113314187B (en) | Data storage method, decoding method, system, device and storage medium | |
Shpilka | New constructions of WOM codes using the Wozencraft ensemble | |
US20080005648A1 (en) | Data compression | |
CN114328000B (en) | DNA storage cascade coding and decoding method for 1 type 2 type segment error correction inner code | |
CN115423096A (en) | DNA-based dynamic equalization system, data storage method and decoding method | |
US12218698B2 (en) | File compression system | |
Tang et al. | Error-correcting codes for noisy duplication channels | |
CN116594572A (en) | Floating point number stream data compression method, device, computer equipment and medium | |
US20240184666A1 (en) | Preprocessing for Correcting Insertions and Deletions in DNA Data Storage | |
CN117111854A (en) | Data storage method, device and medium based on distributed encryption storage | |
Van Zandwijk | A mathematical approach to NAND flash-memory descrambling and decoding | |
CN110209598B (en) | Cache memory, data read-write control method and system | |
Hidayat et al. | Comparison of lossless compression schemes for WAV audio data 16-bit between Huffman and coding arithmetic | |
CN116187435B (en) | Method and system for storing information by utilizing DNA (deoxyribonucleic acid) based on large and small fountain codes and MRC (MRC) algorithm | |
Mu et al. | RBS: a rotational coding based on blocking strategy for DNA storage | |
CN109831544B (en) | Code storage method and system applied to email address | |
CN116880778B (en) | User privacy protection method based on regenerative coding and distributed storage | |
CN117040539B (en) | Petroleum logging data compression method and device based on M-ary tree and LZW algorithm | |
Poyias et al. | m-Bonsai: A practical compact dynamic trie | |
US9143163B2 (en) | Method and system for text compression and decompression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |