CN114958828B

CN114958828B - Data information storage method based on DNA molecule medium

Info

Publication number: CN114958828B
Application number: CN202210668109.7A
Authority: CN
Inventors: 戴俊彪; 黄小罗
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2024-04-19
Anticipated expiration: 2042-06-14
Also published as: WO2023240950A1; CN114958828A

Abstract

The invention relates to a data storage method based on a DNA molecular medium, and in particular to a recombinant plasmid library composed of recombinant plasmids containing universal information storage DNA, in which the plasmids use base linkers as the base sequences for recording information. The recombinant plasmid library can be reused to store different data in DNA molecular media and can be prepared repeatedly, which reduces the proportion of chemical synthesis and makes DNA data storage more economical, faster and greener.

Description

Data information storage method based on DNA molecule medium

Technical Field

The present invention belongs to the technical field of data storage, and in particular relates to a data storage method, apparatus, device and readable storage medium based on a DNA molecular medium.

Background Art

With the development of the Internet, big data, artificial intelligence and other technologies, global data is growing exponentially. Traditional storage media such as hard disks, magnetic tapes and optical discs cannot meet the growing demand for mass data storage because of their high maintenance costs, large physical footprint and short storage life.

As a data storage medium, DNA offers high storage density, long storage life and low maintenance cost. By one calculation, 1 g of DNA can store the equivalent of 1 million high-definition movies. Moreover, as the carrier of genetic information, DNA can be inserted into animal, plant and microbial cells and, through the replication of living organisms, be stored permanently from generation to generation.

Current DNA data storage methods comprise the following steps: 1) encoding the data into DNA sequences; 2) synthesizing the DNA sequences; 3) sequencing the synthesized DNA; 4) decoding the sequenced DNA back into data. In 2012, George Church et al. stored 53,426 words, 11 JPG pictures and 1 Java program in synthetic DNA. In 2013, Goldman et al. stored 757,051 bytes of data in various formats in synthetic DNA. In 2018, Microsoft and the University of Washington jointly reported in Nature Biotechnology the storage of 200Mb of data in synthetic DNA. All of these studies stored data in single-use synthetic DNA libraries. Stored bits correspond to synthesized bases at a ratio of 1 base = 1~2 bits, so storing a given amount of binary data requires synthesizing at least half as many bases as there are bits. Because DNA synthesis is expensive (the current commercial price is about 0.01 to 1 yuan per nt, depending on the synthesis method) and the synthesized DNA can be used only once, for the data it encodes, the final cost of DNA data storage is high.

How to substantially reduce the cost of DNA data storage is the main challenge currently facing this field.

Summary of the Invention

The problem addressed by the present invention is that DNA data storage is currently expensive and that the costly synthetic DNA used for it cannot be reused. To address this, the present invention proposes a method that prepares, in bulk by biological fermentation, a universal plasmid DNA library that can be drawn on repeatedly, and that uses enzymatic assembly to store arbitrary information in plasmid DNA in vitro or in vivo, so that the raw-material DNA can be reused.

One aspect of the present invention provides a method for preparing a recombinant plasmid containing universal information storage DNA, the preparation method comprising the following steps:

S01): designing a universal information DNA sequence module comprising two groups of base linkers and a linker arm sequence located between them; joining the two groups of base linkers and the linker arm sequence between them to obtain a base linker core sequence; and adding type II restriction endonuclease linker sequences to the left and right ends of the base linker core sequence to obtain the universal information DNA sequence module sequence;

S02): synthesizing the universal information DNA sequence module and inserting it into a plasmid vector to obtain a recombinant plasmid;

Optionally, S03): introducing the recombinant plasmid into a host bacterium for culture and preservation, or extracting the recombinant plasmid containing the universal information DNA sequence from the host bacterium for preservation.

Furthermore, the base linkers are a set formed from arbitrary combinations of bases, and the base linkers are used to record information.

Furthermore, the linker arm sequence is a known, predetermined base sequence.

Furthermore, the two groups of base linkers have n1 bases and n2 bases respectively, where n1 and n2 are independently selected from the integers 3 to 10, for example 3, 4, 5, 6, 7, 8, 9 or 10.
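The layout described in S01) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flanking enzyme linker sequences and the linker arm sequence below are hypothetical placeholders (a BsaI-style site is used purely as an example), and the function name `build_module` is not from the source.

```python
# Hypothetical placeholder sequences; the source does not fix any of them.
LEFT_ENZYME_LINKER = "GGTCTCA"   # assumed example: BsaI-style site plus one base
RIGHT_ENZYME_LINKER = "TGAGACC"  # reverse complement of the left linker
LINKER_ARM = "ACGTACGTAC"        # assumed 10-base linker arm sequence


def build_module(linker1: str, linker2: str, arm: str = LINKER_ARM) -> str:
    """Return enzyme linker + base linker 1 + linker arm + base linker 2 + enzyme linker."""
    for name, seq in (("linker1", linker1), ("linker2", linker2)):
        if not 3 <= len(seq) <= 10:
            raise ValueError(f"{name} must be 3 to 10 bases long")
        if set(seq) - set("ACGT"):
            raise ValueError(f"{name} must contain only A, C, G, T")
    if not 3 <= len(arm) <= 100:
        raise ValueError("linker arm must be 3 to 100 bases long")
    core = linker1 + arm + linker2  # the base linker core sequence
    return LEFT_ENZYME_LINKER + core + RIGHT_ENZYME_LINKER


print(build_module("ACGT", "TTGA"))
```

Only the two base linkers change from plasmid to plasmid; the enzyme linkers and the linker arm stay fixed across a library.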

Another aspect of the present invention provides a recombinant plasmid containing universal information storage DNA, which is prepared by the above method; or

the recombinant plasmid contains a universal information DNA sequence module sequence, the module comprising two groups of base linkers and a linker arm sequence located between them, the two groups of base linkers and the linker arm sequence between them being joined to form a base linker core sequence, with type II restriction endonuclease linker sequences at the left and right ends of the base linker core sequence.

In another aspect, the present invention provides a recombinant plasmid library comprising a plurality of the above recombinant plasmids containing universal information storage DNA.

Furthermore, the recombinant plasmid library comprises 4^(n1+n2) recombinant plasmids containing universal information storage DNA with different base linkers, where n1 and n2 are the lengths in bases of the first base linker and the second base linker, respectively.

n1 and n2 are independently selected from the integers 3 to 10, for example 3, 4, 5, 6, 7, 8, 9 or 10.

Furthermore, the recombinant plasmids in the same recombinant plasmid library share the same linker arm sequence.

Furthermore, the linker arm sequence of the recombinant plasmids in the same recombinant plasmid library is 3-100 bases long, for example 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100 bases.

In another aspect, the present invention provides a DNA storage method for data, the DNA storage method comprising the following steps:

S11) extracting the binary information of the computer data to be stored;

S12) converting the binary information into an information recording base sequence according to a preset mapping relationship;

S13) splitting the information recording base sequence into short base fragments, with overlapping base sequences between adjacent short base fragments; or

with overlapping base sequences between the short base fragments formed from adjacent odd-position sequences and between the short base fragments formed from adjacent even-position sequences;

S14) splitting each short base fragment obtained in step S13) in the middle, so that each short base fragment is divided into a first short base fragment and a second short base fragment;

S15) obtaining the required information recording recombinant plasmids having a universal information DNA sequence, either by preparing them with the above preparation method of the present invention or by selecting them from the recombinant plasmid library containing universal information storage DNA of the present invention; on each information recording recombinant plasmid, the first base linker is identical to the first short base fragment obtained in step S14) and the second base linker is identical to the second short base fragment obtained in step S14); recombinant plasmids are obtained in this way for all of the short base fragments produced by the split in step S13);

S16) digesting the recombinant plasmids obtained in step S15) with a restriction endonuclease, collecting the digested fragments, and joining them with a ligase into one complete information recording fragment;

S17) obtaining a vector recombinant plasmid having a universal information DNA sequence, either by preparing it with the preparation method of the present invention or by selecting it from the recombinant plasmid library containing universal information storage DNA of the present invention; on the vector recombinant plasmid, the first base linker is identical to the first short base fragment of the 5' end short base fragment of the information recording base sequence from step S13), and the second base linker is identical to the second short base fragment of the 3' end short base fragment of the information recording base sequence from step S13);

S18) digesting the recombinant plasmid obtained in step S17) with a restriction endonuclease, collecting the digested plasmid as a vector plasmid, and inserting the information recording fragment obtained in step S16) into the vector plasmid to obtain a recombinant plasmid carrying the recorded information.
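The computational part of steps S11) to S14) can be sketched as follows. The source only says the mapping is "preset", so the two-bits-per-base table here is an assumption, as are the function names and the choice of equal linker lengths n1 = n2 = n. With that choice, each short fragment is 2n bases long and overlaps its neighbour by n bases, so the second half of one fragment equals the first half of the next.

```python
# Assumed mapping (2 bits per base); the source does not specify one.
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_bases(data: bytes) -> str:
    """S11/S12: turn bytes into a bit string, then map each bit pair to a base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def split_overlapping(seq: str, n: int = 4) -> list[tuple[str, str]]:
    """S13/S14: windows of 2n bases with stride n, each cut in the middle.

    Returns (first short fragment, second short fragment) pairs; each pair
    names the two base linkers of the library plasmid to pick in S15.
    """
    if len(seq) < 2 * n or len(seq) % n != 0:
        raise ValueError("sequence length must be a multiple of n and at least 2n")
    pairs = []
    for start in range(0, len(seq) - n, n):
        window = seq[start:start + 2 * n]
        pairs.append((window[:n], window[n:]))
    return pairs


bases = bytes_to_bases(b"Hi!")
for first, second in split_overlapping(bases, n=4):
    print(first, second)
```

The overlap is what lets the digested fragments ligate in a defined order in S16): the sticky end left by one plasmid's second base linker matches the first base linker of the next.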

Furthermore, the recombinant plasmid carrying the recorded information obtained in step S18) is freeze-dried and stored directly in vitro.

Furthermore, the recombinant plasmid carrying the recorded information obtained in step S18) is transferred into a host bacterium, where it is replicated and stored.

Furthermore, the recombinant plasmid carrying the recorded information obtained in step S18) is transferred into a host bacterium, replicated, and then extracted and stored as a freeze-dried powder.

Furthermore, the host bacterium is any microorganism in which the recombinant plasmid can be replicated and preserved, preferably Escherichia coli, yeast, Lactobacillus, Bifidobacterium or Enterococcus.

Furthermore, in step S11), the computer data is any data that can exist on a computer, preferably selected from pictures, text, programs, audio and video.

Furthermore, when the base sequence is divided into a plurality of short base fragments and the terminal short base fragment is shorter than the others, it is padded with repeated bases to the same length as the other short base fragments.
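A minimal sketch of this padding rule is given below. The source says only that "repeated bases" are used, so the pad base `A` and the function name are assumptions.

```python
def pad_to_multiple(seq: str, block: int, pad_base: str = "A") -> str:
    """Append a repeated base until len(seq) is a multiple of block."""
    remainder = len(seq) % block
    if remainder == 0:
        return seq
    return seq + pad_base * (block - remainder)


print(pad_to_multiple("ACGTAC", 4))
```

A decoder would need some way to tell padding from data, for example a stored length; the source does not describe one, so that part is left open here.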

Furthermore, in step S12), before the information recording base sequence is split into short base fragments, it is first divided into a plurality of sub-sequences, and an index fragment is added in front of each sub-sequence. Preferably, the index fragment is 3-100 bases long, for example 3, 4, 5, 6, 7, 8, 9 or 10 bases. Each sub-sequence receives a different index fragment, and in S13) the sub-sequences, each carrying its own index fragment, are split into short base fragments separately.
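One way to generate distinct index fragments is to write the sub-sequence number in base 4 using the four bases as digits. The sketch below does this; the digit order `ACGT`, the function names and the fixed index width are all assumptions, since the source only requires that the index fragments differ.

```python
def index_to_bases(i: int, width: int) -> str:
    """Encode integer i as a fixed-width base-4 string over A, C, G, T."""
    if not 0 <= i < 4 ** width:
        raise ValueError("index does not fit in the given width")
    digits = "ACGT"
    out = []
    for _ in range(width):
        out.append(digits[i % 4])
        i //= 4
    return "".join(reversed(out))


def add_indexes(seq: str, chunk_len: int, width: int = 4) -> list[str]:
    """Cut seq into sub-sequences and prefix each with its own index fragment."""
    chunks = [seq[k:k + chunk_len] for k in range(0, len(seq), chunk_len)]
    return [index_to_bases(k, width) + chunk for k, chunk in enumerate(chunks)]


print(add_indexes("ACGTACGTAC", chunk_len=4, width=3))
```

With a width of w bases this scheme distinguishes up to 4^w sub-sequences, which is why the width has to grow with the amount of data.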

Furthermore, when the information recording base sequence has been divided into sub-sequences, in S16) one complete information recording fragment is prepared for each sub-sequence.

Furthermore, when the information recording base sequence has been divided into sub-sequences, in S18) the information recording fragment prepared for each sub-sequence is inserted into a different vector plasmid, giving a set of recombinant plasmids carrying the recorded information.

In another aspect, the present invention provides a method for storing data in DNA and retrieving it, the method comprising the following steps:

Step S1): storing the data in DNA using the above method;

Step S2): extracting the DNA, then analysing and decoding it to obtain the stored data, wherein step S2) comprises the following steps:

Step S21): extracting the vector plasmid carrying the information recording fragment and reading it by sequencing;

Step S22): removing the restriction sites and linker arm sequences from the sequencing data to obtain the base sequence that records the digital information;

Step S23): converting the base sequence obtained in step S22) into binary information;

Step S24): extracting the corresponding data from the binary information.
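Steps S22) and S23) can be sketched as below, under several assumptions that the source leaves open: the read has already been trimmed to the assembled insert, every information block has the same length n, the linker arm sequence is known, and bases map to bits as A=00, C=01, G=10, T=11. The function names are illustrative.

```python
# Assumed mapping (2 bits per base); the source only says it is preset.
BASE_TO_BIT_PAIR = {"A": "00", "C": "01", "G": "10", "T": "11"}


def strip_linker_arms(read: str, n: int, arm: str) -> str:
    """S22: keep the n-base information blocks and drop the linker arm between them."""
    payload = []
    pos = 0
    while pos < len(read):
        block = read[pos:pos + n]
        if len(block) != n:
            raise ValueError("truncated information block")
        payload.append(block)
        pos += n
        if pos < len(read):
            if read[pos:pos + len(arm)] != arm:
                raise ValueError(f"expected linker arm at position {pos}")
            pos += len(arm)
    return "".join(payload)


def bases_to_bytes(seq: str) -> bytes:
    """S23: map each base to two bits and pack the bits into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("need a multiple of 4 bases to form whole bytes")
    bits = "".join(BASE_TO_BIT_PAIR[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


read = "CAGA" + "TTT" + "CGGC" + "TTT" + "AGAC"  # blocks joined by an assumed arm "TTT"
print(bases_to_bytes(strip_linker_arms(read, n=4, arm="TTT")))
```

Real sequencing reads contain errors and flanking vector sequence, so a practical decoder would need alignment and error handling that this sketch omits.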

Another aspect of the present invention provides use of the above recombinant plasmid or the above recombinant plasmid library in preparing recombinant plasmids for storing information in DNA.

Another aspect of the present invention provides an apparatus for DNA data storage using a recombinant plasmid library containing universal information storage DNA, the apparatus comprising:

a binary sequence extraction unit, used to extract the binary sequence corresponding to the data to be stored;

a binary sequence-base sequence conversion unit, used to convert the binary sequence into a base sequence, or a base sequence into a binary sequence, according to a preset mapping relationship;

a short base fragment splitting design unit, used to split the base sequence into short base fragments, with overlapping base sequences between adjacent short base fragments; or

with overlapping base sequences between the short base fragments formed from adjacent odd-position sequences and between the short base fragments formed from adjacent even-position sequences;

the short base fragment splitting design unit being further able to split each short base fragment in the middle, so that each short base fragment is divided into a first short base fragment and a second short base fragment;

a recombinant plasmid library unit, the recombinant plasmid library comprising the above recombinant plasmids containing universal information storage DNA, the unit being able to select from those plasmids all of the required information recording recombinant plasmids and the vector recombinant plasmid; a required information recording recombinant plasmid is one whose first base linker is identical to the first short base fragment of a short base fragment designed by the short base fragment splitting design unit and whose second base linker is identical to the second short base fragment of that short base fragment; on the vector recombinant plasmid, the first base linker is identical to the first short base fragment of the 5' end short base fragment of the base sequence obtained by the binary sequence-base sequence conversion unit, and the second base linker is identical to the second short base fragment of the 3' end short base fragment of that base sequence;

a recombinant plasmid library digestion and splicing unit, used to digest all of the required information recording recombinant plasmids selected by the recombinant plasmid library unit and ligate them into an information recording fragment, and to digest the vector recombinant plasmid and ligate the information recording fragment into it to obtain a recombinant plasmid;

a storage and replication unit, used to store the recombinant plasmid obtained by the digestion and splicing unit, or to replicate the recombinant plasmids of the recombinant plasmid library unit or the information-storing recombinant plasmid obtained by the digestion and splicing unit.

Furthermore, the apparatus also includes a decoding device.

Furthermore, the decoding device comprises:

a DNA sequence acquisition unit, used to acquire the DNA sequence in the recombinant plasmid storing the information;

a data analysis unit, used to remove the restriction sites and linker arm sequences from the DNA sequence to obtain the base sequence that records the digital information, to convert the base sequence into a binary sequence, and to convert the binary sequence into the stored data.

Still another aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above DNA storage method or of the above method for storing data in DNA and retrieving it.

Still another aspect of the present invention provides a computer device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the computer program, implements the steps of the above DNA storage method or of the above method for storing data in DNA and retrieving it.

Beneficial Effects

With current DNA data storage technology, synthetic DNA cannot be reused and DNA data storage is expensive. The present invention provides a practical technical method that solves this problem.

The present invention provides a recombinant plasmid library of universal information storage DNA containing many different recombinant plasmids. Each time information is stored, only the required recombinant plasmids need to be selected, and the plasmids consumed can be replenished quickly by bacterial fermentation without being rebuilt chemically each time.

By splitting the DNA sequence that stores the data into short, staggered, overlapping fragments and drawing on a pre-synthesized recombinant plasmid library that can be used many times, the method of the present invention achieves the assembly of arbitrary DNA and its storage in vivo and in vitro.

The method of the present invention can be applied to DNA data storage in different scenarios and is flexible, general and inexpensive. The present invention prepares plasmid DNA in bulk by bacterial fermentation, which causes less pollution and costs less than conventional chemical DNA synthesis. The method is suitable not only for storing data in freeze-dried plasmid DNA but also for storing data at scale in living organisms, significantly reducing the cost of in vivo DNA storage.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a recombinant plasmid containing universal information storage DNA.

In FIG. 1, 1 and 5 are restriction endonuclease linkers, 2 is the first base linker, 3 is the linker arm sequence, and 4 is the second base linker.

FIG. 2 is a schematic diagram of adjacent short base fragments in staggered overlap by 3 bases or by 4 bases.

FIG. 3 is a schematic diagram of the preparation of a recombinant plasmid of universal information storage DNA.

FIG. 4 is a schematic diagram of staggered overlap by 4 bases in the odd-plus-even overlapping arrangement.

FIG. 5 is a schematic diagram of padding with bases when the last short base fragment cannot reach the length of the preceding short base fragments.

FIG. 6 is a schematic flow chart of the method for preparing a recombinant plasmid containing universal information storage DNA.

FIG. 7 is a schematic flow chart of the DNA data storage method of the present invention.

FIG. 8 is a schematic flow chart of the DNA data storage and retrieval method of the present invention.

Detailed Description

To make the above objects, features and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings; they should not be construed as limiting the scope in which the invention can be practised.

To solve the high storage and replication costs of prior-art methods of storing information in DNA, a universal information storage plasmid DNA library is proposed that can be synthesized in advance and drawn on many times. The library can be mass-produced beforehand or synthesized specifically for the information to be stored.

As shown in FIG. 1, the technical solution of the present invention first provides a method for preparing a recombinant plasmid containing universal information storage DNA, the preparation method comprising the following steps:

S01): designing a universal information DNA sequence module comprising two groups of base linkers and a linker arm sequence located between them, which together form the base linker core sequence; and adding type II restriction endonuclease linkers to the left and right ends of the base linker core sequence to obtain the universal information DNA sequence module sequence;

In S01), the base linkers are a set formed from arbitrary combinations of bases, and the base linkers are used to record information.

The base linkers are also used for complementary ligation to the base linkers in other base linker core sequences.

The linker arm sequence is used to connect the two groups of base linkers.

The "known, predetermined base sequence" mentioned above means that, when the recombinant plasmid is synthesized, the linker arm sequence is a fixed sequence that has already been decided. Its bases can be chosen according to practical needs such as synthesis efficiency, and in principle it can be any sequence; however, because it is removed during later decoding, it must be known and fixed at synthesis time.

That the base linkers are "a set formed from arbitrary combinations of bases" means that the base linker sequences vary from plasmid to plasmid; preparing a complete plasmid library requires synthesizing base linkers in every permutation and combination.

In S01), the type II restriction endonuclease linkers added at the left and right ends of the base linker core sequence ensure that, after digestion with the corresponding type II restriction endonuclease, the sequence has sticky ends carrying the above base linkers at its two ends.

In a specific embodiment, the two groups of base linkers in step S01) are a first base linker and a second base linker whose lengths are independently selected from 3-5 bases. For example, with a first base linker of 4 bases and a second base linker of 4 bases, the library contains 4⁸ = 65536 combinations. With a first base linker of 3 bases and a second base linker of 3 bases, it contains 4⁶ = 4096 combinations. With one base linker of 3 bases and the other of 4 bases, in either order, it contains 4⁷ = 16384 combinations, and so on.
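The library sizes quoted above follow directly from 4^(n1+n2). A short check, assuming the 3 to 10 base range given for n1 and n2 elsewhere in this document (the function name is illustrative):

```python
def library_size(n1: int, n2: int) -> int:
    """Number of distinct (first linker, second linker) pairs: 4 ** (n1 + n2)."""
    if not (3 <= n1 <= 10 and 3 <= n2 <= 10):
        raise ValueError("n1 and n2 must each be between 3 and 10")
    return 4 ** (n1 + n2)


for n1, n2 in [(4, 4), (3, 3), (3, 4)]:
    print(n1, n2, library_size(n1, n2))
```

The size grows by a factor of 4 for every extra linker base, which is the practical trade-off between information per plasmid and the number of plasmids a complete library must hold.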

其中在具体实施方式中，步骤S01)中的所述连接间臂序列为固定长度的碱基序列。例如该第一连接间臂序列长度为3个碱基以上，例如3-100个碱基长度、10个、20个、30个、40个、50个、60个、70个、80个、90个。In a specific embodiment, the linker arm sequence in step S01) is a base sequence of fixed length. For example, the length of the first linker arm sequence is more than 3 bases, such as 3-100 bases, 10, 20, 30, 40, 50, 60, 70, 80, 90 bases.

其中在具体实施方式中，步骤S01)中在碱基接头核心序列左右两端加入的限制性内切酶为相同种类的限制性内切酶，或者是不同种类的限制性内切酶，所述限制性内切酶能够酶切出包含第一碱基接头和第二碱基接头的粘性末端。In a specific embodiment, the restriction endonucleases added to the left and right ends of the base linker core sequence in step S01) are restriction endonucleases of the same type, or restriction endonucleases of different types, and the restriction endonucleases can enzymatically cut out sticky ends containing the first base linker and the second base linker.

其中在具体实施方式中，限制性内切酶可以选自Alw I、AflII、AflIII、AgeI、ApaI、ApaLI、ApeKI、ApoI、AscI、AvaI、AvaII、BamHI、BanI、BanII、Bbs I、Bbv I、Bbv CI、BciV I、Bpm I、Bsal I、BsmA I、BsmF I、Eci I、Fok I、Hga I、Hph I、Mbo II、或Mme I中的任意一种或多种的组合。或者也可以采用现有技术中已知能够产生能够酶切出包含第一碱基接头和第二碱基接头的粘性末端的限制性内切酶。In a specific embodiment, the restriction endonuclease can be selected from any one or more combinations of Alw I, AflII, AflIII, AgeI, ApaI, ApaLI, ApeKI, ApoI, AscI, AvaI, AvaII, BamHI, BanI, BanII, Bbs I, Bbv I, Bbv CI, BciV I, Bpm I, Bsal I, BsmA I, BsmF I, Eci I, Fok I, Hga I, Hph I, Mbo II, or Mme I. Alternatively, a restriction endonuclease known in the prior art that can produce a sticky end that can enzymatically cut out a first base linker and a second base linker can also be used.

In a specific embodiment, in step S02), the universal information DNA sequence module is synthesized by chemical synthesis or by biological synthesis.

On the basis of the above technical solution, the present invention also provides a recombinant plasmid containing universal information storage DNA prepared by the above preparation method.

On the basis of the above technical solution, the present invention also provides a recombinant plasmid library containing universal information storage DNA, wherein the recombinant plasmid library comprises two or more recombinant plasmids containing universal information storage DNA that have different base linkers.

On the basis of the above technical solution, the recombinant plasmid library comprises 4^(n1+n2) recombinant plasmids containing universal information storage DNA with different base linkers, where n1 and n2 are the base lengths of the first base linker and the second base linker, respectively.
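
As a rough illustration of the library size 4^(n1+n2) stated above, the following Python sketch counts and enumerates the (first linker, second linker) pairs for given lengths. The function names are illustrative assumptions and are not part of the described method.

```python
from itertools import product

BASES = "ATCG"

def library_size(n1: int, n2: int) -> int:
    """Number of distinct linker pairs for linker lengths n1 and n2."""
    return 4 ** (n1 + n2)

def enumerate_linker_pairs(n1: int, n2: int):
    """Yield every (first_linker, second_linker) pair of the given lengths."""
    for first in product(BASES, repeat=n1):
        for second in product(BASES, repeat=n2):
            yield "".join(first), "".join(second)

if __name__ == "__main__":
    print(library_size(4, 4))  # 65536
    print(library_size(3, 4))  # 16384
    print(sum(1 for _ in enumerate_linker_pairs(3, 3)))  # 4096
```

The enumeration is lazy, so the 65536-member case does not have to be held in memory at once.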

In some applications of the technology, the recombinant plasmids or recombinant plasmid library containing universal information storage DNA prepared as described above are suitably preserved; preparing or prefabricating them on a large scale can reduce the cost of storing information in DNA.


Some embodiments of the present invention also provide a DNA storage method for data, the DNA storage method comprising the following steps:

S13) splitting the information recording base sequence into short base fragments, wherein adjacent short base fragments share overlapping base sequences;

S15) preparing by the preparation method of the present invention, or selecting from the recombinant plasmid library containing universal information storage DNA of the present invention, the required recombinant plasmids with a universal information DNA sequence for information recording; the first base linker on each such recombinant plasmid is the same as the first short base fragment obtained in step S14), and the second base linker is the same as the second short base fragment obtained in step S14); recombinant plasmids with universal information DNA sequences are obtained for all the short base fragments split in step S13);

S17) preparing by the preparation method of the present invention, or selecting from the recombinant plasmid library containing universal information storage DNA of the present invention, a recombinant plasmid with a universal information DNA sequence for use as the vector; the first base linker on this recombinant plasmid is the same as the first short base fragment of the 5'-end short base fragment of the information recording base sequence designed in step S13), and the second base linker is the same as the second short base fragment of the 3'-end short base fragment of the information recording base sequence designed in step S13);

Further, the recombinant plasmid recording the information obtained in step S18) is transferred into a host bacterium, replicated, and then extracted and stored as freeze-dried powder.

In a specific embodiment, in step S11), the computer data information is selected from any data information that can exist on a computer, such as pictures, text, programs, audio and video.

In a specific embodiment, in step S12), the binary information is information composed of 0 and 1, and the base sequence is a sequence composed of the four bases A, T, C and G. The mapping relationship includes any scheme that converts 0/1 information into A/T/C/G information, such as the Huffman coding published by Goldman et al. (Goldman et al. 2013, Nature) and the Fountain coding published by Erlich et al. (Erlich et al. 2017, Science).

In a specific embodiment, the mapping relationship maps 00 to A, 01 to T, 11 to C and 10 to G. The mapping may also be a different pairing of binary codes and bases, for example 11 to A, 10 to T, 01 to C and 00 to G.
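
The two-bit mapping described above can be sketched in Python as follows. This is a minimal illustration only; the table names and the function `bits_to_bases` are assumptions introduced here, and either table from the text can be passed in.

```python
# Mapping from the embodiment above: 00->A, 01->T, 11->C, 10->G.
BITS_TO_BASE = {"00": "A", "01": "T", "11": "C", "10": "G"}

# Alternative mapping also named in the text: 11->A, 10->T, 01->C, 00->G.
ALT_BITS_TO_BASE = {"11": "A", "10": "T", "01": "C", "00": "G"}

def bits_to_bases(bits: str, table=BITS_TO_BASE) -> str:
    """Convert a string of 0/1 characters into bases, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(table[bits[i:i + 2]] for i in range(0, len(bits), 2))

if __name__ == "__main__":
    print(bits_to_bases("00011110"))                    # ATCG
    print(bits_to_bases("00011110", ALT_BITS_TO_BASE))  # GCAT
```

Because every two-bit value has exactly one base in each table, the conversion is reversible as long as the decoder uses the same table.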

In a specific embodiment, when the information recording base sequence file is large, it can be divided into several sub-files that are tracked with index sequences, and in step S13) the sub-files carrying the corresponding index sequences are selected for splitting. Index sequences are, for example, AAAA = 1, AAAG = 2, AAAT = 3, AAAC = 4, AATA = 5, AACA = 6, and so on.

In a specific embodiment, in step S13), the overlapping base sequences between adjacent short base fragments are of equal or unequal length, and each overlap is 1-10 bases long, for example 1, 2, 3, 4, 5, 6 or 7 bases. FIG. 2 shows overlaps of 3 bases and of 4 bases.

In a specific embodiment, in step S13), two short base fragments that are separated by one intervening short base fragment share no overlapping base sequence.

In a specific embodiment, the staggered overlap between the short base fragments in step S13) can be an overlap between adjacent sequences, or an overlap that follows another rule, such as odd-plus-even overlap.

FIG. 2 shows the staggered overlap of adjacent sequences, and FIG. 4 shows the overlap of odd-numbered sequences.

In a specific embodiment, in step S13), when the last short base fragment cannot reach the same length as the preceding short base fragments, it is padded with bases and the padding is recorded (FIG. 5).
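
The splitting of step S13), including the padding of a short final fragment, might be sketched as below. This is a hedged illustration: the function name, the fixed fragment length, the equal overlap between all neighbours and the choice of "A" as padding base are assumptions, since the text allows unequal overlaps and does not name a padding base.

```python
def split_with_overlap(seq: str, frag_len: int, overlap: int, pad_base: str = "A"):
    """Split seq into fragments of frag_len whose neighbours share `overlap` bases.

    Returns (fragments, pad_count), where pad_count is the number of padding
    bases appended to the last fragment (recorded so they can be removed later).
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    if not 0 < overlap < frag_len:
        raise ValueError("overlap must be between 1 and frag_len - 1")
    step = frag_len - overlap
    fragments = []
    pad_count = 0
    start = 0
    while True:
        frag = seq[start:start + frag_len]
        if len(frag) < frag_len:
            # Last fragment is short: fill it up and remember how much was added.
            pad_count = frag_len - len(frag)
            frag += pad_base * pad_count
        fragments.append(frag)
        if start + frag_len >= len(seq):
            break
        start += step
    return fragments, pad_count

if __name__ == "__main__":
    print(split_with_overlap("GTTCGTAACGCTGGCC", 8, 4))
    # (['GTTCGTAA', 'GTAACGCT', 'CGCTGGCC'], 0)
    print(split_with_overlap("GTTCGTAACGCTGG", 8, 4))
    # (['GTTCGTAA', 'GTAACGCT', 'CGCTGGAA'], 2)
```

With an overlap of at most half the fragment length, fragments separated by one intervening fragment share no bases, which matches the embodiment described above.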

In a specific embodiment, the plasmid DNA in step S18) can be kept in a culture of the host bacterium and preserved as a bacterial culture, or extracted and freeze-dried for preservation as a dry powder.

In a specific embodiment, the plasmid DNA extracted in step S18) can be stored at different temperatures, including -20°C and -80°C.

Some embodiments of the present invention also provide a method for decoding data information.

Step S23): converting the base sequence information obtained in step S22) into 0/1 binary information;

Step S24): extracting the corresponding picture/text/video information from the 0/1 binary information.

In a specific embodiment, in step S21), the synthetic DNA can be read by any sequencing method capable of reading a DNA sequence library, such as second-generation sequencing or third-generation sequencing.

In a specific embodiment, when the sequencing data in step S22) contain index sequences, the index sequences are read, the sequences are arranged in the order encoded by their index sequences, and the index sequences are then removed and the sequences joined into a complete base sequence.

In a specific embodiment, in step S22), the index sequences can be used to group together the sequences that belong to the same file.
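
The index handling described above could be sketched as follows. The table contains only the six index examples listed in the description; the placement of the index as a four-base prefix of each sub-sequence and the function name `reassemble` are assumptions made for illustration.

```python
# Index examples given in the description; any further entries would be an assumption.
INDEX_TABLE = {"AAAA": 1, "AAAG": 2, "AAAT": 3, "AAAC": 4, "AATA": 5, "AACA": 6}
INDEX_LEN = 4  # assumed: the index is the first four bases of each sub-sequence

def reassemble(sub_sequences):
    """Order index-prefixed sub-sequences, strip the index, and join the payloads."""
    ordered = sorted(sub_sequences, key=lambda s: INDEX_TABLE[s[:INDEX_LEN]])
    return "".join(s[INDEX_LEN:] for s in ordered)

if __name__ == "__main__":
    reads = ["AAAGCCGG", "AAAAGTTC", "AAATTTAA"]  # arrive in arbitrary order
    print(reassemble(reads))  # GTTCCCGGTTAA
```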

In a specific embodiment, in step S23), when the base sequence information is converted into 0/1 binary information, the mapping between A/T/C/G in the base sequence and 0/1 in the binary information is the mapping preset in step S12) of the above DNA storage method for data.

In a specific embodiment, in step S24), any method may be used to extract the corresponding picture/text/video information from the binary 0/1 information.

In a specific embodiment, step S01) of the above method for preparing a recombinant plasmid containing universal information storage DNA, steps S11)-S14) of the DNA storage method for data, and steps S22)-S24) of the method for decoding data information can be carried out by manual calculation or by a computer program.

Some embodiments of the present invention also provide a device for storing DNA data using a recombinant plasmid library containing universal information storage DNA, the device comprising:

In some specific embodiments, the binary sequence-base sequence conversion unit is configured to encode each binary unit in the binary data using a different mapping coding rule, or to encode two adjacent binary units in the binary data using different mapping coding rules.

In some specific embodiments, the device further comprises a decoding device.

In some specific embodiments, the decoding device comprises:

Some embodiments of the present invention further provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above DNA storage method or the steps of the above DNA access method for data information.

Some embodiments of the present invention also provide a computer device comprising a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor, when executing the computer program, implements the steps of the above DNA storage method or the steps of the above DNA access method for data information.

In a specific embodiment, the above computer program can be written in any language, such as Python, R, Perl or C++.

The present invention may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.

The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove on which instructions are stored, and any suitable combination thereof. A computer-readable storage medium as used herein is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to individual computing/processing devices, or to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within that computing/processing device.

The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), is customized using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present invention.

Various aspects of the present invention are described herein with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, where they cause a computer, a programmable data processing apparatus and/or other devices to operate in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

Example 1: Construction of a recombinant plasmid library containing universal information storage DNA

This example constructs a recombinant plasmid library in which the first base linker and the second base linker are each 4 bases long. Recombinant plasmid libraries with other linker lengths can be obtained in a similar way.

01. Design the universal information DNA sequence module, which comprises two groups of base linkers and the linker arm sequence GGATCTCA located between them, to obtain the base linker core sequence; add type II restriction endonuclease linkers to the left and right ends of the base linker core sequence to obtain the universal information DNA sequence module sequence. The type II restriction endonuclease linker selected in this example is the BsaI cleavage sequence:

5'-GGTCTC(N)₁-3' (SEQ ID NO.1)

3'-CCAGAG(N)₅-5' (SEQ ID NO.2)

The two groups of base linkers are called the first base linker and the second base linker, with the first base linker defined as lying 5' (upstream) of the second base linker. Both linkers are 4 bases long, and all permutations and combinations of bases are designed, giving 4⁸ combinations.

02. Synthesize each universal information DNA sequence module designed in step 01 and insert it into a plasmid vector to obtain a recombinant plasmid.

03. Introduce the recombinant plasmid into Escherichia coli for culture, then prepare and extract the recombinant plasmid containing the universal information DNA sequence.

04. Collect the recombinant plasmids containing different first base linkers and second base linkers to form the recombinant plasmid library containing universal information storage DNA.
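
The module design of step 01 can be sketched in Python as below. The layout (recognition site, one extra base, first linker, linker arm, second linker, one extra base, reverse-oriented site) and the use of A as the extra base are read off SEQ ID NO.7 in Example 2; the function names are illustrative assumptions.

```python
SPACER = "GGATCTCA"          # linker arm sequence used in this example
LEFT_SITE = "GGTCTC" + "A"   # recognition site GGTCTC plus one base (A in SEQ ID NO.7)
RIGHT_SITE = "A" + "GAGACC"  # one base plus the reverse-oriented recognition site
_COMPLEMENT = str.maketrans("ATCG", "TAGC")

def build_module(first_linker: str, second_linker: str) -> str:
    """Top strand (5'->3') of a universal information DNA sequence module."""
    return LEFT_SITE + first_linker + SPACER + second_linker + RIGHT_SITE

def complement(seq: str) -> str:
    """Base-by-base complement, i.e. the bottom strand written 3'->5'."""
    return seq.translate(_COMPLEMENT)

if __name__ == "__main__":
    top = build_module("GTTC", "GTAA")
    print(top)              # GGTCTCAGTTCGGATCTCAGTAAAGAGACC
    print(complement(top))  # CCAGAGTCAAGCCTAGAGTCATTTCTCTGG
```

Applying `build_module` to every pair of 4-base linkers would give the 4⁸ module designs of the library; the sketch does not check whether a particular linker pair happens to create an additional recognition site.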

Example 2: Storing and decoding data information

1. Use a computer program to extract the GB18030 binary code of "中国" ("China"): "11010110110100001011100111111010".

2. Using the coding rule "A = 00, T = 01, G = 11, C = 10", convert the above binary information into the DNA sequence "GTTCGTAACGCTGGCC" (SEQ ID NO.3).

3. Divide the DNA sequence into three short base fragments: 1) "GTTCGTAA" (SEQ ID NO.4), 2) "GTAACGCT" (SEQ ID NO.5) and 3) "CGCTGGCC" (SEQ ID NO.6). Adjacent fragments share the overlapping base sequences GTAA and CGCT.

4. Split each of the above short base fragments into two halves, giving "GTTC/GTAA", "GTAA/CGCT" and "CGCT/GGCC".

5. Find the plasmids whose first base sequence and second base sequence correspond to the halves of the short base fragments from step 4; these plasmids carry the following double-stranded sequences:

1) GGTCTCAGTTCGGATCTCAGTAAAGAGACC (SEQ ID NO.7)
CCAGAGTCAAGCCTAGAGTCATTTCTCTGG (SEQ ID NO.8);

2) GGTCTCAGTAAGGATCTCACGCTAGAGACC (SEQ ID NO.9)
CCAGAGTCATTCCTAGAGTGCGATCTCTGG (SEQ ID NO.10);

3) GGTCTCACGCTGGATCTCAGGCCAGAGACC (SEQ ID NO.11)
CCAGAGTGCGACCTAGAGTCCGGTCTCTGG (SEQ ID NO.12).

The restriction endonuclease cleavage site added in this example is the BsaI cleavage site:

5'-GGTCTC(N)₁-3'

3'-CCAGAG(N)₅-5'

The linker arm sequence is GGATCTCA.

Take 100 ng of each plasmid containing the above sequences from the plasmid library of Example 1 and digest with BsaI to obtain the fragments:

1) GTTCGGATCTCA (SEQ ID NO.13)
CCTAGAGTCATT (SEQ ID NO.14);

2) GTAAGGATCTCA (SEQ ID NO.15)
CCTAGAGTGCGA (SEQ ID NO.16);

3) CGCTGGATCTCA (SEQ ID NO.17)
CCTAGAGTCCGG (SEQ ID NO.18).

The three fragments are then joined with ligase into one complete information recording fragment:

GTTCGGATCTCAGTAAGGATCTCACGCTGGATCTCA (SEQ ID NO.19)
CCTAGAGTCATTCCTAGAGTGCGACCTAGAGTCCGG (SEQ ID NO.20).

5. From the plasmid library of Example 1, take the plasmid whose first base sequence matches the 5' end of the first fragment and whose second base sequence matches the 3' end of the last fragment, that is, the plasmid whose first base sequence is GTTC and whose second base sequence is GGCC, and digest it to obtain the empty vector.

6. Insert the complete information recording fragment obtained by the ligation above into the empty vector from step 5, and transform it into Escherichia coli.

7. Preserve the Escherichia coli carrying the relevant plasmid vector as glycerol stocks.
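
Storage steps 1 to 5 above can be traced in silico with the following Python sketch. It is a simplified model under stated assumptions: the two-bit table 00->A, 01->T, 11->G, 10->C is the assignment that reproduces SEQ ID NO.3 from the bit string of step 1; digestion is modelled on the top strand only; and all function names are hypothetical.

```python
SPACER = "GGATCTCA"
LEFT_SITE = "GGTCTCA"   # GGTCTC plus one base; the top strand is cut right after it
RIGHT_SITE = "AGAGACC"  # one base plus GAGACC; the top strand is cut 4 bases before it
# Assumed table: the assignment that maps the step 1 bit string onto SEQ ID NO.3.
TABLE = {"00": "A", "01": "T", "11": "G", "10": "C"}

def text_to_bits(text, encoding="gb18030"):
    """Step 1: bytes of the encoded text as a 0/1 string."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))

def bits_to_bases(bits):
    """Step 2: two bits per base."""
    return "".join(TABLE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def split_fragments(seq, frag_len=8, overlap=4):
    """Step 3: overlapping fragments (assumes the sequence divides evenly)."""
    step = frag_len - overlap
    return [seq[i:i + frag_len] for i in range(0, len(seq) - frag_len + 1, step)]

def halves(fragment):
    """Step 4: first and second linker of a fragment."""
    mid = len(fragment) // 2
    return fragment[:mid], fragment[mid:]

def module(first, second):
    """Step 5: top strand of the plasmid module carrying the two linkers."""
    return LEFT_SITE + first + SPACER + second + RIGHT_SITE

def digest(mod, overhang=4):
    """Top-strand fragment released from a module, plus the second linker,
    whose complement is left single-stranded on the bottom strand."""
    core = mod[len(LEFT_SITE):-len(RIGHT_SITE)]
    return core[:-overhang], core[-overhang:]

def store(text):
    seq = bits_to_bases(text_to_bits(text))
    pairs = [halves(f) for f in split_fragments(seq)]
    insert = "".join(digest(module(a, b))[0] for a, b in pairs)
    vector_linkers = (pairs[0][0], pairs[-1][1])
    return seq, pairs, insert, vector_linkers

seq, pairs, insert, vector_linkers = store("中国")
print(seq)             # GTTCGTAACGCTGGCC
print(insert)          # GTTCGGATCTCAGTAAGGATCTCACGCTGGATCTCA
print(vector_linkers)  # ('GTTC', 'GGCC')
```

The printed insert corresponds to SEQ ID NO.19, and the vector linkers are those of the plasmid selected as the empty vector.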

Data decoding

8. Extract the plasmid from step 7 above and submit it to a second-generation sequencing service provider for sequencing, obtaining the following universal sequence information between the restriction sites:

GGTCTCAGTTCGGATCTCAGTAAAGAGACC (SEQ ID NO.7)
CCAGAGTCAAGCCTAGAGTCATTTCTCTGG (SEQ ID NO.8)

GGTCTCAGTAAGGATCTCACGCTAGAGACC (SEQ ID NO.9)
CCAGAGTCATTCCTAGAGTGCGATCTCTGG (SEQ ID NO.10)

GGTCTCACGCTGGATCTCAGGCCAGAGACC (SEQ ID NO.11)
CCAGAGTGCGACCTAGAGTCCGGTCTCTGG (SEQ ID NO.12)

9. Remove the linker arm sequence from each sequence to obtain the left and right linkers "GTTC GTAA", "GTAA CGCT" and "CGCT GGCC".

10. Assemble the complete DNA sequence GTTCGTAACGCTGGCC (SEQ ID NO.3) from the left and right linker sequences.

11. Using the coding rule "A = 00, T = 01, G = 11, C = 10", convert the DNA sequence into "11010110110100001011100111111010".

12. Extract "中国" ("China") from the binary sequence.
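
Decoding steps 9 to 12 can be sketched in the same way. The sketch assumes that the reads are supplied in fragment order, that both linkers are 4 bases long, and that the table A = 00, T = 01, G = 11, C = 10 applies (the assignment that maps SEQ ID NO.3 back to the stored bit string); the function names are hypothetical.

```python
SPACER = "GGATCTCA"
LEFT_SITE = "GGTCTCA"
RIGHT_SITE = "AGAGACC"
# Assumed table: the assignment that maps SEQ ID NO.3 back to the stored bits.
BASE_TO_BITS = {"A": "00", "T": "01", "G": "11", "C": "10"}

def linkers_from_read(read, n1=4):
    """Step 9: strip the flanking sites and the linker arm, keep both linkers."""
    if not (read.startswith(LEFT_SITE) and read.endswith(RIGHT_SITE)):
        raise ValueError("read lacks the expected flanking sites")
    core = read[len(LEFT_SITE):-len(RIGHT_SITE)]
    first, rest = core[:n1], core[n1:]
    if not rest.startswith(SPACER):
        raise ValueError("linker arm sequence not found")
    return first, rest[len(SPACER):]

def assemble(pairs):
    """Step 10: chain pairs whose second linker equals the next first linker."""
    seq = pairs[0][0]
    for i, (first, second) in enumerate(pairs):
        if i > 0 and first != pairs[i - 1][1]:
            raise ValueError("linker pairs do not chain")
        seq += second
    return seq

def bases_to_text(seq, encoding="gb18030"):
    """Steps 11-12: bases -> bits -> bytes -> text."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)

reads = [
    "GGTCTCAGTTCGGATCTCAGTAAAGAGACC",  # SEQ ID NO.7
    "GGTCTCAGTAAGGATCTCACGCTAGAGACC",  # SEQ ID NO.9
    "GGTCTCACGCTGGATCTCAGGCCAGAGACC",  # SEQ ID NO.11
]
sequence = assemble([linkers_from_read(r) for r in reads])
print(sequence)  # GTTCGTAACGCTGGCC
decoded = bases_to_text(sequence)  # the two-character string stored in step 1
```

The chaining check in `assemble` uses the overlap between adjacent fragments as a consistency test: a read whose first linker does not match the previous second linker is rejected rather than silently joined.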

Effect of this simulated example: small amounts of DNA were taken from the previously prepared universal DNA library and used to store the information "中国" ("China") in DNA. Compared with conventional DNA storage technology, this effectively reuses DNA that has already been synthesized. Specifically, by using a small amount of pre-made DNA, this example stores the piece of information at a lower cost than conventional storage. In actual experiments, the amount of plasmid DNA withdrawn each time can be as low as the single-molecule level, which allows the cost of the method to be reduced substantially further. In addition, preparing plasmid DNA on a large scale lowers its cost further, which in turn lowers the price of DNA storage achieved with this workflow.

SEQUENCE LISTING

<110> Shenzhen Institute of Advanced Technology
<120> Data information storage method based on DNA molecule medium
<130> CP122010180E
<160> 20
<170> PatentIn version 3.3

<210> 1
<211> 6
<212> DNA
<213> Artificial sequence
<400> 1
ggtctc 6

<210> 2
<211> 6
<212> DNA
<213> Artificial sequence
<400> 2
ccagag 6

<210> 3
<211> 16
<212> DNA
<213> Artificial sequence
<400> 3
gttcgtaacg ctggcc 16

<210> 4
<211> 8
<212> DNA
<213> Artificial sequence
<400> 4
gttcgtaa 8

<210> 5
<211> 8
<212> DNA
<213> Artificial sequence
<400> 5
gtaacgct 8

<210> 6
<211> 8
<212> DNA
<213> Artificial sequence
<400> 6
cgctggcc 8

<210> 7
<211> 30
<212> DNA
<213> Artificial sequence
<400> 7
ggtctcagtt cggatctcag taaagagacc 30

<210> 8
<211> 30
<212> DNA
<213> Artificial sequence
<400> 8
ccagagtcaa gcctagagtc atttctctgg 30

<210> 9
<211> 30
<212> DNA
<213> Artificial sequence
<400> 9
ggtctcagta aggatctcac gctagagacc 30

<210> 10
<211> 30
<212> DNA
<213> Artificial sequence
<400> 10
ccagagtcat tcctagagtg cgatctctgg 30

<210> 11
<211> 30
<212> DNA
<213> Artificial sequence
<400> 11
ggtctcacgc tggatctcag gccagagacc 30

<210> 12
<211> 30
<212> DNA
<213> Artificial sequence
<400> 12
ccagagtgcg acctagagtc cggtctctgg 30

<210> 13
<211> 12
<212> DNA
<213> Artificial sequence
<400> 13
gttcggatct ca 12

<210> 14
<211> 12
<212> DNA
<213> Artificial sequence
<400> 14
cctagagtca tt 12

<210> 15
<211> 12
<212> DNA
<213> Artificial sequence
<400> 15
gtaaggatct ca 12

<210> 16
<211> 12
<212> DNA
<213> Artificial sequence
<400> 16
cctagagtgc ga 12

<210> 17
<211> 12
<212> DNA
<213> Artificial sequence
<400> 17
cgctggatct ca 12

<210> 18
<211> 12
<212> DNA
<213> Artificial sequence
<400> 18
cctagagtcc gg 12

<210> 19
<211> 36
<212> DNA
<213> Artificial sequence
<400> 19
gttcggatct cagtaaggat ctcacgctgg atctca 36

<210> 20
<211> 36
<212> DNA
<213> Artificial sequence
<400> 20
cctagagtca ttcctagagt gcgacctaga gtccgg 36

Claims

1. A method for preparing a recombinant plasmid containing universal information storage DNA, characterized in that the preparation method comprises the following steps:

S01) designing a universal information DNA sequence module, which comprises two groups of base linkers and a linker arm sequence between the two groups of base linkers, and connecting the two groups of base linkers and the linker arm sequence between them to obtain a base linker core sequence; adding type II restriction endonuclease linker sequences to the left and right ends of the base linker core sequence to obtain a universal information DNA sequence module sequence; the base linker is a set formed by any combination of bases, and the base linker is used to record information; the linker arm sequence is a known and determined base sequence; the linker arm sequences of the recombinant plasmids in the same recombinant plasmid library are the same sequence;

S02) synthesizing a universal information DNA sequence module and inserting it into a plasmid vector to obtain a recombinant plasmid;

S03) introducing the recombinant plasmid into a host bacterium for culture and preservation, or extracting the recombinant plasmid containing the universal information DNA sequence from the host bacterium for preservation.
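
As an illustration of the layout described in step S01), the following Python sketch joins the parts of a module in the claimed order: left restriction linker, first base linker, linker arm, second base linker, right restriction linker. The function name `build_module` and every sequence shown are hypothetical placeholders chosen for this example; none is taken from the sequence listing.

```python
def build_module(left_site, linker1, arm, linker2, right_site):
    """Concatenate the parts of a universal information DNA sequence module.

    The core is base linker 1 + linker arm + base linker 2; the two
    restriction endonuclease linker sequences flank the core (step S01).
    """
    for part in (left_site, linker1, arm, linker2, right_site):
        if set(part) - set("ACGT"):
            raise ValueError(f"non-DNA character in {part!r}")
    core = linker1 + arm + linker2
    return left_site + core + right_site


# Hypothetical placeholder sequences, used only for illustration.
LEFT_SITE = "GGTCTC"   # assumed left restriction linker
RIGHT_SITE = "GAGACC"  # assumed right restriction linker
ARM = "ACGTACGT"       # assumed linker arm, shared by the whole library

module = build_module(LEFT_SITE, "ACGT", ARM, "TGCA", RIGHT_SITE)
print(module)
```

Only the two base linkers vary from plasmid to plasmid in this sketch; the flanking sites and the linker arm stay fixed, which mirrors the requirement that the linker arm sequence is identical within one library.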

2. A recombinant plasmid containing universal information storage DNA, characterized in that the recombinant plasmid is prepared by the preparation method described in claim 1.

3. A recombinant plasmid library, characterized in that the recombinant plasmid library comprises a plurality of recombinant plasmids containing universal information storage DNA according to claim 2;

The recombinant plasmid library contains 4^(n1+n2) types of recombinant plasmids containing universal information storage DNA with different base linkers, where n1 and n2 are the base lengths of the first base linker and the second base linker, respectively, in the two groups of base linkers; n1 and n2 are each independently selected from any integer from 3 to 10.
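
The library size stated in claim 3 can be checked with a short Python sketch. The helper names `library_size` and `enumerate_linker_pairs` are hypothetical and exist only for this example; the sketch simply counts and enumerates every pair of base linkers of lengths n1 and n2.

```python
from itertools import product


def library_size(n1, n2):
    """Number of distinct (first linker, second linker) pairs: 4^(n1+n2)."""
    if not (3 <= n1 <= 10 and 3 <= n2 <= 10):
        raise ValueError("n1 and n2 must each be between 3 and 10")
    return 4 ** (n1 + n2)


def enumerate_linker_pairs(n1, n2):
    """Yield every (first linker, second linker) pair of the given lengths."""
    for bases in product("ACGT", repeat=n1 + n2):
        seq = "".join(bases)
        yield seq[:n1], seq[n1:]


print(library_size(3, 3))    # smallest library allowed by the claim
print(library_size(10, 10))  # largest library allowed by the claim
```

With n1 = n2 = 3 the library holds 4096 plasmid types; with n1 = n2 = 10 it holds 1,099,511,627,776, so full enumeration is practical only for short linkers.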

4. A DNA storage method for data, characterized in that the DNA storage method comprises the following steps:

S11) extracting binary information of computer data information to be stored;

S12) converting the binary information into a base sequence for information recording according to a preset mapping relationship;

S13) splitting the information recording base sequence into short base fragments, with overlapping base sequences between adjacent short base fragments; or,

with overlapping base sequences between the short base fragments composed of adjacent odd-numbered bit sequences and between the short base fragments composed of adjacent even-numbered bit sequences;

S14) splitting each of the short base fragments obtained by splitting in step S13) from the middle, so that each short base fragment is split into a first short base fragment and a second short base fragment;

S15) preparing, by the preparation method of claim 1, the required recombinant plasmids having a universal information DNA sequence for information recording, wherein the first base linker on each such recombinant plasmid is identical to the first short base fragment obtained in step S14) and the second base linker is identical to the second short base fragment obtained in step S14), thereby obtaining recombinant plasmids having a universal information DNA sequence corresponding to all the short base fragments split in step S13);

S16) using restriction endonucleases to digest the recombinant plasmid obtained in step S15), collecting the digested fragments, and connecting them into a complete information recording fragment using a ligase;

S17) preparing, by the preparation method of claim 1, a recombinant plasmid having a universal information DNA sequence for use as a vector, wherein the first base linker on this recombinant plasmid is identical to the first short base fragment of the 5'-end short base fragment of the information recording base sequence obtained in step S13), and the second base linker is identical to the second short base fragment of the 3'-end short base fragment of the information recording base sequence obtained in step S13);

S18) using restriction endonucleases to digest the recombinant plasmid obtained in step S17), collecting the digested plasmid as a vector plasmid, inserting the information recording fragment obtained in step S16) into the vector plasmid, and obtaining a recombinant plasmid for recording information;

In step S11), the computer data information is any data information stored on a computer, selected from pictures, text, programs, audio, and video.
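
Steps S11) to S14) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the mapping or fragment layout used in the patent: the 2-bit-per-base table, the fragment length, and the overlap length are all hypothetical choices, as are the function names.

```python
# Assumed mapping; the claim only requires "a preset mapping relationship".
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_bases(data: bytes) -> str:
    """S11 + S12: read the bits of the data and map each bit pair to a base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def split_with_overlap(seq: str, frag_len: int, overlap: int) -> list:
    """S13: cut seq into fragments; adjacent fragments share `overlap` bases."""
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frag_len")
    step = frag_len - overlap
    fragments, start = [], 0
    while True:
        fragments.append(seq[start:start + frag_len])
        if start + frag_len >= len(seq):
            break
        start += step
    return fragments


def split_in_middle(fragment: str) -> tuple:
    """S14: split one short fragment into a first and a second half."""
    mid = len(fragment) // 2
    return fragment[:mid], fragment[mid:]


bases = bytes_to_bases(b"Hi")
fragments = split_with_overlap(bases, frag_len=4, overlap=2)
halves = [split_in_middle(f) for f in fragments]
print(bases, fragments, halves)
```

In this sketch the overlap equals half the fragment length, so the second half of each fragment matches the first half of the next one. That is a property of the chosen parameters, not something the claim prescribes.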

5. A DNA storage method for data, characterized in that the DNA storage method comprises the following steps:

S11) extracting binary information of computer data information to be stored;

S15) selecting from the recombinant plasmid library of claim 3 to obtain the required recombinant plasmid with universal information DNA sequence for information recording; the first base linker on the recombinant plasmid with universal information DNA sequence for information recording is the same as the first short base fragment designed in step S14), and the second base linker is the same as the second short base fragment obtained in step S14); obtaining the recombinant plasmid with universal information DNA sequence corresponding to all the short base fragments split in step S13);

S17) selecting from the recombinant plasmid library of claim 3 to obtain a recombinant plasmid with a universal information DNA sequence for a vector; the first base linker on the recombinant plasmid with a universal information DNA sequence for a vector is the same as the first short base fragment of the 5' end short base fragment of the information recording base sequence designed in step S13), and the second base linker is the same as the second short base fragment of the 3' end short base fragment of the information recording base sequence designed in step S13);

6. The DNA storage method according to claim 4 or 5, characterized in that the recombinant plasmid recording information obtained in step S18) is directly freeze-dried and stored in vitro.

7. The DNA storage method according to claim 4 or 5, characterized in that the recombinant plasmid recording information obtained in step S18) is transferred into a host bacterium for replication and preservation; the host bacterium is any microorganism that can replicate and preserve the recombinant plasmid.

8. The DNA storage method according to claim 4 or 5, characterized in that the recombinant plasmid recording information obtained in step S18) is transferred into a host bacterium, replicated, and then extracted and stored in the form of freeze-dried powder; the host bacterium is any microorganism that can replicate and preserve the recombinant plasmid.

9. The DNA storage method according to claim 4 or 5, characterized in that, in step S13), when the base sequence is split into multiple short base fragments and the terminal short base fragment is shorter than the other short base fragments, repeated bases are used to pad it to the same length as the other short base fragments.
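
The padding rule of claim 9 can be sketched in Python as below. The claim does not say which base is repeated; the choice of "A" and the function name `pad_terminal_fragment` are assumptions made for this example.

```python
def pad_terminal_fragment(fragments, pad_base="A"):
    """Pad the last fragment with a repeated base to match the others.

    The target length is taken from the first fragment, on the assumption
    that all non-terminal fragments share one length.
    """
    if not fragments:
        return []
    target = len(fragments[0])
    last = fragments[-1]
    if len(last) > target:
        raise ValueError("terminal fragment is longer than the others")
    return fragments[:-1] + [last + pad_base * (target - len(last))]


padded = pad_terminal_fragment(["ACGT", "GTCA", "CA"])
print(padded)
```

A decoder has to tell padding from data, for example by recording the original sequence length separately; the claim leaves that detail open.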

10. The DNA storage method according to claim 4 or 5, characterized in that, in step S12), before splitting the base sequence for information recording into short base fragments, the method further comprises splitting the base sequence for information recording into a plurality of sub-base sequences for information recording, and adding an index fragment before each sub-base sequence for information recording;

The index fragment sequence length is 3-100 bases;

The index fragment added before each sub-base sequence is different, and in S13), the sub-base sequences for information recording, each joined to a different index fragment, are separately split into short base fragments.
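
One way to generate distinct index fragments, as required by claim 10, is to write the ordinal number of each sub-sequence in base 4. The Python sketch below assumes this scheme, a fixed sub-sequence length, and the names `make_index` and `add_indexes`; the claim itself only requires that each index is 3 to 100 bases long and unique.

```python
BASES = "ACGT"


def make_index(i: int, index_len: int) -> str:
    """Encode the integer i as a fixed-length base-4 string over A, C, G, T."""
    if not 3 <= index_len <= 100:
        raise ValueError("index length must be between 3 and 100 bases")
    if not 0 <= i < 4 ** index_len:
        raise ValueError("index value does not fit in index_len bases")
    digits = []
    for _ in range(index_len):
        i, remainder = divmod(i, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def add_indexes(seq: str, chunk_len: int, index_len: int = 3) -> list:
    """Split seq into sub-sequences and prefix each with its own index."""
    chunks = [seq[k:k + chunk_len] for k in range(0, len(seq), chunk_len)]
    return [make_index(i, index_len) + chunk for i, chunk in enumerate(chunks)]


indexed = add_indexes("CAGACGGC", chunk_len=4)
print(indexed)
```

Because the index has a fixed length, a reader can strip it from each recovered sub-sequence and sort the sub-sequences back into their original order.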

11. The DNA storage method according to claim 10, characterized in that, after the information recording base sequence is split into sub-information recording base sequences, in S16), a complete information recording fragment is prepared corresponding to each sub-information recording base sequence;

After the information recording base sequence is split into sub-information recording base sequences, in S18), the information recording fragments prepared corresponding to each sub-information recording base sequence are respectively inserted into different vector plasmids to obtain a recombinant plasmid set for recording information.

12. A method for DNA access of data information, characterized in that the method comprises the following steps:

Step S1) using the DNA storage method of claim 4 or 5 to store data information in DNA;

Step S2) extracting the DNA, then analyzing and decoding it to obtain the stored data information, wherein step S2) includes the following steps:

Step S21): extracting the vector plasmid carrying the information recording fragment and reading it by sequencing;

Step S22): according to the sequencing data, removing the restriction sites and the linker arm sequences from the sequence to obtain the base sequence that records the digital information;

Step S23): converting the base sequence information obtained in step S22) into binary information;

Step S24): extracting the corresponding data information from the binary information.
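
Steps S22) to S24) can be sketched in Python as follows. The read layout (one left site, payload pieces separated by a known linker arm, one right site), the 2-bit-per-base mapping, and all sequences and function names are assumptions made for this example; the layout of a real sequencing read depends on the assembly and is not specified here.

```python
# Assumed mapping; it must be the inverse of whatever mapping was used to encode.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def strip_known_sequences(read, left_site, right_site, arm):
    """S22: drop the flanking restriction sites and every linker arm copy.

    Assumes the payload never contains the arm sequence itself.
    """
    if not (read.startswith(left_site) and read.endswith(right_site)):
        raise ValueError("read is not flanked by the expected sites")
    inner = read[len(left_site):len(read) - len(right_site)]
    return "".join(inner.split(arm))


def bases_to_bytes(seq: str) -> bytes:
    """S23 + S24: map bases back to bits, then pack the bits into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    if len(bits) % 8:
        raise ValueError("base sequence does not encode whole bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# Hypothetical read: site + payload + arm + payload + site.
read = "GGTCTC" + "CAGA" + "TTTTTTTT" + "CGGC" + "GAGACC"
payload = strip_known_sequences(read, "GGTCTC", "GAGACC", "TTTTTTTT")
print(payload, bases_to_bytes(payload))
```

Splitting on the arm sequence is the simplest possible approach and fails if the payload happens to contain the arm; a position-based parser that uses the known fragment lengths would avoid that ambiguity.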

13. Use of the recombinant plasmid according to claim 2 in preparing a recombinant plasmid for storing information in DNA.

14. Use of the recombinant plasmid library according to claim 3 in preparing recombinant plasmids for storing information in DNA.

15. A device for storing DNA data using a recombinant plasmid library containing universal information storage DNA, characterized in that the device comprises:

A binary sequence extraction unit, used to extract a binary sequence corresponding to the data to be stored;

A binary sequence-base sequence conversion unit, used to convert the binary sequence into a base sequence, or convert the base sequence into a binary sequence according to a preset mapping relationship;

A short base fragment splitting design unit, used to split the base sequence into short base fragments, wherein adjacent short base fragments have overlapping base sequences; or

the short base fragments composed of adjacent odd-numbered bit sequences and the short base fragments composed of adjacent even-numbered bit sequences have overlapping base sequences;

The short base fragment splitting design unit can further split each of the short base fragments obtained by splitting from the middle, so that each short base fragment is split into a first short base fragment and a second short base fragment;

A recombinant plasmid library unit, wherein the recombinant plasmid library comprises the recombinant plasmid containing the universal information storage DNA according to claim 2, and the recombinant plasmid library unit can also select, from the recombinant plasmids, all required recombinant plasmids for information recording and the recombinant plasmid for use as a vector, wherein the first base linker of each required recombinant plasmid for information recording is identical to the first short base fragment of a short base fragment obtained by the short base fragment splitting design unit, and the second base linker is identical to the second short base fragment of that short base fragment; the first base linker on the recombinant plasmid for use as a vector is identical to the first short base fragment of the 5'-end short base fragment of the base sequence obtained by the binary sequence-base sequence conversion unit, and the second base linker is identical to the second short base fragment of the 3'-end short base fragment of the base sequence obtained by the binary sequence-base sequence conversion unit;

The recombinant plasmid library enzyme digestion and splicing unit is used to digest and connect all the required information recording recombinant plasmids selected by the recombinant plasmid library unit into information recording fragments, and can digest the vector recombinant plasmid and connect the information recording fragments to obtain the recombinant plasmid;

A storage and replication unit, used for storing the recombinant plasmid obtained by the recombinant plasmid library enzyme digestion and splicing unit, or for replicating the recombinant plasmids of the recombinant plasmid library unit or the information-storing recombinant plasmid obtained by the recombinant plasmid library enzyme digestion and splicing unit;

The device also includes a decoding device;

The decoding device comprises:

A DNA sequence acquisition unit, used to acquire the DNA sequence in the recombinant plasmid storing information;

The data analysis unit is used to remove the restriction site and the linker arm sequence in the DNA sequence to obtain a base sequence that records digital information; and convert the base sequence into a binary sequence, and further convert it into storage data; wherein the linker arm sequence is a known and determined base sequence; and the recombinant plasmid linker arm sequences in the same recombinant plasmid library are the same sequence.