CN111748613A

CN111748613A - Design method and preparation method of double-label joint

Info

Publication number: CN111748613A
Application number: CN201910237765.XA
Authority: CN
Inventors: 郑建超; 汪宇盈; 羊光辉; 叶明芝
Original assignee: Huada Digital Biotechnology Shenzhen Co ltd
Current assignee: Huada Digital Biotechnology Shenzhen Co ltd
Priority date: 2019-03-27
Filing date: 2019-03-27
Publication date: 2020-10-09

Abstract

The invention discloses a design method and a preparation method of a double-label joint. The invention provides a kit for constructing a sequencing library of DNA molecules to be detected, which comprises a double-sample label joint. The double-sample label joint is formed by annealing a joint sequence L and a joint sequence S, and one end of the joint is used for connecting to a DNA molecule to be detected. The invention has the following advantages: 1) by introducing a new sample label at each end of the inserted DNA fragment, the number of sequencing-primer additions is reduced, which lowers the sequencing cost; 2) double-sample labeling can also be applied to single-ended sequencing projects without increasing the sequencing cost, which avoids false positives caused by sample label crosstalk. The joint design therefore allows double-sample labels to be read by single-ended sequencing and allows erroneous sequencing data produced by sample label crosstalk to be filtered out.

Description

Design method and preparation method of double-label joint

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a design method and a preparation method of a double-label joint.

Background

At present, high-throughput sequencing has become an important gene detection technology and is widely applied in scientific research, medical testing, agricultural breeding, forensic identification and other fields. The mainstream providers of high-throughput sequencing technology currently include Illumina, Thermo Fisher, PacBio, Oxford Nanopore (UK) and BGI (China). To reduce the average sequencing cost per sample, libraries from multiple samples are usually pooled and sequenced together in one run. During library construction, a sample label (index) is added to each sample so that the sequencing data can be split by sample according to the label, which achieves both high throughput and low cost. Sample labels have become an integral part of high-throughput sequencing technologies.

In practical application, the problem of sample label crosstalk (index crosstalk or index switching) is often encountered: data from other samples are found among the data assigned to a given sample label. This affects the accuracy of the sequencing data; for example, it causes false positive results in pathogenic microorganism detection and in low-frequency tumor mutation detection, and makes RNA quantification inaccurate. The main causes of sample label crosstalk include contamination during library adaptor synthesis, contamination during library construction, contamination during target-region capture, contamination during pre-amplification before sequencing, misreading of sample labels during sequencing, and sample residue left in the fluidic lines between two sequencing runs.

At present, the main scheme for solving the problem of sample label crosstalk is to adopt double-label sequencing, namely, sample labels are introduced into two ends of DNA to be detected simultaneously, and only data with correct two labels can enter the analysis of the next link during sequencing data analysis. Thus, the problem of sample label crosstalk can be greatly reduced or even avoided.
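The filtering rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not software from the patent: the record layout (index1, index2, insert sequence), the function name `filter_dual_index`, and the example tag strings are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (index1, index2, insert sequence) for one read.
Read = Tuple[str, str, str]


def filter_dual_index(
    reads: Iterable[Read],
    expected: Dict[str, Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Assign a read to a sample only if BOTH labels match that sample."""
    pair_to_sample = {pair: name for name, pair in expected.items()}
    kept: Dict[str, List[str]] = {name: [] for name in expected}
    for index1, index2, seq in reads:
        sample = pair_to_sample.get((index1, index2))
        if sample is not None:  # any other label combination is discarded
            kept[sample].append(seq)
    return kept


# Illustrative labels only (not taken from the patent's tables).
expected = {"S1": ("AAAAAA", "CACG"), "S2": ("CCCCCC", "GTGC")}
reads = [
    ("AAAAAA", "CACG", "ACGTACGT"),  # both labels agree -> S1
    ("AAAAAA", "GTGC", "TTTTGGGG"),  # labels disagree -> dropped as crosstalk
    ("CCCCCC", "GTGC", "GGGGCCCC"),  # both labels agree -> S2
]
result = filter_dual_index(reads, expected)
print(result)
```

A read carrying a valid label at only one end is dropped rather than rescued, which is what removes crosstalk at the cost of a small amount of data.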

At present, the library structure used for double-sample labels is shown in FIG. 1: a sequencing primer binding region lies between each sample label and the DNA to be detected. Illumina uses this scheme to filter sample data. After the library is loaded onto the sequencing chip, read1 and index1 are sequenced from one end. When this is completed, the sequencing template for the second end is copied and synthesized, and the sequencing primers for read2 and index2 are then added in turn with that strand as the template, so that both sample label sequences are finally read. If the two sample labels do not match the experimental design, the corresponding sequencing reads are deleted, which filters out the sample label crosstalk data.

The existing double-sample label design scheme has the following defects: 1) to acquire the data of both sample labels, index sequencing primers must be added twice, which increases the sequencing cost; 2) because the two sample labels are sequenced from the two different strands of the DNA library, a template strand must be synthesized before the second sample label can be sequenced, which increases the sequencing time; 3) current double-sample label designs are not compatible with single-ended sequencing.

Disclosure of Invention

In order to overcome the defects of the existing double-sample label, the invention provides the following technical scheme.

The invention provides a kit for constructing a DNA molecule sequencing library to be detected, which comprises a double-sample label joint;

the double-sample label joint is formed by annealing a joint sequence L and a joint sequence S; one end of the double-sample label joint is used for connecting a DNA molecule to be detected;

the double-sample label joints connected with the two ends of the DNA molecule to be detected are the same.

The DNA molecule to be detected can be a sticky-end DNA molecule or a flat-end DNA molecule; if it is a flat-end DNA molecule, it can be connected with the double-sample label joint after an A is added to its ends.

The joint sequence L sequentially comprises a region A which is complementary with the joint sequence S and a region C which is not complementary with the joint sequence S from the end close to the DNA molecule to be detected;

the region A sequentially consists of a second sample label sequence and a fragment B for annealing and complementation from the end close to the DNA molecule to be detected;

a binding region of a primer PF in a library building primer pair is arranged on the region C;

the joint sequence S sequentially comprises a region D which is complementary with the joint sequence L and a region E which is not complementary with the joint sequence L from the end close to the DNA molecule to be detected;

the region D consists of a complementary sequence of the second sample label sequence and a complementary sequence of the fragment B in sequence from the end near the DNA molecule to be detected;

and the region E comprises a binding region of the primer PR in the library-establishing primer pair from the end near the DNA molecule to be detected.

The kit also comprises the library building primer pair;

the library building primer pair consists of the primer PF and the primer PR;

the primer PR comprises, from the 5' end, a first sample tag sequence and a region which binds to the region E.

In the kit, the region E comprises a first sample tag sequence and a binding region of a primer PR in the library-building primer pair from the end close to a DNA molecule to be detected;

the kit also comprises the library building primer pair;

the library building primer pair consists of the primer PF and the primer PR;

the primer PR contains a region which binds to the region E and does not contain the first sample tag sequence.

In the kit, the length of the second sample tag sequence is greater than 3 nt;

or the length of the second sample label sequence is 3-10 nt. The second sample label may be any combination of 3 or more bases; 10 bases or more is not recommended, because the label bases are read as part of the sequencing read and a large amount of data would be wasted.
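The trade-off behind the recommended length range can be made concrete with simple arithmetic. The sketch below assumes a 100-base single-ended read (the read length used in Example 2) and assumes that every base of the label is read as part of that read; the function names are illustrative only.

```python
def tag_capacity(tag_len: int) -> int:
    """Number of distinct label sequences of a given length (4 bases per position)."""
    return 4 ** tag_len


def read_fraction_used(tag_len: int, read_len: int = 100) -> float:
    """Fraction of a read spent on the in-line label instead of the insert."""
    return tag_len / read_len


for n in (3, 4, 10):
    # length, number of possible labels, share of a 100-base read consumed
    print(n, tag_capacity(n), f"{read_fraction_used(n):.0%}")
```

A 3-base label already offers 64 combinations while consuming 3% of the read, whereas a 10-base label consumes 10% of every read.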

In the above kit, the double-sample label joint has a bubble-shaped or Y-shaped structure, or may have another structure; any design that introduces new sample labels adjacent to both ends of the inserted DNA is within the scope of the present invention.

In embodiments of the invention, the structure is a Y-shaped structure, wherein the complementary region of the two joint sequences forms the stem of the Y, and the non-complementary regions form the fork of the Y.

The other end of the double-sample label joint is bubble-shaped or is a free non-complementary double-stranded structure; or, the terminal base of the joint sequence L at the end near the DNA molecule to be detected is phosphorylated.

In the kit, the double-sample label adaptor is formed by annealing an adaptor sequence L shown in a sequence 1 and an adaptor sequence S shown in a sequence 3;

or the double-sample label joint is formed by annealing a joint sequence L shown in a sequence 2 and a joint sequence S shown in a sequence 4;

or the pair of the library-establishing primers consists of a primer shown in a sequence 5 and a primer shown in a sequence 6 or 7.
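The pairing between the listed joint sequences can be checked directly against the sequence listing. The sketch below is an illustration only: the helper names and the single unpaired 3' base on sequence S (assumed to be the T used for T-A connection) are assumptions of this example, and reading the first four bases of sequences 1 and 2 as the second sample label is an inference from comparing the two sequences, not a statement in the text.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stem_length(seq_l: str, seq_s: str, overhang: int = 1) -> int:
    """Count contiguous paired bases between the 5' end of L and the 3' end of S.

    `overhang` bases at the 3' end of S are assumed to be unpaired.
    """
    s_paired = seq_s[: len(seq_s) - overhang] if overhang else seq_s
    partner = revcomp(s_paired)  # what L must read 5'->3' to pair with S
    n = 0
    for a, b in zip(seq_l, partner):
        if a != b:
            break
        n += 1
    return n


SEQ1 = "cacgaagtcggaggccaagcggtcttaggaagacaa"  # sequence 1 (joint sequence L)
SEQ2 = "gtgcaagtcggaggccaagcggtcttaggaagacaa"  # sequence 2 (joint sequence L)
SEQ3 = "gaacgacatggctacgatccgacttcgtgt"        # sequence 3 (joint sequence S)
SEQ4 = "gaacgacatggctacgatccgacttgcact"        # sequence 4 (joint sequence S)

print(stem_length(SEQ1, SEQ3), stem_length(SEQ2, SEQ4))  # matched pairs
print(stem_length(SEQ1, SEQ4))                            # mismatched pair
```

Under these assumptions each matched pair gives a 12-base stem (a 4-base variable part plus an 8-base common part), while sequence 1 and sequence 4 do not pair at the connecting end at all.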

Another purpose of the invention is to provide a method for constructing a DNA molecule sequencing library to be tested by using the kit.

The method provided by the invention comprises the following steps:

when a double-sample label is introduced into the library, a second sample label sequence in the double-sample label is positioned between a DNA molecule to be detected and a sequencing primer binding region;

or when double-sample labels are introduced during library construction, the second sample label sequence in the double-sample labels is adjacent to the two ends of the DNA molecule to be detected.

The method comprises the following steps:

1) connecting the double-sample label joint with a DNA molecule to be detected to obtain a connection product;

the DNA molecules to be detected can be sticky-end DNA molecules to be detected or flat-end DNA molecules to be detected, and if the DNA molecules to be detected are flat-end DNA molecules to be detected, the double-sample label joint can be connected after A is added;

2) amplifying the ligation product by using the library building primer pair to obtain a DNA molecule sequencing library to be detected; and the second sample label sequence in the DNA molecule sequencing library to be detected is close to the two ends of the DNA molecule to be detected.

The application of the kit or the method in constructing a DNA molecule sequencing library to be detected is also within the protection scope of the invention;

or, the application of the kit or the method in single-ended sequencing of the DNA molecule to be detected is also within the protection scope of the invention;

or, the application of the kit or the method in double-end sequencing of the DNA molecule to be detected is also within the protection scope of the invention;

or, the application of the double-sample tag adaptor and the library-building primer corresponding to the double-sample tag adaptor in the construction of a DNA molecule sequencing library to be detected is also within the protection scope of the invention;

or, the application of the double-sample label joint and the corresponding library-establishing primer in single-ended sequencing of the DNA molecule to be detected is also within the protection scope of the invention;

or, the application of the double-sample tag adaptor and the corresponding library-establishing primer in double-end sequencing of the DNA molecule to be detected is also within the protection scope of the invention.

In the above application, the single-ended sequencing is non-invasive prenatal gene sequencing, pathogenic microorganism gene sequencing or RNA sequencing.

The invention realizes the sequencing of the double-sample label by adopting lower sequencing cost and sequencing time, and particularly realizes the high-efficiency filtration of the sample label crosstalk data on single-ended sequencing projects, such as noninvasive prenatal gene detection, pathogenic microorganism detection and the like, under the condition of not increasing the sequencing cost.

The invention has the following advantages: 1) by introducing a new sample label at each end of the inserted DNA fragment, the number of sequencing-primer additions is reduced, which lowers the sequencing cost; 2) double-sample labeling can also be applied to single-ended sequencing projects without increasing the sequencing cost, which avoids false positives caused by sample label crosstalk. The joint design therefore allows double-sample labels to be read by single-ended sequencing and allows erroneous sequencing data produced by sample label crosstalk to be filtered out.

Drawings

FIG. 1 shows a common design scheme and sequencing method for a double-sample tag adapter.

FIG. 2 is a schematic diagram of library construction results of the double-sample tag adaptor of the present invention.

FIG. 3 illustrates a method for implementing the dual sample label adapter design of the present invention.

Detailed Description

The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.

Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Example 1 design of double-sample tag linker and its application to sequencing library construction

First, double-sample label joint

1. Double-sample label joint and preparation of library building kit thereof

The second sample tag in the double-sample tag adaptor is located between the sequencing primer binding region and the inserted DNA fragment.

The structure of the double-sample label adaptor shown in fig. 3A is as follows:

the double-sample label joint A is a Y-shaped joint formed by annealing a joint sequence L-A and a joint sequence S-A;

one end of the double-sample label adaptor A forms a complementary flat end or a complementary sticky end from the complementary regions of the adaptor sequence L-A and the adaptor sequence S-A, and the other end forms a free non-complementary double strand from the non-complementary regions of the adaptor sequence L-A and the adaptor sequence S-A; the complementary flat end or the complementary sticky end is used for connecting the DNA molecule to be detected; the complementary sticky end is connected with the DNA molecule to be detected by T-A connection.

The joint sequence L-A sequentially consists of a region A-A which is complementary with the joint sequence S-A and a region C-A which is not complementary with the joint sequence S-A from the end close to the DNA molecule to be detected;

wherein the region A-A sequentially consists of a second sample label sequence and a fragment B (with the size of 7-15nt) from the end close to the DNA molecule to be detected;

the region C-A is provided with a binding region of a primer PF-A in the library-establishing primer pair A (in the embodiment of the invention, the binding region of the primer PF-A is on the region C-A and is far away from the tail end of the region A-A);

the terminal base of the joint sequence L-A at the end close to the DNA molecule to be detected is phosphorylated;

the length of the second sample label is more than 3nt, specifically 3-10 nt.

The joint sequence S-A sequentially consists of a region D-A which is complementary with the joint sequence L-A and a region E-A which is not complementary with the joint sequence L-A from the end close to the DNA molecule to be detected;

the region D-A consists of a second sample label sequence complementary sequence and a fragment B complementary sequence in sequence from the end close to the DNA molecule to be detected;

the region E-A sequentially comprises a first sample label sequence and a binding region of a primer PR-A in the library-establishing primer pair A from the end close to the DNA molecule to be detected (in the embodiment of the invention, the binding region of the primer PR-A is on the region E-A and far away from the tail end of the region D-A);

the length of the first sample tag sequence is more than 6 nt and less than 12 nt.

The library establishing primer pair A consists of a library establishing primer PF-A and a library establishing primer PR-A.

The library building kit containing the double-sample label joint comprises the double-sample label joint A and the library building primer pair A.

2. Double-sample tag adaptor preparation

The structure shown in fig. 3B is as follows:

the double-sample label joint B is a Y-shaped joint formed by annealing a joint sequence L-B and a joint sequence S-B;

one end of the double-sample label adaptor B forms a complementary flat end or a complementary sticky end by the complementary regions of the adaptor sequence L-B and the adaptor sequence S-B, and the other end forms a free non-complementary double strand by the non-complementary regions of the adaptor sequence L-B and the adaptor sequence S-B; and the complementary blunt end or the complementary cohesive end is used for connecting the DNA molecule to be detected;

the joint sequence L-B sequentially consists of a region A-B which is complementary with the joint sequence S-B and a region C-B which is not complementary with the joint sequence S-B from the end close to the DNA molecule to be detected;

wherein the region A-B consists of a second sample label sequence and a fragment B (with the size of 7-15nt) in sequence from the end close to the DNA molecule to be detected;

the region C-B is provided with a binding region of a primer PF-B in the library-establishing primer pair B (in the embodiment of the invention, the binding region of the primer PF-B is on the region C-B and is far away from the tail end of the region A-B);

the terminal base of the joint sequence L-B at the end close to the DNA molecule to be detected is phosphorylated;

the length of the second sample label is more than 3nt, specifically 3-10 nt.

The joint sequence S-B sequentially consists of a region D-B which is complementary with the joint sequence L-B and a region E-B which is not complementary with the joint sequence L-B from the end close to the DNA molecule to be detected;

the region D-B consists of a second sample label sequence complementary sequence and a fragment B complementary sequence in sequence from the end close to the DNA molecule to be detected;

the region E-B is provided with a binding region of a primer PR-B in the library-establishing primer pair B (in the embodiment of the invention, the binding region of the primer PR-B is on the region E-B and far away from the tail end of the region D-B), and does not contain the first sample tag sequence.

The library building primer pair B consists of a library building primer PF-B and a library building primer PR-B;

the library building primer PR-B comprises, from the 5' end, a first sample tag sequence and a region which binds to the region E-B.

The length of the first sample tag sequence is more than 6nt and less than 12 nt.
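The stated length range can be checked against the two primers given as sequences 6 and 7 in the sequence listing. The sketch below simply finds the part in which the two primers differ; interpreting that variable part as the first sample tag is an assumption of this illustration, and the helper names are hypothetical.

```python
SEQ6 = "tgtgagccaaggagttgatcggacctattgtcttcctaagaccgc"  # sequence 6
SEQ7 = "tgtgagccaaggagttggattccgtccttgtcttcctaagaccgc"  # sequence 7


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading run of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def variable_region(a: str, b: str):
    """Return (prefix_len, middle_of_a, middle_of_b, suffix_len)."""
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    return prefix, a[prefix:len(a) - suffix], b[prefix:len(b) - suffix], suffix


prefix, tag6, tag7, suffix = variable_region(SEQ6, SEQ7)
print(prefix, tag6, tag7, suffix)
```

For these two primers the shared 5' part is 17 bases, the shared 3' part is 18 bases, and the differing middle part is 10 bases long, which falls inside the range of more than 6 nt and less than 12 nt.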

A schematic diagram of the library construction results of the double-sample tag adapters of the present invention is shown in FIG. 2.

When single-ended sequencing is performed, index2, which lies immediately before the insert fragment, is read first, followed directly by the sequence of the insert fragment (namely Reads1); the sequencing primer of index1 is then added to read index1;

when double-end sequencing is performed, index2, which lies immediately before the insert fragment, is read first, followed directly by the sequence of the insert fragment (namely Reads1); the sequencing primer of index1 is then added to read index1; finally the sequencing primer of Reads2 is added, and the data generated likewise begin with the sequence information of index2 (the same joint is connected at both ends of the insert), followed by the Reads2 information of the inserted DNA.
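Because the second sample label is read in-line at the start of the read, splitting a read into label and insert is a matter of slicing. The sketch below is an assumption-laden illustration: the function name, the `spacer_len` parameter (for example a single base contributed by T-A connection), and the example read are hypothetical, and whether the read shows the label itself or its complement depends on the sequencing primer, which the text does not specify.

```python
from typing import NamedTuple, Optional


class ParsedRead(NamedTuple):
    index2: str
    insert: str


def split_inline_tag(read: str, tag_len: int, spacer_len: int = 0) -> Optional[ParsedRead]:
    """Split a read that begins with the in-line second sample label.

    spacer_len is a hypothetical number of bases between label and insert.
    Returns None when the read is too short to contain any insert bases.
    """
    if len(read) <= tag_len + spacer_len:
        return None
    return ParsedRead(read[:tag_len], read[tag_len + spacer_len:])


# Illustrative read: 4-base label, 1 spacer base, then 8 insert bases.
parsed = split_inline_tag("CACGTACGTACGT", tag_len=4, spacer_len=1)
print(parsed)
```

The label obtained this way can then be compared with index1 from the separate index read to decide whether the read is kept.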

Second, double-sample label joint construction sequencing library

1. Preparation of the DNA molecule to be determined

The DNA molecules to be tested are prepared by PCR with blunt or sticky ends.

2. Dual sample label joint connection

1) Connection by the method of FIG. 3A

The principle is as follows: the double-sample tag adaptor synthesized above is connected directly to the inserted DNA fragment by a ligation reaction, which yields a complete library directly (with the adaptor of FIG. 3A the PCR step can be omitted, so the method can be applied to PCR-free library construction; the disadvantage is that the adaptor is relatively long and therefore difficult to synthesize).

The scheme is as follows:

mixing the DNA molecule to be detected, the double-sample label joint A and T4 DNA ligase for a ligation reaction to obtain a ligation product;

and amplifying the ligation product by using the library-building primer pair A to obtain a DNA sequencing library.

2) Connection by the method of FIG. 3B

The principle is as follows: the newly added sample tags at both ends of the inserted DNA fragment are introduced by the ligation reaction, and a PCR amplification reaction is then carried out to introduce the sample tag of the conventional adapter, which finally yields a complete double-sample tag library (the adapter of FIG. 3B is short and easy to synthesize, and the index added later can be chosen according to the practical application, but the PCR step is required).

The scheme is as follows:

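mixing the DNA molecule to be detected, the double-sample label joint B and T4 DNA ligase for a ligation reaction to obtain a ligation product;

and amplifying the ligation product by using the library-building primer pair B to obtain a DNA sequencing library.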

Third, sequencing

The library is loaded onto the sequencer and sequenced.

Example 2 design of double-sample tag linker and its application to sequencing library construction

First, construct the double-sample tag adapter

1. Construction of double-sample tag linker sequences

Following the design scheme of Example 1, the linker sequences required for constructing the double-sample tag linker B were designed, as listed in Table 1;

the linker sequences in the table were synthesized by Beijing Liuhe BGI Technology Co., Ltd., with C18 DSL purification and an order quantity of 5 OD.

TABLE 1 sequences required for linker construction and amplification primer information

After the linker oligonucleotides were received, they were dissolved in TE buffer to give 100 μM stock solutions.

2. Preparation of double-sample Label adapters

Ad01L and Ad01S were mixed in equimolar amounts and prepared into a 25 μM adaptor mixture with TE buffer, left at room temperature for 30 minutes or more, and annealed into a partially double-stranded Y-shaped structure, designated Ad01M (double-sample tag adaptor B);

Ad02L and Ad02S were mixed in equimolar amounts and prepared into a 25 μM adaptor mixture with TE buffer, left at room temperature for 30 minutes or more, and annealed into a partially double-stranded Y-shaped structure, designated Ad02M (double-sample tag adaptor B).
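The volumes for the annealing mixture follow from the usual dilution relation C1·V1 = C2·V2. The sketch below assumes that "25 μM" refers to the final concentration of each strand (and therefore of the annealed adaptor) and that both strands start from the 100 μM stock solutions; the function name and the 100 μL batch size are illustrative only.

```python
def annealing_mix(final_um: float, stock_um: float, total_ul: float):
    """Return (volume of each strand stock, volume of TE buffer) in microlitres."""
    vol_each = final_um / stock_um * total_ul  # C1*V1 = C2*V2 for one strand
    buffer_ul = total_ul - 2 * vol_each        # two strands share the same tube
    if buffer_ul < 0:
        raise ValueError("stock is too dilute for the requested concentration")
    return vol_each, buffer_ul


# Hypothetical 100 uL batch: 25 uM of each strand from 100 uM stocks.
vol_each, buffer_ul = annealing_mix(final_um=25, stock_um=100, total_ul=100)
print(vol_each, buffer_ul)
```

Under these assumptions a 100 μL batch takes 25 μL of each strand stock and 50 μL of TE buffer.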

Second, double-sample label joint construction sequencing library

1. Preparation of the DNA molecule to be determined

Using lambda phage DNA as a template, the PCR primers shown in Table 2 were designed, and PCR amplification was performed to obtain λP1 and λP2 (blunt ends) as the DNA molecules to be detected.

TABLE 2 PCR primers for standards

The primers in the table were synthesized by Beijing Liuhe BGI Technology Co., Ltd., with C18 DSL purification and an order quantity of 5 OD. After the primers were received, they were dissolved in TE buffer to give 100 μM stock solutions and then diluted to 10 μM working solutions.

rTaq DNA polymerase (Shenzhen Huazhi Zhi science and technology Limited, 01K01201MS) was used for the PCR amplification; the reaction system and procedure are shown in Table 3.

TABLE 3 PCR reaction system and procedure

The PCR products were purified with 1.5× Ampure XP magnetic beads (Beckman, A63880), quantified, and diluted to a concentration of 1 ng/μL.

2. Dual sample label joint connection

The library was constructed using the KAPA Hyper Prep Kit (Kapa Biosystems, KR0961) as follows:

1) Dual sample label joint connection

10 ng of each purified PCR product (λP1 and λP2) was taken and an A was added using the kit; 1 μL of linker Ad01M or Ad02M (10 μM) was then added, respectively. To simulate sample tag crosstalk, 0.01 μL of Ad02M was added to the ligation reaction of Ad01M (and, as listed in Table 4, Ad01M was likewise mixed into the ligation reaction of Ad02M at 1%). This gave a linker Ad01M ligation product and a linker Ad02M ligation product.
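The size of the simulated contamination can be estimated from the volumes given above. The sketch below assumes that both linkers are at the same 10 μM concentration and that they ligate with equal efficiency; the function name is illustrative only.

```python
def spike_fraction(main_ul: float, spike_ul: float,
                   main_um: float = 10.0, spike_um: float = 10.0) -> float:
    """Molar fraction of the spiked-in linker among all linker molecules."""
    main = main_ul * main_um      # relative molar amount of the intended linker
    spike = spike_ul * spike_um   # relative molar amount of the contaminating linker
    return spike / (main + spike)


fraction = spike_fraction(main_ul=1.0, spike_ul=0.01)
print(f"{fraction:.2%}")  # 0.99%
```

Adding 0.01 μL to 1 μL therefore corresponds to slightly less than 1% of the linker molecules, which is the nominal "1%" contamination level of the experiment.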

2) Amplification with library-building primers

PF in Table 1 was mixed with an equal amount of PR01 or PR02, respectively, and 10 μM primer working solutions, designated P01M and P02M, were prepared in TE buffer as the PCR primers for library construction.

The linker Ad01M ligation product was used as the template and amplified with P01M to give the λP1 sequencing library.

The linker Ad02M ligation product was used as the template and amplified with P02M to give the λP2 sequencing library.

The experimental design is shown in table 4.

TABLE 4 Experimental design of the example

Library name	S1	S2
Insert DNA	λP1	λP2
Connecting joint	Ad01M	Ad02M
Joint mixed in at 1%	Ad02M	Ad01M
PCR primer	P01M	P02M

Third, sequencing

A BGISEQ-500 sequencer (MGI) was used in single-ended 100-base sequencing mode, and more than 10,000 reads were sequenced for each sample.

The reads generated by sequencing were analyzed and counted, as shown in Table 5. The statistics show that, for the simulated sample label contamination rate of 1%, the rates actually detected were 0.95% and 1.37%, respectively. In actual sample testing, only the unexpected sequencing data (such as Ad01_P02 and Ad02_P01 in Table 5) need to be filtered out and deleted to avoid false positives caused by sample tag crosstalk; that is, only reads whose two sample labels both match the expected combination are retained.
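A contamination rate of this kind can be computed from a table of read counts per label combination. The sketch below is illustrative only: the counts are invented placeholders (not the values of Table 5), and defining the rate per PCR primer as "reads with the unexpected joint label divided by all reads carrying that primer label" is an assumption, since the exact formula is not stated in the text.

```python
from typing import Dict


def crosstalk_rates(counts: Dict[str, int], expected: Dict[str, str]) -> Dict[str, float]:
    """Per-primer fraction of reads whose joint label is not the expected one.

    counts   maps "<joint>_<primer>" labels (e.g. "Ad02_P01") to read counts.
    expected maps each primer label to the joint label it should pair with.
    """
    totals = {primer: 0 for primer in expected}
    wrong = {primer: 0 for primer in expected}
    for label, n in counts.items():
        joint, primer = label.split("_")
        if primer not in expected:
            continue
        totals[primer] += n
        if joint != expected[primer]:
            wrong[primer] += n
    return {p: wrong[p] / totals[p] for p in expected if totals[p]}


# Hypothetical counts for illustration; NOT the data of Table 5.
counts = {"Ad01_P01": 9900, "Ad02_P01": 100, "Ad02_P02": 9800, "Ad01_P02": 200}
rates = crosstalk_rates(counts, {"P01": "Ad01", "P02": "Ad02"})
print(rates)
```

With these placeholder counts the function reports 1% for P01 and 2% for P02; the reads counted as unexpected are exactly the ones that would be deleted before downstream analysis.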

TABLE 5 number of sequencing reads

SEQUENCE LISTING

<110> Shenzhen Hua Dagen stock Limited Shenzhen Hua Dai clinical verification center

<120> design method and preparation method of double-label joint

<160>7

<170>PatentIn version 3.5

<210>1

<211>36

<212>DNA

<213>Artificial sequence

<400>1

cacgaagtcg gaggccaagc ggtcttagga agacaa 36

<210>2

<211>36

<212>DNA

<213>Artificial sequence

<400>2

gtgcaagtcg gaggccaagc ggtcttagga agacaa 36

<210>3

<211>30

<212>DNA

<213>Artificial sequence

<400>3

gaacgacatg gctacgatcc gacttcgtgt 30

<210>4

<211>30

<212>DNA

<213>Artificial sequence

<400>4

gaacgacatg gctacgatcc gacttgcact 30

<210>5

<211>17

<212>DNA

<213>Artificial sequence

<400>5

gaacgacatg gctacga 17

<210>6

<211>45

<212>DNA

<213>Artificial sequence

<400>6

tgtgagccaa ggagttgatc ggacctattg tcttcctaag accgc 45

<210>7

<211>45

<212>DNA

<213>Artificial sequence

<400>7

tgtgagccaa ggagttggat tccgtccttg tcttcctaag accgc 45

Claims

1. A kit for constructing a DNA molecule sequencing library to be detected comprises a double-sample label joint;

2. The kit of claim 1, wherein:

the kit also comprises the library building primer pair;

the library building primer pair consists of the primer PF and the primer PR;

3. The kit of claim 1, wherein: the region E comprises a first sample tag sequence and a binding region of a primer PR in the library building primer pair from the end close to the DNA molecule to be detected;

the kit also comprises the library building primer pair;

the library building primer pair consists of the primer PF and the primer PR;

4. The kit according to any one of claims 1 to 3, wherein:

the length of the second sample label sequence is more than 3 nt;

or the length of the second sample label sequence is 3-10 nt.

5. The kit according to any one of claims 1 to 4, wherein:

the double-sample label joint has a bubble-shaped or Y-shaped structure;

or, the terminal base of the joint sequence L at the end near the DNA molecule to be detected is phosphorylated.

6. The kit according to any one of claims 1 to 5, wherein:

the double-sample label joint is formed by annealing a joint sequence L shown in a sequence 1 and a joint sequence S shown in a sequence 3;

7. A method for constructing a sequencing library of test DNA molecules using the kit of claims 1-6, comprising the steps of:

8. The method of claim 7, wherein: the method comprises the following steps:

1) connecting the double-sample label adaptor of any one of claims 1-6 with the DNA molecule to be tested to obtain a ligation product;

2) amplifying the ligation products by using the pair of library-constructing primers of any one of claims 1 to 6 to obtain a sequencing library of the DNA molecules to be tested; and the second sample label sequence in the DNA molecule sequencing library to be detected is close to the two ends of the DNA molecule to be detected.

9. Use of a kit according to any one of claims 1 to 6 or a method according to claim 7 or 8 for constructing a sequencing library of test DNA molecules;

or, the use of a kit according to any one of claims 1 to 6 or a method according to claim 7 or 8 for single-ended sequencing of a DNA molecule to be tested;

or, the use of a kit according to any one of claims 1 to 6 or a method according to claim 7 or 8 for paired-end sequencing of a test DNA molecule;

or, the use of the double-sample tag adaptor and the corresponding pooling primer of any one of claims 1-6 for constructing a sequencing library of test DNA molecules;

or, the use of the double-sample tag adaptor of any one of claims 1-6 and the corresponding pool primer in single-ended sequencing of a DNA molecule to be tested;

or, the use of the double-sample tag adaptor and the corresponding pool primer of any one of claims 1-6 in paired end sequencing of a test DNA molecule.

10. Use according to claim 9, characterized in that: the single-ended sequencing is noninvasive prenatal gene sequencing, pathogenic microorganism gene sequencing or RNA sequencing.