HK40113932A - Short tandem repeat sequence sequencing and analysis methods

Info

Publication number: HK40113932A
Application number: HK42024100489.4A
Authority: HK
Inventors: 孙雷; 尚欢; 金欢
Original assignee: 深圳市真迈生物科技有限公司
Filing date: 2024-12-05
Publication date: 2025-02-28

Description

Short tandem repeat sequence sequencing methods and analysis methods

Technical Field

The invention relates to the field of biology, and specifically to methods for sequencing and analyzing short tandem repeat sequences.

Background

Short tandem repeats (STRs) are also known as microsatellite DNA or simple repeat sequences (SRS; also simple sequence repeats, SSR). Because the core unit (also called the "repeat unit" or "motif sequence") and its number of repetitions vary, the distribution of STRs differs greatly among ethnic groups, populations, and individuals, which constitutes the genetic polymorphism of STRs. In the human genome there is on average one STR locus every 15 to 20 kb, and approximately 8,000 STR loci are distributed across the 23 pairs of human chromosomes [Ma Wei, Zhang Liang, Li Yan, Xu Fei. Human Short Tandem Repeats (STRs) and Their Research Progress [J]. Journal of Dalian Medical University, 2007, 29(1):78-81.]. Given their polymorphism and wide distribution, STR loci can serve as highly specific and sensitive genetic markers. Combined with PCR, STR testing is widely used in genetic mapping, gene localization, forensic identification, prenatal diagnosis, and many other fields. In forensic identification, for example, several dozen autosomal STRs and Y-chromosome STRs are currently in routine use.

The main methods for detecting STRs are PCR and sequencing, and sequencing-based STR detection methods still need further research and improvement.

Summary of the Invention

The invention aims to solve, at least to some extent, at least one of the above technical problems, or to provide a practical STR detection scheme. In one aspect, the invention provides a short tandem repeat (STR) sequencing method. According to an embodiment of the invention, the method includes: obtaining a library whose insert contains a short tandem repeat sequence and a specific sequence linked to the short tandem repeat sequence; and performing sequencing by synthesis on the library on a solid-phase surface to obtain sequencing data comprising a plurality of reads. The solid-phase surface carries at least one of a first sequencing primer and a second sequencing primer. The first sequencing primer has the nucleotide sequence (S0)x, where S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence and x is not less than 1. The second sequencing primer has the nucleotide sequence S1-(S0)y, where S1 is the specific sequence or a sequence corresponding to the specific sequence, S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence, and y is greater than or equal to 0.

Conventional sequencing primers usually do not contain any of the sequence to be determined. On commercial sequencing-by-synthesis platforms, such as those of Illumina, MGI Tech, and Thermo Fisher, the sequencing primer is typically designed from the adapter at the end of the library and is a sequence that matches the adapter or part of it. Primer extension and the corresponding signal detection then identify the base types and their order for part or all of the insert (the sequence to be determined).

Unlike the conventional sequencing primers described above, the first sequencing primer rests on two observations that the inventors arrived at through analysis, prediction, and testing: the target STR consists of tandemly repeated units of 2 to 6 bases, and "misaligned hybridization" between short tandem repeat sequences occurs essentially at random. Specifically, the first sequencing primer is itself part of the sequence to be determined, for example a polynucleotide formed by several repeat units (motif sequences) of the target STR joined in tandem. Because testing showed that this primer hybridizes to different positions of a library containing the target STR with a degree of randomness (that is, misaligned hybridization), capturing or hybridizing the library with the first sequencing primer produces a series of hybridization complexes. Extending the first sequencing primer to sequence these complexes yields sequencing data containing multiple kinds of reads that correspond to the complexes, in other words to different positions within the target STR. Analysis of the sequencing data can then determine the complete sequence of the target STR, that is, the number of repeats of its motif sequence, and thereby genotype the STR.

Similarly, based on the same analysis, prediction, and experimental testing, the inventors designed the second sequencing primer. The second sequencing primer is, for example, a polynucleotide containing the specific sequence (or a sequence corresponding to it) followed by one or several repeat units (motif sequences) of the target STR in tandem. For example, when a second sequencing primer 5'-S1-(S0)y-3', immobilized on a designated surface by its 5' end, captures or hybridizes a library containing the target STR, the resulting hybridization complex is double-stranded at the 5' end of the primer. In other words, one end of the complementary region between the second sequencing primer and the library is fixed. The hybridization complex can then be sequenced to determine the sequence of the STR and thereby detect the STR.

More specifically, the second sequencing primer can be designed as a set of sequences of different lengths, each containing the specific sequence (or a sequence corresponding to it) together with an (S0)y of a different length, that is, a set of sequences in which y takes several different values. Capturing or hybridizing a library containing the target STR with this set yields a series of hybridization complexes that are double-stranded at the 5' end of the primer. Sequencing these complexes produces data containing multiple kinds of reads that correspond to the complexes, in other words to different positions within the target STR. Analysis of the sequencing data can then determine the complete sequence of the STR, that is, the number of repeats of its motif sequence. The STR is thereby genotyped accurately, with shorter sequencing time and lower sequencing cost, which gives the approach good application prospects.
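As an illustration only, the primer set described above can be sketched in Python. The flanking sequence `GGTC`, the function names, and the rule that a read reaches the far flank when its length exceeds the repeat bases left uncovered by the primer are assumptions made for this sketch; they are not details taken from the embodiments.

```python
def second_primer_set(s1: str, motif: str, y_values) -> dict:
    """Build one 5'-S1-(S0)y-3' primer per requested y (y >= 0)."""
    ys = list(y_values)
    if any(y < 0 for y in ys):
        raise ValueError("y must be greater than or equal to 0")
    return {y: s1 + motif * y for y in ys}


def reaches_far_flank(n_repeats: int, y: int, motif_len: int, read_len: int) -> bool:
    """Idealized prediction: does a read primed with y repeats run past the STR?

    Assumes the primer is anchored by S1, covers the first y repeats, and the
    read then extends read_len bases. A primer with y > n_repeats would not
    pair fully with the template; that case is not modeled here.
    """
    remaining_bases = max(n_repeats - y, 0) * motif_len
    return read_len > remaining_bases


# "GGTC" is a hypothetical S1 used only for illustration.
primers = second_primer_set("GGTC", "AATG", [0, 2, 4])
print(primers[2])
```

Under these assumptions, primers with larger y start closer to the far end of the repeat region, so a short read is more likely to reach the specific sequence on the other side.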

In another aspect, the invention provides a method for determining a short tandem repeat sequence. According to an embodiment of the invention, the method includes: (i) obtaining sequencing data for the short tandem repeat sequence, the sequencing data having been obtained by the short tandem repeat sequencing method described above; (ii) aligning the sequencing data to a reference sequence so as to classify the reads into read-through reads and non-read-through reads, where a read-through read contains at least part of the specific sequence or of its complementary sequence, and a non-read-through read contains neither the specific sequence nor its complementary sequence; and (iii) determining the short tandem repeat sequence based on the read-through reads and the non-read-through reads.
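The classification in step (ii) can be sketched as follows. This is a minimal illustration that replaces alignment to a reference with plain string matching; it assumes each read starts on a motif boundary, and the flank `CCTGA` and the function name are hypothetical.

```python
def classify_read(read: str, motif: str, flank: str) -> str:
    """Label a read by whether it reaches the specific (flanking) sequence.

    Returns "through" if the read contains at least part of the flank,
    "non_through" if it consists only of repeat sequence, and
    "unclassified" otherwise.
    """
    i = 0
    m = len(motif)
    # Skip whole motif copies at the start of the read.
    while read[i:i + m] == motif:
        i += m
    rest = read[i:]
    # An empty remainder or a partial motif means the read ended inside the STR.
    # A remainder that could be either motif or flank is treated as motif here.
    if motif.startswith(rest):
        return "non_through"
    # The remainder overlaps the beginning of the flank.
    if flank.startswith(rest) or rest.startswith(flank):
        return "through"
    return "unclassified"


for r in ["AATGAATGCCT", "AATGAATGAA", "CCTGA", "AATGTTTT"]:
    print(r, classify_read(r, "AATG", "CCTGA"))
```

A read that is entirely specific sequence (the third example) is also labeled as reaching the flank, which matches the two kinds of such reads described later in the text.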

In yet another aspect, the invention provides a sequencing chip. According to an embodiment of the invention, the sequencing chip includes: a substrate having a surface; and a first sequencing primer immobilized on the surface, the first sequencing primer having the nucleotide sequence (S0)x, where S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence and x is not less than 1.

In yet another aspect, the invention provides a sequencing chip. According to an embodiment of the invention, the sequencing chip includes: a substrate having a surface; and a second sequencing primer immobilized on the surface, the second sequencing primer having the nucleotide sequence S1-(S0)y, where S1 is a specific sequence or a sequence corresponding to the specific sequence, S0 is the motif sequence of a short tandem repeat sequence or its complementary sequence, the specific sequence is capable of indicating the position of the short tandem repeat sequence on a reference sequence, and y is greater than or equal to 0.

In yet another aspect, the invention provides a kit. According to an embodiment of the invention, the kit includes: (1) the first sequencing primer and/or the second sequencing primer of the short tandem repeat sequencing method described above; or (2) the chip described above.

In yet another aspect, the invention provides a method for identifying an individual. According to an embodiment of the invention, the method includes: determining, by the short tandem repeat sequencing method described above, the sequence of a short tandem repeat sequence in a sample to be tested, the sample containing nucleic acid; and determining, based on that sequence, the one or more individuals from whom the sample originates.

Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will be apparent from the description or may be learned by practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the drawings, in which:

Figure 1 shows a schematic diagram of the structure of an STR amplification product according to an embodiment of the invention;

Figure 2 shows a schematic flowchart of the short tandem repeat sequencing method according to an embodiment of the invention;

Figure 3 shows a schematic diagram of the principle of misaligned hybridization according to an embodiment of the invention;

Figure 4 shows a schematic diagram of sequencing library construction according to an embodiment of the invention;

Figure 5 shows a schematic diagram of the structure of a sequencing library according to an embodiment of the invention;

Figure 6 shows a schematic diagram analyzing the relationship between the read-through ratio and the number of repeats according to an embodiment of the invention;

Figure 7 shows a schematic diagram of the distribution obtained by simulation with 1 M reads when the read length follows a Gaussian distribution with avg = 40 and std = 5, according to an embodiment of the invention;

Figure 8 shows a schematic diagram of the distribution obtained by simulation with 10 M reads when the read length follows a Gaussian distribution with avg = 40 and std = 5, according to an embodiment of the invention;

Figure 9 shows a schematic diagram of the distribution obtained by simulation with 1 M reads when the read length follows a Gaussian distribution with avg = 40 and std = 10, according to an embodiment of the invention;

Figure 10 shows a schematic diagram of the distribution obtained by simulation with 10 M reads when the read length follows a Gaussian distribution with avg = 40 and std = 10, according to an embodiment of the invention.

Detailed Description

Embodiments of the invention are described in detail below. The embodiments described below are exemplary, serve only to explain the invention, and should not be construed as limiting it.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance, or as implicitly specifying the number or order of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. Further, in the description of the invention, unless otherwise stated, "a plurality of" means two or more.

As used herein, a "chip" is a solid-phase substrate having a surface that can immobilize, attach, or carry the sample to be tested, for example a glass slide, a magnetic bead, or a reaction chamber with a space for holding liquid. Preferably it is a chamber with a space for holding liquid that supports biochemical reactions or detection on it, which may also be called a flow cell or flow chamber. The substrate or solid-phase substrate may be any solid support usable for immobilizing nucleic acid sequences, such as a nylon membrane, glass slide, plastic, silicon wafer, or magnetic bead.

As used herein, a "template strand" is the nucleic acid sequence to be determined. It is sometimes also called the "hybridizing strand", the counterpart of the immobilized strand: for example, a single-stranded or double-stranded DNA, or a hybridization complex, that is attached to a surface by complementary pairing with a probe or immobilized strand fixed on that surface.

As used herein, "sequencing" means sequence determination, generally the determination of the order of bases in a nucleic acid sequence. It includes DNA sequencing and/or RNA sequencing, and includes sequencing by synthesis (SBS). Sequencing by synthesis involves repeatedly incorporating one or more nucleotides (including nucleotide analogs) onto the template strand and collecting the corresponding reaction signals. Sequencing can be performed on a sequencing platform, including a single-molecule sequencing platform.

As used herein, a "motif sequence", which may also be called a "repeat sequence", is the repeat unit of a short tandem repeat sequence and usually consists of 2 to 6 bases. As used herein, the "repeat number" of a short tandem repeat sequence, also called the "cycle number" or "motif number", is the number of motif sequences in the short tandem repeat sequence.

As used herein, "reads" are the base sequences obtained by sequencing.

As used herein, a "primer" or "sequencing primer", sometimes also called an immobilized strand or probe, is a nucleotide fragment whose sequence is predefined or known. Preferably, the primer is single-stranded. In some embodiments the primer may be double-stranded, in which case it can be treated beforehand to separate the strands before the extension product is prepared. In some embodiments, the primer is selected from RNA, DNA, PNA, or LNA.

Short tandem repeat sequencing method

In one aspect, the invention provides a short tandem repeat sequencing method. According to an embodiment of the invention, the method includes: obtaining a library whose insert contains a short tandem repeat sequence and a specific sequence linked to the short tandem repeat sequence; and performing sequencing by synthesis on the library on a solid-phase surface to obtain sequencing data for the short tandem repeat sequence, where the chip carries the first sequencing primer and the second sequencing primer, either together or each independently.

As used herein, the "specific sequence" or the "sequence corresponding to the specific sequence" is a sequence capable of indicating the position of the short tandem repeat sequence on the reference sequence. It may be, for example, a relatively conserved stretch of sequence upstream and/or downstream of the target STR on the reference sequence. This positional indication is generally established by sequence alignment, and analyzing the sequencing data to determine the sequence of the target STR includes establishing it in this way. Specifically, the specific sequence serves as a marker for identifying the composition of the STR sequence it is linked to: the repeat number of the STR sequence can be determined from whether the specific sequence is reached during sequencing. The specific sequence can also be used to identify the position of the linked STR sequence. Taking the human genome as an example, the same STR sequence may occur at different positions in the same gene, in different genes, or on different chromosomes, so sequencing results for the STR alone may not distinguish STR sequences at different positions. Because each STR sequence is linked to a specific sequence, sequencing and analyzing the STR sequence together with the specific sequence allows STR sequences at different positions to be distinguished accurately.

As used herein, the reference sequence is a predetermined sequence that contains the chromosome number and the position of the target locus (here, the short tandem repeat sequence) on the chromosome. The reference sequence may be a DNA and/or RNA sequence determined and assembled in advance by the user, or one determined and published by others. It may be any previously obtained reference template for the biological category to which the target individual belongs, for example all or at least part of a published genome assembly for that category. If the target individual is human, the reference sequence may be all or part of a human genome reference sequence (also called a reference genome), such as HG19 provided by the NCBI database. Sequence alignment, or alignment, here includes both the process of locating a read or part of a read on the reference sequence and the process of obtaining the location or matching result for the read or part of the read.

Detecting STRs with this sequencing method places no requirement on read length: in principle, any sequencing platform whose read length exceeds one motif sequence (generally 2 to 6 bp) can be used. In addition, the hybridization products do not need to be amplified after surface hybridization and can be sequenced directly.

The invention does not strictly limit how the specific sequence is linked to the STR sequence: the two may be joined directly or separated by a short spacer sequence. The spacer should not be too long. Otherwise the reads may consist entirely of spacer sequence, or the portion of the specific sequence that is read may be too short for the specific sequence to be identified accurately. The short tandem repeat sequence may be linked to a specific sequence at one end or at both ends. Taking the Th01 locus sequencing library in Figure 1 as an example, (AATG)n is the target STR sequence (the Th01 locus), AATG is the motif sequence, and the sequences flanking the repeat region on both sides are the specific sequences. Specific primers are used to amplify the target STR region, producing an amplification product that has a specific sequence at at least one of its 5' and 3' ends, and a library is then built from the amplification product to obtain the sequencing library. A sequencing library generally comprises the insert (the fragment to be sequenced) and adapters. During library construction, one or more tags (index or barcode) can be introduced at one or both ends of the insert so that multiple sequencing libraries can be pooled and sequenced on the same flow cell or in the same lane, increasing sequencing throughput.

According to an embodiment of the invention, the first sequencing primer has the nucleotide sequence (S0)x, where S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence and x is not less than 1.

Referring to Figure 2, because the sequencing library contains multiple copies of the repeat unit, when the library is sequenced as the template strand, the first sequencing primer (STR)x undergoes misaligned hybridization with the template strand. That is, the first sequencing primer hybridizes at random to any part of the short tandem repeat sequence in the library, giving different hybridization patterns. Because the sequencing length is limited, two outcomes arise: the specific sequence is reached ("read-through", yielding read-through reads) or the specific sequence is not reached ("non-read-through", yielding non-read-through reads). Analyzing the read-through reads and non-read-through reads provides sequence information for the short tandem repeat sequence, such as its repeat number.
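To make the link between repeat number and the share of reads that reach the specific sequence concrete, the following sketch enumerates the possible primer positions. It is an idealized model, not the inventors' analysis: it assumes every in-register primer position is equally likely, a single fixed read length, and extension toward one flank only. The function name and the example numbers are hypothetical.

```python
def through_fraction(n_repeats: int, x: int, motif_len: int, read_len: int) -> float:
    """Fraction of primer positions whose read reaches the flanking sequence.

    The primer (S0)x can sit on repeats k .. k+x-1 for k = 0 .. n_repeats-x.
    A read starting after the primer must cross the remaining repeats before
    it reaches the flank, so it counts as reaching the flank only when
    read_len exceeds the remaining repeat bases.
    """
    if x < 1 or x > n_repeats:
        raise ValueError("need 1 <= x <= n_repeats")
    offsets = range(n_repeats - x + 1)
    through = 0
    for k in offsets:
        remaining_bases = (n_repeats - x - k) * motif_len
        if read_len > remaining_bases:
            through += 1
    return through / len(offsets)


# Same primer and read length, two alleles with different repeat numbers.
print(through_fraction(6, 3, 4, 12))
print(through_fraction(10, 3, 4, 12))
```

In this model the fraction falls as the repeat number grows, which is the kind of dependence an analysis would exploit to infer the repeat number from the observed ratio.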

As used herein, "read-through" is the case in which the specific sequence is reached: at least part of the specific sequence linked to the short tandem repeat sequence is sequenced, and the base sequence obtained is a "read-through read". Read-through reads are of two kinds: those made up of specific sequence together with part of the STR sequence, and those made up entirely of specific sequence.

As used herein, "non-read-through" is the case in which the specific sequence is not reached: the specific sequence linked to the short tandem repeat sequence is not sequenced, and the base sequence obtained is a "non-read-through read".

The method according to embodiments of the invention can therefore achieve accurate, high-throughput sequencing with short reads, saving sequencing time and cost.

According to an embodiment of the invention, the first sequencing primer carried on the solid-phase surface is represented as L-(S0)x, where L is a linker group that attaches the first sequencing primer to the solid-phase surface. (S0)x is thus immobilized on the solid-phase surface through the linker group L.

According to an embodiment of the invention, x is 3 to 20, or 3 to 10. For example, x is 3, 4, 4.5, 5, 6, 6.2, 7, 8, 9, 9.7, 10, 11, 12, 13, 13.2, 14, 14.9, 15, 16, 17, 18, or 19.

It should be noted that x in the first sequencing primer (S0)x may be an integer, in which case the primer matches any one or more complete motif sequences on the template strand. It may also be a non-integer, in which case the primer matches one or more complete, contiguous motif sequences on the template strand plus part of an adjacent motif sequence. An integer is preferred.
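One way to read a non-integer x is as a primer length of x motif lengths, so that the primer ends partway through a motif. The sketch below builds such a primer. Rounding x times the motif length to the nearest whole base, and placing the partial motif at the 3' end, are assumptions made here for illustration; the text does not specify either.

```python
def first_primer(motif: str, x: float) -> str:
    """Build (S0)x, allowing a fractional x that ends with a partial motif."""
    if x < 1:
        raise ValueError("x must not be less than 1")
    n_bases = round(x * len(motif))            # assumed rounding rule
    copies = n_bases // len(motif) + 1         # enough whole copies to slice from
    return (motif * copies)[:n_bases]


print(first_primer("AATG", 3))    # three complete motifs
print(first_primer("AATG", 4.5))  # four complete motifs plus "AA"
```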

The invention does not strictly limit the chip temperature during the sequencing-by-synthesis reaction; it can be chosen flexibly according to the circumstances and may be, for example, 37 °C. The binding temperature of the first or second sequencing primer to the template strand is preferably higher than the reaction temperature of the chip, so that the duplex does not melt and disrupt the sequencing reaction.

According to an embodiment of the invention, the library is not amplified on the chip before the sequencing-by-synthesis reaction is performed on the library using the chip.

According to an embodiment of the invention, the sequencing-by-synthesis reaction is performed on a single-molecule sequencing platform.

Running sequencing by synthesis on a single-molecule sequencing platform includes the following: placing the hybridization products under conditions suitable for polymerization, which includes passing fluorescently labeled reversible-terminator bases through the flow cell into the detection region; extending the sequencing primer under the catalysis of DNA polymerase so that a reversible-terminator base is incorporated into the template strand/hybridization complex; and detecting the corresponding fluorescent signal. These steps are repeated multiple times to determine the sequence of the DNA template.

Referring to Figure 3, common second-generation sequencing technologies involve no misaligned hybridization: a cluster (a clonal cluster containing thousands of identical copies of the sequence to be determined) produces a single signal that reports one base at the same position in all of the identical molecules in the cluster. On the current mainstream high-throughput sequencing platforms, misaligned hybridization is highly detrimental to sequencing and analysis. Once it occurs, the primer hybridizes at multiple positions across the molecules in a cluster, so that in a single extension step several different DNA positions within the cluster are extended and emit their own signals. The extension signals then become difficult to tell apart, the corresponding bases are difficult to identify, and sequencing fails. It is therefore preferable to use single-molecule sequencing, in which the molecules to be sequenced are not amplified into clusters on the surface. Signals from multiple DNA copies are then never mixed, and accurate sequencing results can be obtained.

According to an embodiment of the present invention, the second sequencing primer has the following nucleotide sequence: S1-(S0)y, where S1 is the specific sequence or a sequence corresponding to the specific sequence, S0 is the motif sequence of the short tandem repeat or its complementary sequence, and y is greater than or equal to 0. S1 in the second sequencing primer can bind to the specific sequence on the template strand and thereby fix the hybridization position, and sequencing starts from the (S0)y end of the primer. Because the sequencing length is limited, two situations occur: the specific sequence is reached (yielding a test-through read) or the specific sequence is not reached (yielding a non-test-through read). Analyzing the test-through reads and non-test-through reads reveals sequence information about the short tandem repeat, such as its repeat number.

According to an embodiment of the present invention, the solid-phase surface also carries the second sequencing primer, denoted L-S1-(S0)y, where L attaches the second sequencing primer to the solid-phase surface. The solid-phase surface thus carries both the first sequencing primer and the second sequencing primer, and S1-(S0)y is attached to the surface through the linker group L.

According to an embodiment of the present invention, L does not contain an oligonucleotide. According to another embodiment, L contains an oligonucleotide. According to yet another embodiment, L contains at least one of a polynucleotide, a polyamino acid, polyethylene glycol, and a polyethylene glycol-polyamino acid copolymer.

According to an embodiment of the present invention, S1 is derived from at least one of the following: (a) the sequence upstream of the short tandem repeat; (b) the sequence downstream of the short tandem repeat; and (c) the complementary sequence of (a) or (b).

The term "derived from" should be interpreted broadly: S1 may be completely identical to the upstream or downstream sequence of the STR or to its complementary sequence, or it may differ from it by one or more substituted or missing bases, provided that S1 and (S0)y can simultaneously match the same strand (positive or negative) of the insert.

According to an embodiment of the present invention, for a given short tandem repeat, the solid-phase surface carries multiple second sequencing primers, each with a different value of y. Sequencing with this series of second sequencing primers and analyzing the test-through reads and non-test-through reads reveals information about the short tandem repeat, such as its repeat number.

It will be understood that neither the values of y nor the number of y values (corresponding to a series of second sequencing primers of different lengths) is particularly limited, provided that at least one y value yields a test-through read. For example, y can take consecutive values starting from 0 up to a value not less than the maximum repeat number of the STR locus in question; the maximum repeat number m of a given STR can be found, for instance, in public databases. To date, the STR loci commonly chosen for forensic applications generally contain fewer than 60 repeats of the repeat unit, so y can take values up to 60, 61 or more, to ensure that some second sequencing primer produces test-through reads in the sequencing data of the STR(s) concerned.

According to an embodiment of the present invention, the value of y is determined by the following formula: y = a + k × d, where a is a first predetermined constant and is an integer; d takes multiple different integer values in the range 0 to 100; and k is a second predetermined constant greater than 0.

According to an embodiment of the present invention, k is determined from the read length of the sequencing-by-synthesis reaction and/or the length of the motif sequence. Specifically, k increases as the sequencing length of the sequencing-by-synthesis reaction increases and decreases as the motif length increases. This reduces the number of y values needed: fewer kinds of second sequencing primer have to be designed or synthesized while the sequencing data still contain the corresponding test-through reads, which lowers the detection cost and is of considerable industrial value.

According to an embodiment of the present invention, k is determined by the following formula: k = ⌊b/t⌋, where b is the sequencing length of the sequencing-by-synthesis reaction and t is the length of the motif sequence. The function above is the floor function, so k is an integer not exceeding b/t. The maximum k for each STR locus can thus be derived from the locus and the sequencing platform used, and from it the minimum number of kinds of second sequencing primer required to test each locus. Sequencing-based genotyping of a specified STR can then rely on that minimum set of second sequencing primers, which markedly lowers the detection cost and is of considerable industrial value.

According to an embodiment of the present invention, a is an integer less than k, and/or d takes multiple consecutive integer values in the range 0 to 10.

According to an embodiment of the present invention, the solid-phase surface carries the second sequencing primer, denoted L-S1-(S0)y, where L is a linker group that attaches the second sequencing primer to the solid-phase surface. S1-(S0)y is immobilized on the solid-phase surface through the linker group L; S1 can bind to the specific sequence on the template strand and thereby fix the hybridization position, and sequencing starts from the (S0)y end of the primer. Because the sequencing length is limited, two situations occur: the specific sequence is reached (yielding a test-through read) or the specific sequence is not reached (yielding a non-test-through read). Analyzing the test-through reads and non-test-through reads reveals sequence information about the short tandem repeat, such as its repeat number.

According to an embodiment of the present invention, the value of y is determined by the following formula: y = a + k × d, where a is a first predetermined constant and is an integer; d takes multiple integer values in the range 0 to 100; and k is a second predetermined constant.

According to an embodiment of the present invention, k is determined from the read length of the sequencing-by-synthesis reaction and/or the length of the motif.

According to an embodiment of the present invention, k is determined by the following formula: k = ⌊b/t⌋, where b is the read length of the sequencing-by-synthesis reaction and t is the length of the motif sequence. The maximum k for each STR locus can thus be derived from the locus and the sequencing platform used, and from it the minimum number of kinds of second sequencing primer required to test each locus. Sequencing-based genotyping of a specified STR can then rely on that minimum set of second sequencing primers, which markedly lowers the detection cost and is of considerable industrial value.

According to an embodiment of the present invention, a is an integer not greater than k, and/or d takes multiple consecutive integer values in the range 0 to 10.

According to an embodiment of the present invention, the solid-phase surface also carries the first sequencing primer, denoted L-(S0)x, where L attaches the first sequencing primer to the solid-phase surface. That is, in addition to the second sequencing primer, the solid-phase surface carries the first sequencing primer, whose (S0)x is attached to the surface through L.

According to an embodiment of the present invention, x is 3 to 20 or 3 to 10. In some embodiments, x is an integer.

According to an embodiment of the present invention, L does not contain an oligonucleotide, or L contains an oligonucleotide, or L contains at least one of a polynucleotide, a polyamino acid, polyethylene glycol, and a polyethylene glycol-polyamino acid copolymer.

According to an embodiment of the present invention, the first sequencing primer is longer than 20 nt, and/or the second sequencing primer is longer than 20 nt.

According to an embodiment of the present invention, the library does not need to be amplified on the solid-phase surface before sequencing-by-synthesis is performed on it on that surface.

According to an embodiment of the present invention, the sequencing-by-synthesis is performed on a single-molecule sequencing platform.

According to an embodiment of the present invention, the first sequencing primer and the second sequencing primer are each independently immobilized on the same solid-phase surface or on different solid-phase surfaces.

According to an embodiment of the present invention, for a given short tandem repeat, the sequencing data include first sequencing data corresponding to the first sequencing primer and/or second sequencing data corresponding to the second sequencing primer, and the first sequencing data contain no fewer reads than the second sequencing data. Specifically, for a given short tandem repeat, the sequencing data contain at least 100,000, at least 1 million, or at least 10 million sequencing reads. At such sequencing throughput, even when the read length is far shorter than the total length of the STR, the method can use sequence frequencies to resolve the repeat number accurately, for example to distinguish the genotypes (STR)₆₂ and (STR)₆₃.

According to an embodiment of the present invention, the short tandem repeat sequencing method further includes: step 10: aligning the sequencing data with a reference sequence so as to classify the multiple reads into test-through reads and non-test-through reads, where a test-through read contains at least part of the specific sequence or of its complementary sequence, and a non-test-through read contains neither the specific sequence nor its complementary sequence; and step 12: determining, from the test-through reads and non-test-through reads, the number of motif sequences contained in the short tandem repeat.

As mentioned above, the first sequencing primer contains a polynucleotide sequence formed by several repeat units (motif sequences) of the target STR joined in tandem. Its matching to different positions of a library containing the target STR (that is, misaligned hybridization) is random to some degree, so capturing or hybridizing the library with the first sequencing primer produces a series of hybridization complexes. Extending the first sequencing primer to sequence these complexes then yields sequencing data containing multiple reads that correspond to the complexes, that is, to different positions of the target STR. Because the sequencing length is limited, two situations occur: the specific sequence is reached (yielding a test-through read) or the specific sequence is not reached (yielding a non-test-through read). Aligning the sequencing data with a reference sequence and analyzing the result determines the complete sequence of the target STR, namely the number of repeats of its motif sequence, and thus genotypes the STR.

The second sequencing primer contains a sequence corresponding to the specific sequence or its complementary sequence, together with a polynucleotide sequence formed by one or more repeat units (motif sequences) of the target STR in tandem. For example, when a second sequencing primer 5'-S1-(S0)y-3' whose 5' end is immobilized on a designated surface is used to capture or hybridize a library containing the target STR, a hybridization complex that is double-stranded at the 5' end is obtained; in other words, one end of the region where the second sequencing primer and the library hybridize is fixed. The hybridization complex can then be sequenced, and the sequencing data will contain multiple reads corresponding to these hybridization complexes, that is, to different positions of the target STR. Because the sequencing length is limited, two situations occur: the specific sequence is reached (yielding a test-through read) or the specific sequence is not reached (yielding a non-test-through read). Aligning the sequencing data with a reference sequence and analyzing the result determines the complete sequence of the target STR, namely the number of repeats of its motif sequence, and thus genotypes the STR.

According to an embodiment of the present invention, step 10 includes: step 102: aligning the read with a first reference sequence, where a read that has a segment matching the first reference sequence, with the segment being not less than 10 bp long, is indicated to be a test-through read; and step 104: aligning the read with a second reference sequence, where a read that has a segment matching the second reference sequence, with the segment being at least 7 bp long, is indicated to be a non-test-through read. The first reference sequence contains at least part of the specific sequence or of its complementary sequence, and the second reference sequence is (S0)p, where p is an integer from 2 to 100.
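Steps 102 and 104 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it uses exact longest-common-substring matching from the standard library rather than an error-tolerant aligner, it checks the first reference before the second, and the flanking sequence, the function names and the default p are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def longest_match(read: str, reference: str) -> int:
    """Length of the longest exact substring shared by read and reference."""
    matcher = SequenceMatcher(None, read, reference, autojunk=False)
    return matcher.find_longest_match(0, len(read), 0, len(reference)).size


def classify_read(read: str, specific_seq: str, motif: str, p: int = 25) -> str:
    """Classify a read as test-through, non-test-through or unclassified.

    Thresholds follow the text: a match of at least 10 bp to the first
    reference (the specific sequence) marks a test-through read; otherwise a
    match of at least 7 bp to the second reference (S0)p marks a
    non-test-through read. Mismatch tolerance is NOT modeled here.
    """
    if longest_match(read, specific_seq) >= 10:
        return "test-through"
    if longest_match(read, motif * p) >= 7:
        return "non-test-through"
    return "unclassified"


# Hypothetical flanking (specific) sequence, for illustration only.
FLANK = "CCTGTTCCTCCCTTATTTCCC"
print(classify_read("AATG" * 3 + FLANK[:12], FLANK, "AATG"))
print(classify_read("AATG" * 10, FLANK, "AATG"))
```

A production pipeline would replace `longest_match` with an aligner that honors the configured error rate.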

According to an embodiment of the present invention, the error rate permitted in the alignment is set to be less than or equal to 0.1 or 0.2.

According to an embodiment of the present invention, for sequencing data generated by the first sequencing primer that correspond to a given short tandem repeat, step 12 includes: step 122: determining the number of test-through reads and, optionally, the number of non-test-through reads; and step 124: determining, from the number of test-through reads, the number of motif sequences contained in the short tandem repeat.

According to an embodiment of the present invention, step 124 includes: step 1242: determining, from the number of test-through reads, the proportion of test-through reads in the sequencing data generated by the first sequencing primer that correspond to the given short tandem repeat; and step 1244: determining, from that proportion, the number of motif sequences contained in the short tandem repeat.

According to an embodiment of the present invention, step 1244 includes determining, from the proportion and according to a predetermined standard relationship between proportion and motif number, the number of motif sequences contained in the short tandem repeat.

Multiple sequencing libraries with known motif numbers are sequenced and analyzed in advance with the corresponding first sequencing primer to determine the proportion of test-through reads in the sequencing data produced. From these proportions and motif numbers, the standard relationship between proportion and motif number is established, for example as a function relating the two.
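One simple way to apply such a standard relationship is a nearest-neighbor lookup. The sketch below is an illustration under assumptions: the calibration values are invented placeholders rather than measured data, and the function name is hypothetical. A fitted function could be used instead, as the text notes.

```python
from typing import Dict


def infer_motif_number(ratio: float, calibration: Dict[int, float]) -> int:
    """Return the motif number whose calibrated test-through ratio is closest."""
    if not calibration:
        raise ValueError("calibration table is empty")
    return min(calibration, key=lambda n: abs(calibration[n] - ratio))


# Placeholder calibration: motif number -> test-through ratio observed when
# sequencing a library of known genotype (values invented for illustration).
CALIBRATION = {10: 1.00, 15: 0.83, 20: 0.59, 25: 0.45, 30: 0.37}

print(infer_motif_number(0.61, CALIBRATION))
```

Nearest-neighbor lookup only resolves genotypes present in the calibration set; interpolating between calibration points would need a fitted model.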

Take the first sequencing primer (S0)₄ as an example. When it is immobilized on a chip, a sample with genotype (S0)n gives rise to (n-3) hybridization patterns. In the subsequent single-molecule sequencing, an unlimited read length would yield n-3 kinds of sequence, each with a frequency of 1/(n-3). When the read length is limited and the total length of the STR exceeds it, let the read length be L (bp), the number of bases in S0 be x, and the repeat number be n. The number of non-test-through sequence types is then given by a rounded-down expression [formula not reproduced] (a value less than or equal to 0 is counted as 0 types); the number of test-through sequence types, the non-test-through ratio and the total test-through ratio follow from it [formulas not reproduced]; and each individual test-through sequence accounts for a proportion of 1/(n-3).
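The counting argument can be illustrated by enumerating the hybridization positions directly. Because the formulas themselves are not reproduced in this text, the sketch below is an independent model built on labeled assumptions, not the patent's expressions: every hybridization position is taken to be equally likely, sequencing starts at the primer's 3' end, and a read counts as test-through only if it covers all remaining repeats plus at least `min_flank` bases of the specific sequence. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def first_primer_test_through(n, primer_repeats, motif_len, read_len, min_flank=1):
    """Count test-through and non-test-through hybridization positions.

    Returns (test_through, non_test_through, test_through_ratio).
    """
    if n < primer_repeats:
        raise ValueError("primer is longer than the repeat region")
    positions = n - primer_repeats + 1          # e.g. n - 3 for an (S0)4 primer
    through = 0
    for remaining in range(positions):          # repeats left after the primer
        if remaining * motif_len + min_flank <= read_len:
            through += 1
    return through, positions - through, Fraction(through, positions)


# (AATG)4 primer with 40 bp reads, for a 10-repeat and a 20-repeat allele.
print(first_primer_test_through(10, 4, 4, 40))
print(first_primer_test_through(20, 4, 4, 40))
```

Under these assumptions a short allele is read through from every position, while a long allele is read through only from the positions nearest the flank, which is why the test-through ratio falls as the repeat number grows.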

Assuming a sequencing read length of 40 and an allele distribution of 6 to n at the TH01 locus, and using (AATG)₆ as the first sequencing primer, the expected sequencing results are as follows:

Table 1. Test-through ratio analysis

According to an embodiment of the present invention, for sequencing data generated by the second sequencing primer that correspond to a given short tandem repeat, step 12 includes: step 126: determining the sequence of the short tandem repeat from the second sequencing primer that corresponds to the test-through read and from the sequence information of the test-through read.

As mentioned above, the second sequencing primer has the following nucleotide sequence: S1-(S0)y. Suppose S0 is AATG (4 nt), the maximum repeat number of the short tandem repeat to be tested is 63, and the sequencing length of the sequencing-by-synthesis is 40 bp. The test-through interval (k) of the short tandem repeat is then at most 40/4 = 10, and the values of y can be determined from the example formula y = a + k × d to give a series of second sequencing primers. For example, set a = 4 and k = 10 and let d take consecutive integer values starting from 0 until y is not less than 63, which ensures that, under the current sequencing parameters, some second sequencing primer produces test-through reads. Specifically, the following series of second sequencing primers is obtained:

STR upstream sequence + [STR]₄; STR upstream sequence + [STR]₁₄; STR upstream sequence + [STR]₂₄;

STR upstream sequence + [STR]₃₄; STR upstream sequence + [STR]₄₄; STR upstream sequence + [STR]₅₄;

[STR]₄ + STR downstream sequence; [STR]₁₄ + STR downstream sequence; [STR]₂₄ + STR downstream sequence;

[STR]₃₄ + STR downstream sequence; [STR]₄₄ + STR downstream sequence; [STR]₅₄ + STR downstream sequence.
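The primer series above follows from y = a + k × d with k taken as the read length divided by the motif length. A minimal sketch, assuming integer division for k and d = 0 to 5 (the range that reproduces the listed primers); the function name is hypothetical.

```python
def second_primer_repeat_counts(read_len, motif_len, a, d_values):
    """Return the y values y = a + k*d, with k = read_len // motif_len."""
    k = read_len // motif_len
    if k <= 0:
        raise ValueError("read length must be at least one motif long")
    return [a + k * d for d in d_values]


ys = second_primer_repeat_counts(read_len=40, motif_len=4, a=4, d_values=range(6))
for y in ys:
    # One upstream-anchored and one downstream-anchored primer per y value.
    print(f"STR upstream sequence + [STR]{y}; [STR]{y} + STR downstream sequence")
```

With a longer read length or a shorter motif, k grows and fewer primers are needed to span the same range of repeat numbers.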

The proportion of test-through reads (the test-through ratio) is determined from the number of test-through reads and the total number of reads. The test-through ratio indicates the interval in which the repeat number of the STR motif sequence lies. For example: a test-through ratio of 1 indicates a repeat number between 4 and 14; a ratio of 1/2, between 14 and 24; a ratio of 1/3, between 24 and 34; a ratio of 1/4, between 34 and 44; a ratio of 1/5, between 44 and 54; a ratio of 1/6, between 54 and 64; and a ratio below 1/6, above 64. The repeat number can then be determined from the test-through read and the corresponding second sequencing primer: if the test-through read contains m motif sequences, or sequences corresponding to the motif sequence, the short tandem repeat contains (baseline value + m) repeats of the motif sequence, where the "baseline value" is the y value of the second sequencing primer that produced the test-through read. For example, when the repeat number of the STR is determined from the second sequencing primer "STR upstream sequence + [STR]₄" and the test-through read it produces, the baseline value is 4.
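The mapping from test-through ratio to repeat-number interval, and the final "baseline value + m" step, can be written compactly. This is a sketch assuming the observed ratio is close to 1/j for a whole number j and using the example parameters a = 4 and k = 10; the function names are hypothetical.

```python
from fractions import Fraction


def interval_from_ratio(ratio, a=4, k=10):
    """Map a test-through ratio of 1/j to the interval (a + k*(j-1), a + k*j)."""
    j = round(1 / Fraction(ratio))
    if j < 1:
        raise ValueError("ratio must be in (0, 1]")
    return a + k * (j - 1), a + k * j


def repeat_count(baseline_y, motifs_in_read):
    """Repeat number = y of the primer that read through + motifs in its read."""
    return baseline_y + motifs_in_read


print(interval_from_ratio(Fraction(1, 5)))   # interval for a ratio of 1/5
print(repeat_count(44, 4))                   # baseline 44 plus 4 motifs read
```

Real ratios will be noisy, so rounding 1/ratio to the nearest integer is only a first approximation.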

(1) Example: using the primer series above to test an STR whose motif sequence is repeated 10 times

The hybridizing primers that can read through are STR upstream sequence + [STR]₄ and [STR]₄ + STR downstream sequence, and six AATG motif sequences must be read before the read passes through the repeat region. The remaining primers either hybridize poorly to the library or cannot be sequenced after hybridization.

(2) Example: using the primer series above to test an STR whose motif sequence is repeated 48 times

The primers that can hybridize and be extended to give reads are: STR upstream sequence + [STR]₄, STR upstream sequence + [STR]₁₄, STR upstream sequence + [STR]₂₄, STR upstream sequence + [STR]₃₄, STR upstream sequence + [STR]₄₄, [STR]₄ + STR downstream sequence, [STR]₁₄ + STR downstream sequence, [STR]₂₄ + STR downstream sequence, [STR]₃₄ + STR downstream sequence, and [STR]₄₄ + STR downstream sequence. Of these, only two can read through: STR upstream sequence + [STR]₄₄ and [STR]₄₄ + STR downstream sequence, and the test-through reads they produce account for 1/5 of all reads. It follows that the STR contains more than 44 repeats and that the corresponding primers must read 4 more motif sequences to read through, so the motif sequence of this STR is repeated 48 times.
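Both worked examples can be reproduced by a small simulation. It rests on assumptions that the source does not state explicitly: a primer hybridizes and extends only if y ≤ n, it reads through only if the remaining repeats occupy fewer bases than the read length, and every hybridizing primer contributes equally many reads. The function name is hypothetical.

```python
from fractions import Fraction


def simulate_second_primers(n, ys, motif_len=4, read_len=40):
    """Simulate which S1-(S0)y primers hybridize to and read through (S0)n.

    Returns (hybridizing_ys, test_through_ys, test_through_ratio, m), where m
    is the number of motifs the first test-through primer reads before the flank.
    """
    hybridizing = [y for y in ys if y <= n]
    through = [y for y in hybridizing if (n - y) * motif_len < read_len]
    if not through:
        raise ValueError("no primer reads through; extend the y series")
    ratio = Fraction(len(through), len(hybridizing))
    m = n - through[0]
    return hybridizing, through, ratio, m


SERIES = [4, 14, 24, 34, 44, 54]
for repeats in (10, 48):
    hyb, thr, ratio, m = simulate_second_primers(repeats, SERIES)
    # The genotype is recovered as baseline value (y) + m.
    print(repeats, hyb, thr, ratio, thr[0] + m)
```

For 10 repeats only the y = 4 primer hybridizes and it reads 6 motifs; for 48 repeats five primers hybridize, only y = 44 reads through, and it reads 4 motifs, matching the two cases in the text.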

Method for determining short tandem repeat sequences

In yet another aspect, the present invention provides a method for determining a short tandem repeat. According to an embodiment of the present invention, the method includes: (i) obtaining sequencing data of the short tandem repeat, the sequencing data having been obtained by the short tandem repeat sequencing method described above; (ii) aligning the sequencing data with a reference sequence so as to classify the multiple sequencing reads into test-through reads and non-test-through reads; and (iii) determining, from the test-through reads and non-test-through reads, the short tandem repeat analysis result for the sample to be tested.

According to an embodiment of the present invention, in step (ii) the test-through reads and non-test-through reads are identified as follows: (ii-1) the read is aligned with a first reference sequence, and a read having a segment that matches the first reference sequence, with the segment being not less than 10 bp long, is indicated to be a test-through read; (ii-2) the read is aligned with a second reference sequence, and a read having a segment that matches the second reference sequence, with the segment being at least 7 bp long, is indicated to be a non-test-through read. The first reference sequence contains at least part of the specific sequence or of its complementary sequence; the second reference sequence is (S0)p, where p is an integer from 2 to 100.

According to an embodiment of the present invention, the error rate permitted in the alignment is set to be less than or equal to 0.1 or 0.2.

According to an embodiment of the present invention, for sequencing data generated by a given first sequencing primer, step (iii) further includes: (iii-1) determining the number of test-through reads and, optionally, the number of non-test-through reads; and (iii-2) determining, from the number of test-through reads, the number of motif sequences in the short tandem repeat.

According to an embodiment of the present invention, step (iii-2) further includes: (iii-2-1) determining, from the number of test-through reads, the proportion of test-through reads in the sequencing data generated by the first sequencing primer that correspond to the given short tandem repeat; and (iii-2-2) determining, from that proportion, the number of motif sequences contained in the short tandem repeat.

According to an embodiment of the present invention, in step (iii-2-2) the number of motif sequences contained in the short tandem repeat is determined from the proportion according to a predetermined standard relationship between proportion and motif sequence number.

Multiple sequencing libraries with known motif sequence numbers are sequenced and analyzed in advance with the corresponding first sequencing primer to determine the proportion of test-through reads in the sequencing data produced. From these proportions and motif sequence numbers, the standard relationship between proportion and motif sequence number is established, for example as a function relating the two.

Table 1. Test-through ratio analysis

According to an embodiment of the present invention, for sequencing data generated by the second sequencing primer that correspond to a given short tandem repeat, step (iii) further includes: (iii-a) determining the number of motif sequences contained in the short tandem repeat from the second sequencing primer that corresponds to the test-through read and from the sequence information of the test-through read.

As mentioned above, the second sequencing primer has the following nucleotide sequence: S1-(S0)y. Suppose S0 is AATG (4 nt), the maximum repeat number of the short tandem repeat to be tested is 63, and the sequencing length of the sequencing-by-synthesis is 40 bp. The test-through interval (k) of the short tandem repeat is then at most 40/4 = 10, and the values of y can be determined from the example formula y = a + k × d to give a series of second sequencing primers. For example, set a = 4 and k = 10 and let d take consecutive integer values starting from 0 until y is not less than 63, which ensures that, under the current sequencing parameters, some second sequencing primer produces test-through reads. Specifically, the following series of second sequencing primers is obtained: STR upstream sequence + [STR]₄; STR upstream sequence + [STR]₁₄; STR upstream sequence + [STR]₂₄; and so on, following the same series as described for the sequencing method above.

需要说明的是，前面针对短串联重复序列测序方法所描述的特征和优点，同样适用于该确定短串联重复序列的方法，在此不再赘述。It should be noted that the features and advantages described above for sequencing short tandem repeat sequences also apply to this method for identifying short tandem repeat sequences, and will not be repeated here.

芯片Chip

在本发明的又一方面，本发明提出了一种芯片。根据本发明的实施例，芯片包括：基底，具有表面；第一测序引物，第一测序引物固定在表面上，并且，第一测序引物具有下列核苷酸序列：(S0)x，其中，S0表示短串联重复序列的基序序列或其互补序列，x不小于1。In another aspect of the invention, a chip is provided. According to an embodiment of the invention, the chip includes: a substrate having a surface; a first sequencing primer fixed on the surface, and the first sequencing primer having the following nucleotide sequence: (S0)x, wherein S0 represents a motif sequence of a short tandem repeat sequence or its complementary sequence, and x is not less than 1.

本发明通过设计第一测序引物使其如为待测序列的一部分，例如为由数个目标STR的重复单位(基序序列)串联形成的多核苷酸序列，而鉴于测试发现该第一测序引物杂交匹配到包含目标STR的文库的不同位置(亦即错位杂交)具有一定的随机性，因此，利用该第一测序引物捕获或杂交该文库，可以理解地，会产生一系列杂交复合物。进一步地，延伸该第一测序引物对该些杂交复合物进行测序，测序数据将包含对应该些杂交复合物或者说对应目标STR不同位置的多种读段(reads)，进而，通过对该测序数据的分析可以确定目标STR的完整序列，亦即确定该STR的基序序列的重复次数，实现对该STR的分型。同时具有节省测序时间和测序成本等优点。This invention designs a first sequencing primer that is part of the sequence to be tested, for example, a polynucleotide sequence formed by the tandem repeat units (motif sequences) of several target STRs. Given that testing has shown that the first sequencing primer's hybridization matching to different positions in a library containing the target STR (i.e., mishybridization) has a certain degree of randomness, capturing or hybridizing this library using the first sequencing primer will understandably generate a series of hybridization complexes. Furthermore, extending the first sequencing primer to sequence these hybridization complexes will yield sequencing data containing multiple reads corresponding to these hybridization complexes, or rather, to different positions of the target STR. Analysis of this sequencing data can then determine the complete sequence of the target STR, i.e., the number of repeats of the motif sequence, thus enabling STR genotyping. This method also offers advantages such as saving sequencing time and sequencing costs.

根据本发明的实施例，第一测序引物具有下列结构：L-(S0)x，其中，L表示连接基团，用于将第一测序引物固定在芯片上。According to an embodiment of the present invention, the first sequencing primer has the following structure: L-(S0)x, wherein L represents a linker group used to immobilize the first sequencing primer on a chip.

根据本发明的实施例，x为3～20或3～10。According to an embodiment of the present invention, x is 3 to 20 or 3 to 10.

根据本发明的实施例，x为整数。According to an embodiment of the present invention, x is an integer.

根据本发明的实施例，芯片还包括第二测序引物，第二测序引物固定在表面上，并且，第二测序引物具有下列核苷酸序列：S1-(S0)y，其中，S1为特异性序列或与特异性序列对应的序列，S0为短串联重复序列的基序序列或其互补序列，特异性序列能够指示短串联重复序列在参考序列上的位置，y大于或等于0。According to an embodiment of the present invention, the chip further includes a second sequencing primer, which is fixed on the surface and has the following nucleotide sequence: S1-(S0)y, wherein S1 is a specific sequence or a sequence corresponding to the specific sequence, S0 is a motif sequence of a short tandem repeat sequence or its complementary sequence, the specific sequence is capable of indicating the position of the short tandem repeat sequence on the reference sequence, and y is greater than or equal to 0.

发明人设计第二测序引物使其例如为包含特异性序列或与特异性序列对应的序列以及一个或数个目标STR的重复单位(基序序列)串联形成的多核苷酸序列，例如，利用5’端固定在指定表面的第二测序引物5’-S1-(S0)y-3’捕获或杂交包含目标STR的文库，能获得5’端为双链的杂交复合物，也就是说，第二测序引物和该文库的杂交互补位置一端是固定的，如此，可进一步对该杂交复合物进行测序，以确定该STR的序列，实现对该STR的检测。The inventors designed a second sequencing primer that is, for example, a polynucleotide sequence consisting of a specific sequence or a sequence corresponding to the specific sequence and one or more repeating units (motif sequences) of the target STR tandemly. For example, by using a second sequencing primer 5'-S1-(S0)y-3' with its 5' end fixed on a designated surface to capture or hybridize a library containing the target STR, a hybridization complex with a double strand at its 5' end can be obtained. That is, the complementary hybridization position between the second sequencing primer and the library is fixed at one end. Thus, the hybridization complex can be further sequenced to determine the sequence of the STR and achieve the detection of the STR.

更具体地，设计使该第二测序引物包含一组不同长度的序列，并且使其中的每种长度的序列分别包含特异性序列或与特异性序列对应的序列以及不同长度的(S0)y，亦即y具有多个不同取值的一组序列，可以理解地，利用该组序列捕获或杂交包含目标STR的文库，能获得5’端为双链的一系列杂交复合物，对该些杂交复合物进行测序，可以理解地，测序数据将包含对应该些杂交复合物或者说对应目标STR不同位置的多种读段(reads)，进而，通过分析该测序数据可以确定该STR的完整序列，亦即确定该STR的基序序列的重复次数，实现对该STR的分型。More specifically, the second sequencing primer is designed to contain a set of sequences of different lengths, and each sequence of length contains a specific sequence or a sequence corresponding to the specific sequence, as well as (S0)y of different lengths, i.e., a set of sequences with multiple different values of y. Understandably, by using this set of sequences to capture or hybridize a library containing the target STR, a series of hybridization complexes with double strands at the 5' end can be obtained. Sequencing these hybridization complexes will, understandably, produce sequencing data containing multiple reads corresponding to these hybridization complexes or different positions of the target STR. Furthermore, by analyzing this sequencing data, the complete sequence of the STR can be determined, i.e., the number of repetitions of the motif sequence of the STR can be determined, thus achieving genotyping of the STR.

根据本发明的实施例，连接在表面的第二测序引物具有下列结构：L-S1-(S0)y，其中，通过L将第二测序引物固定在表面上。According to an embodiment of the present invention, the second sequencing primer connected to the surface has the following structure: L-S1-(S0)y, wherein the second sequencing primer is fixed to the surface by L.

根据本发明的实施例，S1衍生自下列的至少之一：(a)所述短串联重复序列的上游序列；(b)所述短串联重复序列的下游序列；和(c)(a)或(b)的互补序列。According to an embodiment of the invention, S1 is derived from at least one of the following: (a) an upstream sequence of the short tandem repeat sequence; (b) a downstream sequence of the short tandem repeat sequence; and (c) a complementary sequence of (a) or (b).

根据本发明的实施例，针对给定的一种短串联重复序列，所述芯片上携带多种第二测序引物，所述多种第二测序引物分别具有不同的y取值。According to an embodiment of the present invention, for a given short tandem repeat sequence, the chip carries a variety of second sequencing primers, each of which has a different y value.

根据本发明的实施例，y取值是基于下列公式确定的：y＝a+k×d，其中，a为第一预定常数，且a为整数，d为多个在0～100范围内的整数，k为第二预定常数。According to an embodiment of the present invention, the value of y is determined based on the following formula: y = a + k × d, where a is a first predetermined constant and a is an integer, d is a plurality of integers in the range of 0 to 100, and k is a second predetermined constant.

根据本发明的实施例，在利用所述芯片进行边合成边测序以测定所述短串联重复序列的情景中，k是基于所述边合成边测序的读长和/或所述基序序列的长度确定的。According to an embodiment of the present invention, in a scenario where sequencing-by-synthesis is performed using the chip to determine the short tandem repeat sequence, k is determined based on the read length of the sequencing-by-synthesis and/or the length of the motif sequence.

根据本发明的实施例，k是基于下列公式确定的：其中，b为所述边合成边测序反应的读长，t为所述基序序列的长度。According to an embodiment of the present invention, k is determined based on the following formula: where b is the read length of the sequencing-by-synthesis reaction, and t is the length of the motif sequence.

根据本发明的实施例，L不含寡核苷酸，或者L包含寡核苷酸，或者所述L包含多聚核苷酸、聚氨基酸、聚乙二醇和聚乙二醇-聚氨基酸共聚物中的至少一种。According to embodiments of the present invention, L is free of oligonucleotides, or L contains oligonucleotides, or L contains at least one of polynucleotides, polyamino acids, polyethylene glycol, and polyethylene glycol-polyamino acid copolymers.

根据本发明的实施例，芯片用于单分子测序平台。According to an embodiment of the present invention, the chip is used in a single-molecule sequencing platform.

此外，本发明提出了另一种芯片。根据本发明的实施例，芯片包括：基底，具有表面；第二测序引物，第二测序引物固定在表面上，并且，第二测序引物具有下列核苷酸序列：S1-(S0)y，其中，S1为特异性序列或与特异性序列对应的序列，S0为短串联重复序列的基序序列或其互补序列，特异性序列能够指示短串联重复序列在参考序列上的位置，y大于或等于0。Furthermore, the present invention proposes another chip. According to an embodiment of the present invention, the chip includes: a substrate having a surface; and a second sequencing primer fixed on the surface, the second sequencing primer having the following nucleotide sequence: S1-(S0)y, wherein S1 is a specific sequence or a sequence corresponding to the specific sequence, S0 is a motif sequence of a short tandem repeat sequence or its complementary sequence, the specific sequence is capable of indicating the position of the short tandem repeat sequence on a reference sequence, and y is greater than or equal to 0.

根据本发明的实施例，针对给定的一种短串联重复序列，芯片上携带多种第二测序引物，多种第二测序引物分别具有不同的y取值。According to an embodiment of the present invention, for a given short tandem repeat sequence, a chip carries a variety of second sequencing primers, each of which has a different y value.

根据本发明的实施例，在利用芯片进行边合成边测序以测定短串联重复序列的情景中，k是基于边合成边测序反应的读长和/或基序序列的长度确定的。According to an embodiment of the present invention, in a scenario where sequencing-by-synthesis is performed using a chip to determine short tandem repeat sequences, k is determined based on the read length of the sequencing-by-synthesis reaction and/or the length of the motif sequence.

根据本发明的实施例，k是基于下列公式确定的：其中，b为边合成边测序反应的读长，t为基序序列的长度。According to an embodiment of the present invention, k is determined based on the following formula: where b is the read length of the sequencing-by-synthesis reaction, and t is the length of the motif sequence.

根据本发明的实施例，芯片还包括第一测序引物，第一测序引物连接在表面上，并且，第一测序引物具有下列核苷酸序列：(S0)x，其中，S0表示短串联重复序列的基序序列或其互补序列，x不小于1。According to an embodiment of the present invention, the chip further includes a first sequencing primer, which is attached to the surface and has the following nucleotide sequence: (S0)x, wherein S0 represents the motif sequence of a short tandem repeat sequence or its complementary sequence, and x is not less than 1.

根据本发明的实施例，连接在表面的第一测序引物具有下列结构：L-(S0)x，其中，利用所述L将第一测序引物固定在芯片上。According to an embodiment of the present invention, the first sequencing primer connected to the surface has the following structure: L-(S0)x, wherein the first sequencing primer is fixed on the chip by means of the L.

需要说明的是，前面针对短串联重复序列测序方法或者确定短串联重复序列的方法所描述的特征和优点，同样适用于该芯片，在此不再赘述。It should be noted that the features and advantages described above for sequencing or identifying short tandem repeat sequences also apply to this chip, and will not be repeated here.

试剂盒Kit

在本发明的又一方面，本发明提出了一种试剂盒。根据本发明的实施例，试剂盒包括：(1)如前面短串联重复序列测序方法中的第一测序引物和/或所述第二测序引物；或(2)如前面的芯片。由此，利用该试剂盒可以准确地实现高通量短读长STR序列测序，节省测序时间和测序成本。In another aspect of the invention, a kit is provided. According to an embodiment of the invention, the kit includes: (1) a first sequencing primer and/or a second sequencing primer as described in the preceding short tandem repeat sequencing method; or (2) a chip as described above. Thus, this kit can be used to accurately perform high-throughput short-read STR sequence sequencing, saving sequencing time and costs.

需要说明的是，前面针对短串联重复序列测序方法和芯片所描述的特征和优点，同样适用于该试剂盒，在此不再赘述。It should be noted that the features and advantages described above for the short tandem repeat sequence sequencing method and chip also apply to this kit, and will not be repeated here.

识别个体的方法Methods for identifying individuals

在本发明的又一方面，本发明提出了一种识别个体的方法。根据本发明的实施例，方法包括：根据前面短串联重复序列测序方法，确定待测样本中的多种给定的短串联重复序列的序列，待测样本包含核酸；基于序列，确定待测样本源自的一个或多个个体。由此，利用该方法可以准确地识别个体，例如获知遗传关系等。In yet another aspect, the present invention proposes a method for identifying an individual. According to an embodiment of the invention, the method includes: determining, by the short tandem repeat sequencing method described above, the sequences of a plurality of given short tandem repeat sequences in a sample to be tested, the sample containing nucleic acid; and determining, based on those sequences, the one or more individuals from which the sample originates. The method can thus identify individuals accurately, for example to establish genetic relationships.

上述任一实施例提供的STR检测方法或试剂盒，可用于各种基于STR检测的应用，例如法医个体识别、亲子鉴定，也可辅助遗传学变异诊断或筛查，如辅助产前筛查等。The STR testing method or kit provided in any of the above embodiments can be used for various STR testing-based applications, such as forensic individual identification and paternity testing, and can also assist in the diagnosis or screening of genetic variations, such as assisting in prenatal screening.

需要说明的是，前面针对短串联重复序列测序方法所描述的特征和优点，同样适用于该方法，在此不再赘述。It should be noted that the features and advantages described above for short tandem repeat sequencing methods also apply to this method, and will not be repeated here.

下面将结合实施例对本发明的方案进行解释。本领域技术人员将会理解，下面的实施例仅用于说明本发明，而不应视为限定本发明的范围。实施例中未注明具体技术或条件的，按照本领域内的文献所描述的技术或条件或者按照产品说明书进行。所用试剂或仪器未注明生产厂商者，均为可以通过市购获得的常规产品。除非另有说明，本文示例的单链核酸分子或双链核酸分子的具体序列，包括文库、插入片段、核酸片段、目标序列、位点、基序序列、多核苷酸序列、接头、测序引物、固定链、探针或杂交链等，均是以5'至3'方向从左到右书写表示的。The present invention will be explained below with reference to embodiments. Those skilled in the art will understand that the following embodiments are for illustrative purposes only and should not be considered as limiting the scope of the invention. Where specific techniques or conditions are not specified in the embodiments, they are performed according to the techniques or conditions described in the literature in the art or according to the product instructions. Reagents or instruments whose manufacturers are not specified are all commercially available conventional products. Unless otherwise stated, the specific sequences of single-stranded or double-stranded nucleic acid molecules exemplified herein, including libraries, insert fragments, nucleic acid fragments, target sequences, sites, motif sequences, polynucleotide sequences, adapters, sequencing primers, fixed strands, probes, or hybridization strands, are all written from left to right in a 5' to 3' direction.

实施例1Example 1

在该实施例中，按照下列方法确定STR重复数：In this embodiment, the number of STR repeats is determined according to the following method:

1、构建文库1. Library construction

参考图4的建库流程构建文库，文库结构参见图5。The library was constructed following the library preparation workflow shown in Figure 4; the library structure is shown in Figure 5.

2、固定链设计2. Fixed strand design

固定链(测序引物)设计成“L-[S0]₅”以及“L-[S0]₆”，其中L为TTTTTTTTTT，S0为AATG。具体序列如表1所示。The fixed strands (sequencing primers) were designed as “L-[S0]₅” and “L-[S0]₆”, where L is TTTTTTTTTT and S0 is AATG. The specific sequences are shown in Table 1.
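As a small illustration, the two fixed-strand sequences can be assembled from the stated components. The function and variable names are hypothetical; sequences are written 5' to 3', following the convention stated for the examples.

```python
LINKER = "TTTTTTTTTT"  # L, as given in this example
MOTIF = "AATG"         # S0, as given in this example


def fixed_strand(linker, motif, x):
    """Build L-(S0)x as a single 5'->3' string."""
    return linker + motif * x


# The two designs named in the text: L-[S0]5 and L-[S0]6.
primers = {x: fixed_strand(LINKER, MOTIF, x) for x in (5, 6)}
for x, seq in primers.items():
    print(x, len(seq), seq)
```

The resulting strands are 30 nt and 34 nt long, of which 20 nt and 24 nt respectively are motif repeats.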

3、杂交链(合成靶)设计3. Hybridization strand (synthetic target) design

杂交链设计如下表所示。The design of the hybridization strands is shown in the table below.

表1固定链和杂交链设计Table 1. Design of the fixed strands and hybridization strands

4、操作流程4. Operating Procedures

将固定链按照单分子固定流程，固定到芯片上，然后将构建好的文库(杂交链)杂交到上述芯片上。杂交完成之后，根据单分子测序流程从固定链的3'端进行测序。单分子荧光测序仪器：GenoCare1600，80轮双色测序，参考表2。The immobilized strand was fixed onto the chip using a single-molecule immobilization procedure, and then the constructed library (hybridized strand) was hybridized onto the chip. After hybridization, sequencing was performed from the 3' end of the immobilized strand according to the single-molecule sequencing procedure. Single-molecule fluorescence sequencing instrument: GenoCare 1600, 80 rounds of two-color sequencing, refer to Table 2.

表2测序条件Table 2 Sequencing conditions

5、分析5. Analysis

1)下机Fasta数据首先使用局部比对方法与第一参考序列“AGGGAAATAAGGGAGGAACAGGCCAATGGGAATC”对比。若比对结果错误率小于等于预设阈值(0.1或0.2)且相匹配的序列长度大于等于10bp，则认为是Repeat区域测穿的序列。1) The FASTA data output by the sequencer are first aligned, using a local alignment method, to the first reference sequence “AGGGAAATAAGGGAGGAACAGGCCAATGGGAATC”. If the alignment error rate is less than or equal to the preset threshold (0.1 or 0.2) and the matched length is at least 10 bp, the read is considered a test-through read, that is, a read that extends through the repeat region.

2)将剩余序列与第二参考序列“AATGAATG”比对，若对比结果错误率小于0.2，且相匹配的序列长度大于等于7bp，则认为是测到repeat区域且未测穿。2) The remaining reads are aligned to the second reference sequence “AATGAATG”. If the alignment error rate is less than 0.2 and the matched length is at least 7 bp, the read is considered to have reached the repeat region without reading through it (a non-test-through read).

3)对步骤二的序列进行分析，先找到序列中repeat区域，即去除序列中“AGGGAAATAAGGGAGGAACAGGCCAATGGGAATC”序列片段。再由程序判断Repeat的个数。分别统计Repeat为0，1，2，3…的reads个数。3) The reads from step 2 are analyzed. The repeat region in each read is located first, that is, any fragment of the sequence “AGGGAAATAAGGGAGGAACAGGCCAATGGGAATC” is removed from the read; the program then determines the number of repeats. The numbers of reads with 0, 1, 2, 3… repeats are tallied separately.

4)将步骤1的reads数除以步骤1、2的reads数之和，得到“测穿比”：即测序结果为STR上(下)游特异性序列的比率。4) The number of reads from step 1 is divided by the sum of the numbers of reads from steps 1 and 2 to give the “test-through ratio”, that is, the fraction of reads whose sequence reaches the specific sequence upstream (or downstream) of the STR.

结果如图6所示，使用(AATG)₆引物为固定链，对包含STR重复序列长度为36、40、44、48(即杂交链重复数(Repeat)个数9，10，11，12)的文库测序，得到的“测穿比”和Repeat个数成线性负相关。这一结果和理论预测情况符合。由此表明利用本方法可以确定STR重复数。另外，顺便提一下，对于具体的待测STR位点，可据其基序序列的碱基组成和聚合酶延伸反应的经验来测试确定出较适的引物长度、杂交和/或测序反应条件，以顺利获得测序数据；例如，倘若据具体基序序列设计出的引物的AT比例较高，且测序数据的分布与预测的差别较大，则可参照常规的PCR经验进行调整，例如设计更长的引物如包含更多个基序序列，和/或调整聚合反应温度等。The results are shown in Figure 6. With the (AATG)₆ primer as the fixed strand, libraries containing STR repeat regions of 36, 40, 44 and 48 nt (that is, hybridization strands with 9, 10, 11 and 12 repeats) were sequenced, and the resulting test-through ratio showed a negative linear correlation with the number of repeats. This result agrees with the theoretical prediction and indicates that the method can be used to determine the STR repeat number. In addition, for a specific STR locus to be tested, a suitable primer length and suitable hybridization and/or sequencing reaction conditions can be determined by testing, guided by the base composition of its motif sequence and by experience with polymerase extension reactions, so that sequencing data are obtained reliably. For example, if a primer designed from a specific motif sequence has a high AT content and the distribution of the sequencing data differs substantially from the prediction, adjustments can be made following conventional PCR practice, such as designing a longer primer that contains more motif copies and/or adjusting the polymerization reaction temperature.

实施例2Example 2

在该实施例中，研究在读长分布分别为avg＝40，std＝5和avg＝40，std＝10高斯分布下，若要较好区分63个重复数和62个重复数的STR序列(分别简称“63repeat”和“62repeat”)需要的通量大小。默认捕获序列repeat＝10。In this embodiment, the required throughput is investigated to effectively distinguish STR sequences with 63 repeats and 62 repeats (referred to as "63repeat" and "62repeat" respectively) under Gaussian distributions of read lengths avg=40, std=5 and avg=40, std=10. The default capture sequence repeat=10.

图7和图8分别为读长分布为avg＝40,std＝5高斯分布下，使用1M和10M Reads模拟得到的分布情况(该分布为模拟测序中剩余0Repeat的比例倒数)。Figures 7 and 8 show the simulated read distributions using 1M and 10M reads, respectively, under a Gaussian distribution with avg=40 and std=5 (this distribution is the reciprocal of the proportion of 0 repeats remaining in the simulated sequencing).

图9和图10分别为读长分布为avg＝40,std＝10高斯分布下，使用1M和10M Reads模拟得到的分布情况(该分布为模拟测序中剩余0Repeat的比例倒数)。Figures 9 and 10 show the simulated read distributions using 1M and 10M reads, respectively, under a Gaussian distribution with avg=40 and std=10 (this distribution is the reciprocal of the proportion of 0 repeats remaining in the simulated sequencing).

可见在通量达到10Mreads时，即使测序读长远低于STR总长，这一测序方法可通过序列频率准确分辨(STR)₆₂和(STR)₆₃的基因型。It can be seen that when the throughput reaches 10M reads, this sequencing method can accurately distinguish the (STR)₆₂ and (STR)₆₃ genotypes by sequence frequency, even though the sequencing read length is far shorter than the total STR length.
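The simulation model is not spelled out in this example, so the following is only a rough sketch under assumptions of its own: the capture primer (10 repeats) is assumed to hybridize in register at a uniformly random position within the STR, a read is counted as test-through when its Gaussian-distributed length exceeds the repeat bases remaining beyond the primer, and a 4 nt motif is assumed. It computes the expected test-through fraction analytically rather than by random sampling. All names and the model itself are hypothetical.

```python
import math


def normal_sf(x, mean, std):
    """P(X > x) for X ~ Normal(mean, std)."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))


def expected_test_through_fraction(total_repeats, capture_repeats=10,
                                   motif_length=4, read_mean=40.0, read_std=5.0):
    """Average, over all primer positions, of P(read length > remaining repeat bases)."""
    max_remaining = total_repeats - capture_repeats
    probabilities = [
        normal_sf(remaining * motif_length, read_mean, read_std)
        for remaining in range(max_remaining + 1)
    ]
    return sum(probabilities) / len(probabilities)


def standard_error(fraction, n_reads):
    """Binomial standard error of an observed fraction from n_reads reads."""
    return math.sqrt(fraction * (1 - fraction) / n_reads)


p62 = expected_test_through_fraction(62)
p63 = expected_test_through_fraction(63)
print("reciprocal of fraction:", 1 / p62, 1 / p63)
for n_reads in (1_000_000, 10_000_000):
    print(n_reads, "difference:", p62 - p63,
          "standard error:", standard_error(p62, n_reads))
```

Comparing the difference between the two expected fractions with the sampling error at a given read count gives a feel for how throughput affects resolution; the figures referenced in this example remain the authoritative result.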

在本说明书的描述中，参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且，描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外，在不相互矛盾的情况下，本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

尽管上面已经示出和描述了本发明的实施例，可以理解的是，上述实施例是示例性的，不能理解为对本发明的限制，本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型。Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A method for sequencing short tandem repeat sequences, characterized in that it comprises:

Obtaining a library, wherein the inserts of the library contain a short tandem repeat sequence and a specific sequence linked to the short tandem repeat sequence;

Performing sequencing-by-synthesis on the library on a solid-phase surface to obtain sequencing data comprising multiple reads, wherein the solid-phase surface carries at least one of a first sequencing primer and a second sequencing primer.

The first sequencing primer has the following nucleotide sequence:

(S0)x, where S0 represents the motif sequence of the short tandem repeat sequence or its complementary sequence, and x is not less than 1.

The second sequencing primer has the following nucleotide sequence:

S1-(S0)y, where S1 is the specific sequence or the sequence corresponding to the specific sequence, S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence, and y is greater than or equal to 0.

2. The method according to claim 1, wherein the solid-phase surface carries the first sequencing primer, as shown in:

L-(S0)x, where,

L represents a linker group, used to link the first sequencing primer to the solid surface;

Optionally, x is 3–20 or 3–10;

Optionally, x is an integer;

Optionally, the solid surface further carries the second sequencing primer, as shown below:

L-S1-(S0)y, where,

The second sequencing primer is attached to the solid surface using the L;

Optionally, the L does not contain oligonucleotides;

Optionally, the L comprises an oligonucleotide;

Optionally, the L comprises at least one of polynucleotides, polyamino acids, polyethylene glycol, and polyethylene glycol-polyamino acid copolymers;

Optionally, S1 is derived from at least one of the following:

(a) The upstream sequence of the short tandem repeat sequence;

(b) the downstream sequence of the short tandem repeat sequence; and

(c) a complementary sequence of (a) or (b);

Optionally, for a given short tandem repeat sequence, the solid surface carries a variety of second sequencing primers, each of which has a different y value;

Optionally, the value of y is determined based on the following formula:

y = a + k × d, where,

a is a first predetermined constant, and a is an integer.

d can be multiple distinct integers in the range of 0 to 100.

k is a second predetermined constant, and k is greater than 0;

Optionally, k is determined based on the read length of the sequencing-by-synthesis process and/or the length of the motif sequence;

Optionally, k is determined based on the following formula:

where,

b is the read length of the sequencing-by-synthesis process.

t is the length of the motif sequence;

Optionally, a is an integer less than k, and/or d is a plurality of consecutive integers in the range of 0 to 10.

3. The method according to claim 1, wherein the solid-phase surface carries the second sequencing primer, as shown in:

L-S1-(S0)y, where,

L represents a linker group, used to link the second sequencing primer to the solid surface;

Optionally, S1 is derived from at least one of the following:

(a) The upstream sequence of the short tandem repeat sequence;

(b) the downstream sequence of the short tandem repeat sequence; and

(c) a complementary sequence of (a) or (b);

Optionally, the value of y is determined based on the following formula: y = a + k × d, where,

a is a first predetermined constant, and a is an integer;

d can be multiple integers in the range of 0 to 100.

k is a second predetermined constant;

Optionally, k is determined based on the read length of the sequencing-by-synthesis reaction and/or the length of the motif;

Optionally, k is determined based on the following formula:

where,

b represents the read length of the sequencing-by-synthesis reaction.

t is the length of the motif sequence;

Optionally, a is an integer not greater than k, and/or d is a plurality of consecutive integers in the range of 0 to 10;

Optionally, the solid surface further carries the first sequencing primer, as shown below:

L-(S0)x, where,

The first sequencing primer is attached to the solid surface using the L;

Optionally, x is 3–20 or 3–10;

Optionally, x is an integer;

Optionally, the L does not contain oligonucleotides;

Optionally, the L comprises an oligonucleotide;

Optionally, the L comprises at least one of polynucleotides, polyamino acids, polyethylene glycol, and polyethylene glycol-polyamino acid copolymers.

4. The method according to any one of claims 1-3, wherein the length of the first sequencing primer is greater than 20 nt, and/or the length of the second sequencing primer is greater than 20 nt;

Optionally, the library does not need to be amplified on the solid surface before performing sequencing-by-synthesis on the solid surface;

Optionally, the sequencing-by-synthesis is performed on a single-molecule sequencing platform;

Optionally, the first sequencing primer and the second sequencing primer are independently immobilized on the same or different solid-phase surfaces;

Optionally, for a given short tandem repeat sequence, the sequencing data includes first sequencing data corresponding to the first sequencing primer and/or second sequencing data corresponding to the second sequencing primer, wherein the number of reads contained in the first sequencing data is not less than the number of reads contained in the second sequencing data.

Optionally, for a given short tandem repeat sequence, the sequencing data contains at least 100,000, at least 1 million, or at least 10 million reads.

5. The method according to any one of claims 1-4, characterized in that the method further comprises:

Step 10: Align the sequencing data with a reference sequence to classify the multiple reads into test-through reads and non-test-through reads. The test-through reads contain at least a portion of the specific sequence or its complementary sequence, while the non-test-through reads do not contain the specific sequence or its complementary sequence.

Step 12: Based on the test-through reads and the non-test-through reads, determine the number of motif sequences contained in the short tandem repeat sequence;

Optionally, step 10 includes:

Step 102: Compare the read with the first reference sequence; if the read has a segment that matches the first reference sequence and the segment length is not less than 10 bp, the read is indicated to be a test-through read;

Step 104: Compare the read with the second reference sequence; if the read has a segment that matches the second reference sequence and the segment length is greater than or equal to 7 bp, the read is indicated to be a non-test-through read;

The first reference sequence comprises at least a portion of the specific sequence or its complementary sequence, and the second reference sequence is (S0)p, where p is an integer from 2 to 100;

Optionally, the error rate of the comparison is set to be less than or equal to 0.1 or 0.2;

Optionally, for the sequencing data corresponding to a given short tandem repeat sequence generated by the first sequencing primer, step 12 includes:

Step 122: Determine the number of the test-through reads and, optionally, the number of the non-test-through reads;

Step 124: Based on the number of the test-through reads, determine the number of motif sequences contained in the short tandem repeat sequence;

Optionally, step 124 includes:

Step 1242: Based on the number of the test-through reads, determine the proportion of the test-through reads in the sequencing data corresponding to the given short tandem repeat sequence generated by the first sequencing primer;

Step 1244: Based on the ratio, determine the number of motif sequences contained in the short tandem repeat sequence;

Optionally, step 1244 includes,

Based on the ratio, the number of motif sequences contained in the short tandem repeat sequence is determined according to a predetermined ratio-motif number standard relationship;

Optionally, for the sequencing data corresponding to a given short tandem repeat sequence generated by the second sequencing primer, step 12 includes:

Step 126: Based on the second sequencing primer corresponding to the test-through read and the sequence information of the test-through read, determine the sequence of the short tandem repeat sequence.

6. A method for determining short tandem repeat sequences, characterized in that it comprises:

(i) Obtain sequencing data of the short tandem repeat sequence, wherein the sequencing data is obtained by the method according to any one of claims 1 to 5;

(ii) The sequencing data is aligned with a reference sequence to classify the plurality of reads into test-through reads and non-test-through reads, wherein the test-through reads contain at least a portion of the specific sequence or a complementary sequence of the specific sequence, and the non-test-through reads do not contain the specific sequence or a complementary sequence of the specific sequence; and

(iii) Determine the short tandem repeat sequence based on the test-through reads and the non-test-through reads.

7. The method according to claim 6, characterized in that, in step (ii), the test-through reads and the non-test-through reads are determined by the following method:

(ii-1) The read is compared with the first reference sequence, wherein the read having a segment that matches the first reference sequence, with a segment length of not less than 10 bp, is an indication that the read is a test-through read;

(ii-2) The read is compared with the second reference sequence, wherein the read having a segment that matches the second reference sequence, with a segment length greater than or equal to 7 bp, is an indication that the read is a non-test-through read;

The first reference sequence comprises at least a portion of the specific sequence or its complementary sequence; the second reference sequence is (S0)p, where p is an integer from 2 to 100;

Optionally, for the sequencing data generated by the given first sequencing primer, step (iii) further includes:

(iii-1) Determine the number of the test-through reads and, optionally, the number of the non-test-through reads;

(iii-2) Based on the number of the test-through reads, determine the number of the motif sequences in the short tandem repeat sequence;

Optionally, step (iii-2) further includes:

(iii-2-1) Based on the number of test-through reads, determine the proportion of the test-through reads in the sequencing data corresponding to a given short tandem repeat sequence generated by the first sequencing primer;

(iii-2-2) Based on the proportion, determine the number of motif sequences contained in the short tandem repeat sequence;

Optionally, in step (iii-2-2), the number of motif sequences contained in the short tandem repeat sequence is determined from the proportion according to a predetermined standard relationship between the proportion and the number of motif sequences;

Optionally, step (iii) further includes, for the sequencing data corresponding to a given short tandem repeat sequence generated by the second sequencing primer:

(iii-a) Based on the second sequencing primer corresponding to the test-break read and the sequence information of the test-break read, determine the number of motif sequences contained in the short tandem repeat sequence.

8. A chip, characterized in that it comprises:

A substrate, having a surface;

A first sequencing primer, immobilized on the surface and having the following nucleotide sequence:

(S0)x,

Wherein, S0 represents the motif sequence of the short tandem repeat sequence or its complementary sequence, and

x is not less than 1.

9. The chip according to claim 8, wherein the first sequencing primer has the following structure:

L-(S0)x,

Wherein, L represents a linker group, used to immobilize the first sequencing primer on the chip;

Optionally, x is 3–20 or 3–10;

Optionally, x is an integer;

Optionally, the chip further includes a second sequencing primer, which is immobilized on the surface, and the second sequencing primer has the following nucleotide sequence:

S1-(S0)y, where,

S1 is the specific sequence or a sequence corresponding to the specific sequence, S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence, and the specific sequence can indicate the position of the short tandem repeat sequence on the reference sequence,

y is greater than or equal to 0;

Optionally, the second sequencing primer connected to the surface has the following structure:

L-S1-(S0)y, wherein the second sequencing primer is immobilized on the surface by means of L;

Optionally, S1 is derived from at least one of the following:

(a) The upstream sequence of the short tandem repeat sequence;

(b) the downstream sequence of the short tandem repeat sequence; and

(c) a complementary sequence of (a) or (b);

Optionally, for a given short tandem repeat sequence, the chip carries multiple second sequencing primers, each of which has a different y value;

Optionally, the value of y is determined based on the following formula:

y = a + k × d, where,

a is a first predetermined constant, and a is an integer,

d takes a plurality of integer values in the range of 0 to 100, and

k is a second predetermined constant;

Optionally, in the scenario of using the chip to perform sequencing-by-synthesis to determine the short tandem repeat sequence, k is determined based on the read length of the sequencing-by-synthesis and/or the length of the motif sequence;

Optionally, k is determined based on the following formula:

[formula not reproduced in the source text]

wherein,

b represents the read length of the sequencing-by-synthesis reaction.

t is the length of the motif sequence;

Optionally, L does not contain an oligonucleotide;

Optionally, L comprises an oligonucleotide;

Optionally, the chip is used in a single-molecule sequencing platform.
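
The primer layouts (S0)x and S1-(S0)y, together with the relation y = a + k × d, can be illustrated with a short sketch. It is a hedged example only: the linker L is omitted, sequences are assumed to be written 5' to 3', k is passed in as a plain parameter because its formula is not reproduced in this text, and all function names and example sequences are hypothetical.

```python
from typing import Iterable, List


def first_primer(motif: str, x: int) -> str:
    """Build the first sequencing primer (S0)x; x must be at least 1."""
    if x < 1:
        raise ValueError("x must be not less than 1")
    return motif * x


def second_primer_y_values(a: int, k: int, d_values: Iterable[int]) -> List[int]:
    """Compute y = a + k * d for each d, with d restricted to 0..100."""
    ys = []
    for d in d_values:
        if not 0 <= d <= 100:
            raise ValueError("d must lie in the range 0 to 100")
        y = a + k * d
        if y < 0:
            raise ValueError("y must be greater than or equal to 0")
        ys.append(y)
    return ys


def second_primers(s1: str, motif: str, a: int, k: int,
                   d_values: Iterable[int]) -> List[str]:
    """Build one second sequencing primer S1-(S0)y per y value."""
    return [s1 + motif * y for y in second_primer_y_values(a, k, d_values)]
```

For example, `second_primers("ACGT", "GATA", 1, 2, [0, 1])` yields two primers with y = 1 and y = 3, which corresponds to a chip carrying several second sequencing primers with different y values for the same short tandem repeat sequence.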

10. A chip, characterized in that it comprises:

A substrate, having a surface;

A second sequencing primer, immobilized on the surface and having the following nucleotide sequence:

S1-(S0)y,

Wherein, S1 is the specific sequence or a sequence corresponding to the specific sequence, S0 is the motif sequence of the short tandem repeat sequence or its complementary sequence, and the specific sequence can indicate the position of the short tandem repeat sequence on the reference sequence, and

y is greater than or equal to 0.

11. The chip according to claim 10, wherein S1 is derived from at least one of the following:

(a) The upstream sequence of the short tandem repeat sequence;

(b) the downstream sequence of the short tandem repeat sequence; and

(c) a complementary sequence of (a) or (b);

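Optionally, for a given short tandem repeat sequence, the chip carries multiple second sequencing primers, each of which has a different y value;

Optionally, the value of y is determined based on the following formula:

y = a + k × d, where,

a is a first predetermined constant, and a is an integer,

d takes a plurality of integer values in the range of 0 to 100, and

k is a second predetermined constant;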

Optionally, in the scenario of using the chip to perform sequencing-by-synthesis to determine the short tandem repeat sequence, k is determined based on the read length of the sequencing-by-synthesis reaction and/or the length of the motif sequence;

Optionally, k is determined based on the following formula:

[formula not reproduced in the source text]

wherein,

b represents the read length of the sequencing-by-synthesis reaction.

t is the length of the motif sequence;

Optionally, the chip further includes a first sequencing primer connected to the surface, and the first sequencing primer has the following nucleotide sequence:

(S0)x, where,

S0 represents the motif sequence of the short tandem repeat sequence or its complementary sequence.

x is not less than 1;

Optionally, the first sequencing primer connected to the surface has the following structure:

L-(S0)x, wherein the first sequencing primer is immobilized on the surface by means of L;

Optionally, x is 3–20 or 3–10;

Optionally, x is an integer;

Optionally, L does not contain an oligonucleotide;

Optionally, L comprises an oligonucleotide.

12. A reagent kit, characterized in that it comprises:

(1) the first sequencing primer and/or the second sequencing primer as described in any one of claims 1-7; or

(2) The chip as described in any one of claims 8-11.

13. A method for identifying an individual, characterized in that it comprises:

Determining, by the method according to any one of claims 1-7, the sequences of a plurality of given short tandem repeat sequences in a test sample, the test sample comprising nucleic acids; and

Determining, based on the sequences, the individual or individuals from which the test sample originates.
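
One simple way to picture the comparison step of claim 13 is an exact match between the short tandem repeat profile of the test sample and reference profiles. The sketch below is an assumption-laden illustration rather than the claimed procedure: a profile is modeled as a mapping from locus name to the repeat counts of its alleles, the allele values shown in the usage note are made up, mixtures of several individuals are not handled, and the names `Profile` and `matching_individuals` are hypothetical.

```python
from typing import Dict, List, Tuple

# A profile maps a locus name to the repeat counts observed for its alleles.
Profile = Dict[str, Tuple[int, ...]]


def matching_individuals(sample: Profile, references: Dict[str, Profile]) -> List[str]:
    """Return the reference individuals whose alleles match the sample
    at every locus typed in both profiles (allele order is ignored)."""
    matches = []
    for name, ref in references.items():
        shared = sample.keys() & ref.keys()
        if shared and all(sorted(sample[loc]) == sorted(ref[loc]) for loc in shared):
            matches.append(name)
    return matches
```

With a sample profile `{"TH01": (6, 9), "D8S1179": (13, 14)}` and two reference profiles that differ only at TH01, the function returns the single reference whose alleles agree at both loci. A practical system would typically weigh matches statistically instead of requiring exact agreement.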