CN105069325B - Method for matching nucleic acid sequence information - Google Patents
- Publication number
- CN105069325B (CN201510482636.9A)
- Authority
- CN
- China
- Prior art keywords
- nucleic acid
- acid sequence
- matching
- database
- reference sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the field of information processing and provides a method for matching nucleic acid sequence information. The method comprises the following steps: A. performing a BWT transformation on a reference sequence in a database to obtain a matching reference sequence, and storing the matching reference sequence in the database; B. marking the matching reference sequence in the database at intervals; C. performing consistency matching of the nucleic acid sequence fragments, one by one, against the matching reference sequence in the database to obtain matched nucleic acid sequences. The method enables fast matching of nucleic acid sequence information against reference sequences.
Description
This application is a divisional application of application No. 201210263634.7, filed on July 8, 2012, entitled "A System and Method for Matching Nucleic Acid Sequence Information".
Technical Field
The present invention relates to the field of information processing, and more specifically to a system and method for matching nucleic acid sequence information.
Background Art
American scientists proposed the Human Genome Project in 1985. Through the joint efforts of scientists from the United States, the United Kingdom, France, Germany, Japan and China, the "working draft" of the human genome was completed in 2000, and the human genome map and preliminary analysis results were published in 2001. The project also covered the creation of computer analysis and management systems (that is, systems that process sequencing results to obtain nucleic acid sequence information) and the examination of related ethical, legal and social issues. After the human genome map was published, researchers in China and abroad began actively mapping the genomes of other species. By comparing nucleic acid sequence information with an existing genome map (the reference sequence), and by analyzing gene expression profiles, gene mutations and the like with transcriptomics, proteomics and related techniques, information about disease-related genes can be obtained. Matching and analyzing nucleic acid sequence information against genome maps to reveal the root causes of disease has become a major concern in biochemistry and medicine, and gene sequencing technology is developing rapidly worldwide. Extracting genetic information accurately and quickly from the vast volume of sequencing data, however, has become the bottleneck of current gene sequencing technology.
A system for matching nucleic acid sequence information uses a computer to match the nucleic acid sequence fragments obtained by sequencing against a known reference sequence, that is, to align them one by one, so that subsequent analysis can be performed on the matching results. A method for matching nucleic acid sequence information is the process by which such a system performs the matching.
In the prior art, one method for matching nucleic acid sequence information comprises the following steps: A. according to the number n of allowed mismatches, dividing each nucleic acid sequence fragment into at least n+1 short fragments that take part in matching, to obtain a database of short fragments; B. building and storing a reference sequence index according to the length of the short fragments, to obtain a database; C. matching each short fragment of each nucleic acid sequence fragment separately in the database, to obtain the matching result. Because the reference sequence index entries are all of equal length, probability dictates that some entries will be exactly identical. In this scheme each short fragment is matched against the index entries in turn, so it must be matched against every entry, including each of several identical entries, which greatly reduces processing speed. Both the reference sequence and the nucleic acid sequences must also be segmented, which further increases the workload and further reduces speed. In addition, the index built from the reference sequence and the short fragments built from the nucleic acid sequences produce a large amount of data, which increases the storage space the information processing device requires.
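The splitting in step A of this prior-art scheme rests on the pigeonhole principle: if a fragment differs from the reference in at most n bases, at least one of n+1 non-overlapping segments must match exactly. A minimal Python sketch of the splitting step follows; the function name `split_fragment` and the near-equal segment lengths are assumptions made for illustration, not details given in the source.

```python
def split_fragment(fragment: str, allowed_mismatches: int) -> list[str]:
    """Split a fragment into allowed_mismatches + 1 contiguous segments.

    Segment lengths differ by at most one base (an illustrative choice).
    """
    parts = allowed_mismatches + 1
    base_len, extra = divmod(len(fragment), parts)
    segments, start = [], 0
    for i in range(parts):
        length = base_len + (1 if i < extra else 0)
        segments.append(fragment[start:start + length])
        start += length
    return segments


segments = split_fragment("ACCACCTGAC", 2)
print(segments)  # ['ACCA', 'CCT', 'GAC']
```

Every segment then has to be looked up separately, which is the extra work the description above attributes to this scheme.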
A new system and method for matching nucleic acid sequence information is therefore needed, one that can match nucleic acid sequences against a reference sequence quickly.
Summary of the Invention
The object of the present invention is to provide a system and method for matching nucleic acid sequence information that solves the prior-art problem of slow matching of nucleic acid sequence information against a reference sequence.
To achieve this object, a system for matching nucleic acid sequence information comprises a database, a reference sequence transformation unit, a marking unit and a matching unit. The database stores the reference sequence. The reference sequence transformation unit performs a BWT transformation on the reference sequence in the database to obtain a matching reference sequence. The marking unit marks the matching reference sequence in the database at intervals. The matching unit performs consistency matching of the nucleic acid sequence fragments, one by one, against the matching reference sequence in the database to obtain matched nucleic acid sequences.
Consistency matching covers both the case where mismatches are allowed and the case where they are not. When N mismatches are allowed, a nucleic acid sequence fragment is a consistency match if at most N of its bases differ from the matching reference sequence in the database; when no mismatches are allowed, a fragment is a consistency match only if it is identical to the matching reference sequence in the database. N is a positive integer.
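The definition of consistency matching amounts to a bounded count of differing bases between a fragment and an equally long stretch of the reference. The sketch below shows that check in Python under this reading; the names `mismatch_positions` and `is_consistency_match` are hypothetical, and the comparison is done directly on strings rather than on the BWT-transformed reference.

```python
def mismatch_positions(fragment: str, reference_window: str) -> list[int]:
    """Return the 0-based positions where the two equal-length sequences differ."""
    if len(fragment) != len(reference_window):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(fragment, reference_window)) if a != b]


def is_consistency_match(fragment: str, reference_window: str,
                         allowed_mismatches: int = 0) -> bool:
    """True if at most allowed_mismatches bases differ (0 means exact match)."""
    return len(mismatch_positions(fragment, reference_window)) <= allowed_mismatches


print(is_consistency_match("ACCT", "ACCA", allowed_mismatches=1))  # True
print(is_consistency_match("ACCT", "ACCA"))                        # False
```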
The reference sequence transformation unit comprises a reference sequence matrix module and a BWT matrix module. The reference sequence matrix module adds an identifier to the end or the front of the reference sequence in the database and cyclically shifts the sequence to obtain a reference sequence matrix. The BWT matrix module sorts the reference sequence matrix in lexicographic order to obtain a BWT reference sequence matrix. The reference sequence transformation unit may further comprise a matching reference sequence module, which takes the first and last columns of the BWT reference sequence matrix as the matching reference sequence and stores them in the database.
The marking unit marks the matching reference sequence in the database at intervals given by an arithmetic progression.
Further, within each interval of that arithmetic progression, the marking unit uses a second arithmetic progression to mark the matching reference sequence in the database more finely.
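The two levels of marking can be pictured as coarse marks at positions 0, d, 2d, ... with finer marks at a smaller step inside each coarse interval. The source does not say what each mark stores, so the sketch below makes an explicit assumption: each mark holds running base counts of the last column of the sorted matrix, as is common in BWT-based indexes. The names `build_marks` and `rank` and the step sizes are hypothetical.

```python
def build_marks(last_column: str, coarse: int = 8, fine: int = 4):
    """Record base counts at two levels of regularly spaced positions.

    Coarse marks (every `coarse` positions) hold counts since position 0.
    Fine marks (every `fine` positions) hold counts since the enclosing
    coarse mark, so they stay small and could use a narrower data type.
    """
    if coarse % fine != 0:
        raise ValueError("coarse must be a multiple of fine")
    alphabet = sorted(set(last_column))
    total = dict.fromkeys(alphabet, 0)          # counts since position 0
    since_coarse = dict.fromkeys(alphabet, 0)   # counts since last coarse mark
    coarse_marks, fine_marks = [], []
    for i in range(len(last_column) + 1):
        if i % coarse == 0:
            coarse_marks.append(dict(total))
            since_coarse = dict.fromkeys(alphabet, 0)
        if i % fine == 0:
            fine_marks.append(dict(since_coarse))
        if i < len(last_column):
            total[last_column[i]] += 1
            since_coarse[last_column[i]] += 1
    return coarse_marks, fine_marks


def rank(last_column, marks, base, i, coarse=8, fine=4):
    """Count occurrences of `base` in last_column[:i] using the nearest marks."""
    coarse_marks, fine_marks = marks
    fine_start = (i // fine) * fine
    return (coarse_marks[i // coarse].get(base, 0)
            + fine_marks[i // fine].get(base, 0)
            + last_column[fine_start:i].count(base))  # scan at most fine-1 bases


column = "G$CCAACTC"  # last column of the sorted rotations of ACCACCTG$
marks = build_marks(column, coarse=4, fine=2)
print(rank(column, marks, "C", 7, coarse=4, fine=2))  # 3
```

Because only every d-th position carries a mark, storage grows with the number of marks rather than with the sequence length, at the cost of a short scan between marks.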
In any of the above technical solutions, the matching unit forms the reverse complement of each nucleic acid sequence fragment and performs consistency matching of the reverse-complemented fragment against the matching reference sequence in the database to obtain matched nucleic acid sequences.
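Forming a reverse complement means replacing each base with its pairing partner and reversing the order. A minimal Python sketch follows; the function name `reverse_complement` and the `rna` flag are illustrative assumptions, and only unambiguous upper-case bases are handled.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(fragment: str, rna: bool = False) -> str:
    """Complement every base, then reverse the sequence."""
    table = RNA_COMPLEMENT if rna else DNA_COMPLEMENT
    return fragment.translate(table)[::-1]


print(reverse_complement("ACCT"))            # AGGT
print(reverse_complement("UUCC", rna=True))  # GGAA
```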
The matching unit uses backtracking: it substitutes bases, in turn, at positions preceding the position where the reverse-complemented fragment fails to match, and continues matching in the database from the substituted position.
In any of the above technical solutions, the system further comprises an information receiving unit, which obtains the nucleic acid sequence fragments and the reference sequence through a USB interface, an optical disc drive interface or the Internet.
To better realize the present invention, the invention also includes a method for matching nucleic acid sequence information.
The method comprises the following steps: A. performing a BWT transformation on the reference sequence in the database to obtain a matching reference sequence, and storing the matching reference sequence in the database; B. marking the matching reference sequence in the database at intervals; C. performing consistency matching of the nucleic acid sequence fragments, one by one, against the matching reference sequence in the database to obtain matched nucleic acid sequences. The reference sequence is stored in the database, and steps A and B each operate on the reference sequence in the database.
Consistency matching covers both the case where mismatches are allowed and the case where they are not. When N mismatches are allowed, a nucleic acid sequence fragment is a consistency match if at most N of its bases differ from the matching reference sequence in the database; when no mismatches are allowed, a fragment is a consistency match only if it is identical to the matching reference sequence in the database. N is a positive integer.
Step A comprises: A1. adding an identifier to the end or the front of the reference sequence in the database and cyclically shifting the sequence to obtain a reference sequence matrix; A2. sorting the reference sequence matrix in lexicographic order to obtain a BWT reference sequence matrix, and storing it in the database. Step A2 may be followed by step A3: taking the first and last columns of the BWT reference sequence matrix as the matching reference sequence and storing them in the database.
In step B, the matching reference sequence in the database is marked at intervals given by an arithmetic progression.
In step B, within each interval of that arithmetic progression, a second arithmetic progression is used to mark the matching reference sequence in the database more finely.
In any of the above technical solutions, step C consists of forming the reverse complement of each nucleic acid sequence fragment and then performing consistency matching of the reverse-complemented fragment against the matching reference sequence in the database to obtain matched nucleic acid sequences.
In step C, when mismatches are allowed, backtracking is used: bases are substituted, in turn, at positions preceding the position where the reverse-complemented fragment fails to match, and matching continues in the database from the substituted position.
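One way to picture backtracking with base substitution is a depth-first search over the fragment, extended one base at a time against the first and last columns of the sorted rotation matrix, with a budget of allowed mismatches. The Python sketch below is a simplification under stated assumptions: it tries every substitution at every position instead of only after a failed extension, it counts bases in the last column by direct scanning, and the names `build_index` and `search_with_mismatches` are hypothetical. It is not presented as the exact procedure of the patent.

```python
def build_index(reference: str, marker: str = "$"):
    """Return (row -> reference position, last column, first row of each base)."""
    text = reference + marker
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in order)   # last column of sorted rotations
    starts = {}
    for row, ch in enumerate(sorted(text)):      # sorted(text) is the first column
        starts.setdefault(ch, row)
    return order, last, starts


def search_with_mismatches(fragment: str, index, allowed_mismatches: int = 0):
    """Return sorted (reference position, mismatch count) pairs for all hits."""
    order, last, starts = index
    hits = set()

    def extend(pos, lo, hi, budget):
        # Rows lo..hi-1 start with the (possibly substituted) suffix fragment[pos+1:].
        if pos < 0:
            for row in range(lo, hi):
                hits.add((order[row], allowed_mismatches - budget))
            return
        for base in "ACGT":
            cost = 0 if base == fragment[pos] else 1   # substitution uses budget
            if cost > budget or base not in starts:
                continue
            new_lo = starts[base] + last[:lo].count(base)
            new_hi = starts[base] + last[:hi].count(base)
            if new_lo < new_hi:                        # otherwise backtrack
                extend(pos - 1, new_lo, new_hi, budget - cost)

    extend(len(fragment) - 1, 0, len(last), allowed_mismatches)
    return sorted(hits)


index = build_index("ACCACCTG")
print(search_with_mismatches("ACC", index))      # [(0, 0), (3, 0)]
print(search_with_mismatches("ACT", index, 1))   # [(0, 1), (3, 1), (4, 1)]
```

Each hit reports a 0-based start position on the reference and the number of substituted bases, which corresponds to the kind of output (position and mismatch count) a matcher of this type would need to produce.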
As can be seen from the above, in the present invention the nucleic acid sequence fragments need not be segmented and are matched directly in the database. Moreover, a fragment need not be matched separately against each of several identical matching reference sequences; one match covers all identical sequences, which raises overall processing speed. In addition, no reference sequence index has to be built for the reference sequence in the database, and the matching reference sequence need not be marked at every position, which greatly reduces the storage space the system requires.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of a system for matching nucleic acid sequence information in one embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a system for matching nucleic acid sequence information in another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the reference sequence transformation unit in one embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the reference sequence transformation unit in another embodiment of the present invention.
Fig. 5 is a flowchart of a method for matching nucleic acid sequence fragments in one embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a system for matching nucleic acid sequence information in another embodiment of the present invention.
Fig. 7 is a flowchart of a method for transforming a reference sequence in one embodiment of the present invention.
Fig. 8 is a flowchart of a method for matching nucleic acid sequence fragments in one embodiment of the present invention.
Fig. 9 is a schematic diagram of consistency matching of forward nucleic acid sequence fragments in one embodiment of the present invention.
Fig. 10 is a schematic diagram of consistency matching of reverse nucleic acid sequence fragments in one embodiment of the present invention.
Fig. 11 is a schematic diagram of matching nucleic acid sequence fragments in one embodiment of the present invention.
Fig. 12 is a schematic diagram of matching nucleic acid sequence fragments in one embodiment of the present invention.
Detailed Description
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.
For convenience in describing the technical solution, the nucleic acid sequence fragments and reference sequences in the following embodiments are given only as short base sequences and do not represent realistic fragments or reference sequences. Fragments are typically 20 bp or longer and reference sequences 2000 bp or longer, although fragments shorter than 20 bp and reference sequences shorter than 2000 bp also occur.
The nucleic acid sequence fragments of the present invention are generally obtained by sequencing a species, but may also be obtained by artificial synthesis, that is, as artificial sequences. The reference sequence is a known nucleic acid sequence that serves as the template for matching; the fragments are matched against it, and the matching result provides information such as whether the sequencing is accurate. The fragments are not specially restricted and may be composed of the bases A, G, C, T or A, G, C, U, for example ATTACGTTA or UUCCUCAAGGU.
The present invention provides a first embodiment. As shown in Fig. 1, the system for matching nucleic acid sequence information comprises a database, a reference sequence transformation unit, a marking unit and a matching unit, each described in detail below.
(1) Database 1, which stores the reference sequence.
The reference sequence stored in the database may be held inside the system or outside it. The reference sequence is a base sequence, that is, nucleic acid sequence information. The reference sequence and the nucleic acid sequence fragments belong to the same species: for example, if the fragments were obtained by sequencing Paramecium nucleic acid, the corresponding reference sequence is the Paramecium nucleic acid sequence. Both may also be artificial sequences. Neither the reference sequence nor the fragments are specially restricted, except that the reference sequence is a known base sequence.
(2) Reference sequence transformation unit 2, which performs a BWT transformation on the reference sequence in the database to obtain a matching reference sequence.
The BWT transformation is a method that Mike Burrows developed from a transformation idea proposed by David Wheeler and successfully applied to practical data compression; it is an active research topic in lossless compression. The BWT is a reversible data transformation that operates on blocks of data.
After the reference sequence transformation unit has performed the BWT transformation on the reference sequence in the database, the resulting matching reference sequence is stored in the database automatically.
(3) Marking unit 3, which marks the matching reference sequence in the database at intervals.
The manner of interval marking is not limited: marks may follow an arithmetic progression or any other regular sequence. The data type used for the marks, such as Int or Byte, can be chosen as needed.
(4) Matching unit 4, which performs consistency matching of the nucleic acid sequence fragments, one by one, against the matching reference sequence in the database to obtain matched nucleic acid sequences.
Consistency matching covers both the case where mismatches are allowed and the case where they are not. When N mismatches are allowed, a nucleic acid sequence fragment is a consistency match if at most N of its bases differ from the matching reference sequence in the database; when no mismatches are allowed, a fragment is a consistency match only if it is identical to the matching reference sequence in the database. N is a positive integer.
The nucleic acid sequence fragments may be stored inside the system or on a memory outside the system. Each whole fragment is matched directly against the matching reference sequence in the database, or is matched from its head and tail ends simultaneously. When N mismatches are allowed, a whole fragment with at most N bases that fail to match the matching reference sequence is considered matched and yields one matched nucleic acid sequence fragment; otherwise the fragment is considered unmatched and is discarded. All other fragments are matched in the database in the same way, yielding the matched nucleic acid sequences. The matched sequences can be output in readable form or stored in the system. The output may include, for each fragment, the start and end positions on the reference sequence, the positions of the mismatches and the number of mismatches.
In this embodiment, the system may comprise a computer and a program on that computer for matching nucleic acid sequence information. To match, the reference sequence transformation unit first performs a BWT transformation on the reference sequence in the database; the marking unit then marks the BWT-transformed reference sequence at intervals; finally the matching unit performs consistency matching of the fragments, one by one, in the database. In this technical solution whole fragments are matched directly in the database, and identical matching reference sequences are matched only once, which improves matching efficiency. Moreover, the reference sequence stored in the database need not be segmented to build a reference sequence index (if index entries have length K, then in any two adjacent entries the last K-1 bases of the first are identical to the first K-1 bases of the second), and it is marked only at intervals, which greatly reduces storage space compared with the prior art.
Based on the first embodiment, the invention provides a second embodiment, in which the system comprises a computer and, on it, a program for matching nucleic acid sequence information; the computer may also carry a program for controlling sequencers. As shown in Fig. 2, the computer is connected to several sequencers, receives the sequencing data they produce and processes the data to obtain nucleic acid sequence fragments. The fragments may be obtained by processing sequencing data from any commercially available sequencer. Preferably, they are obtained by processing sequencing data produced by Pstar series sequencers, MiSeq series sequencers, GS Junior/Senior sequencers and SOLID sequencers. More preferably, they are obtained by processing sequencing data produced by Pstar series sequencers. The computer is any commercially available information processing device with information processing and data storage functions.
It should be noted that the nucleic acid sequence fragments in the computer may be obtained by receiving and processing sequencing data from a sequencer, may be stored directly in the computer, or may be received directly by the computer from outside; the source of the fragments is not specially restricted.
The reference sequence transformation unit of the above embodiments is described in further detail below. As shown in Fig. 3, it comprises a reference sequence matrix module and a BWT matrix module, each described below.
(1) Reference sequence matrix module 21, which adds an identifier to the end or the front of the reference sequence in the database and cyclically shifts the sequence to obtain a reference sequence matrix.
To make the working principle of this module easier to follow, an example is given. Reference sequences are generally long, typically from several thousand to several hundred million bases or more; the example here is only an aid to understanding and is not a realistic reference sequence. Suppose the reference sequence is ACCACCTG. An identifier is first added to the front or the end of the sequence; the identifier symbol is not restricted and serves only to distinguish the two ends of the sequence. Here the identifier $ is appended to the end, giving ACCACCTG$. The sequence is then shifted cyclically to obtain the reference sequence matrix, shown in Table 1.
Table 1
Table 1 is the reference sequence matrix obtained from the reference sequence ACCACCTG by the reference sequence matrix module. A, G, C and T denote the corresponding bases.
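The construction of the reference sequence matrix can be sketched in a few lines of Python. This is an illustration only: the function name `rotation_matrix` is hypothetical, and the shift direction (moving the leading base to the end on each step) is an assumption, since the text does not state it.

```python
def rotation_matrix(reference: str, identifier: str = "$") -> list[str]:
    """Append the identifier, then list every cyclic shift of the sequence."""
    text = reference + identifier
    return [text[i:] + text[:i] for i in range(len(text))]


for row in rotation_matrix("ACCACCTG"):
    print(row)
```

The matrix has one row per position of the marked sequence, so ACCACCTG$ yields nine rows of nine characters each.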
(2) BWT matrix module 22, which sorts the reference sequence matrix in lexicographic order to obtain the BWT reference sequence matrix.
Taking the reference sequence matrix in Table 1 as an example, and assuming that the $ identifier sorts before A, G, C and T, sorting the matrix in lexicographic order gives the BWT reference sequence matrix shown in Table 2.
Table 2
The BWT reference sequence matrix is stored in the database. Lexicographic order here means the alphabetical lookup order A, B, C, ..., Z used in a dictionary.
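Sorting the cyclic shifts can be sketched as follows. The function name `bwt_matrix` is hypothetical; the sketch relies on Python's default string ordering, in which `$` sorts before the letters A, C, G and T, matching the assumption stated above.

```python
def bwt_matrix(reference: str, identifier: str = "$") -> list[str]:
    """Return all cyclic shifts of reference + identifier in lexicographic order."""
    text = reference + identifier
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)  # '$' (0x24) compares lower than 'A'..'T'


for row in bwt_matrix("ACCACCTG"):
    print(row)
# $ACCACCTG
# ACCACCTG$
# ACCTG$ACC
# CACCTG$AC
# CCACCTG$A
# CCTG$ACCA
# CTG$ACCAC
# G$ACCACCT
# TG$ACCACC
```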
In the above technical solution, a nucleic acid sequence fragment needs to be matched only once against identical stretches of the reference sequence. The reason is that, after the BWT transformation, adjacent rows of the BWT reference sequence matrix in the database share their longest common prefixes. During matching, if a nucleic acid sequence fragment of length m matches row r, then every row whose longest common prefix with row r is at least m also matches. It suffices to determine the longest common prefix; no further consistency matching of the fragment is needed, so the fragment is matched only once. For example, if the fragment is ACC (length 3), the second and third rows of the BWT reference sequence matrix have a longest common prefix of length 3, namely ACC, so comparing against the common prefix completes the matching of the fragment in the database. This technical solution greatly improves the efficiency of nucleic acid sequence fragment matching.
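Because rows that share a prefix are adjacent after sorting, the whole block of matching rows can be located with two binary searches. The sketch below illustrates this with Python's standard `bisect` module; the helper names are hypothetical, and the use of `~` as an upper sentinel is an assumption that holds only because `~` sorts after `$`, `A`, `C`, `G` and `T`.

```python
from bisect import bisect_left, bisect_right


def sorted_rotations(reference: str, marker: str = "$") -> list[str]:
    text = reference + marker
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def prefix_row_range(rows: list[str], fragment: str) -> tuple[int, int]:
    """Return the half-open 0-based range [lo, hi) of rows starting with fragment."""
    lo = bisect_left(rows, fragment)
    # fragment + "~" is larger than every row that begins with fragment.
    hi = bisect_right(rows, fragment + "~")
    return lo, hi


rows = sorted_rotations("ACCACCTG")
lo, hi = prefix_row_range(rows, "ACC")
print([r + 1 for r in range(lo, hi)])  # 1-based row numbers of the block
```

For the fragment ACC this yields rows 2 and 3, the two rows named in the example above.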
As shown in Figure 4, the reference sequence transformation unit may further include a matching reference sequence module, described in detail below. The matching reference sequence module 23 is used to extract the first column and the last column of the BWT reference sequence matrix to obtain the matching reference sequence, which is stored in the database.
To save storage space, an auxiliary matrix is used. Taking the matrix in Table 2 as an example, the resulting matching reference sequence is shown in Table 3.
Table 3
The database corresponding to Table 3 is more compact, which greatly reduces its storage requirements.
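A minimal sketch of extracting the first and last columns is shown below. The function name `first_last_columns` is hypothetical, and the exact layout of Table 3 is not reproduced here; the sketch only shows that two strings of length X + 1 replace the (X + 1) × (X + 1) matrix.

```python
def first_last_columns(reference: str, marker: str = "$") -> tuple[str, str]:
    """Return (first column, last column) of the sorted rotation matrix."""
    text = reference + marker
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rows)
    last = "".join(row[-1] for row in rows)
    return first, last


first, last = first_last_columns("ACCACCTG")
print(first)
print(last)
```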
For faster lookup, a further auxiliary matrix can be used. Continuing with the matrix in Table 2 as an example, the matching reference sequence information stored in the database is shown in Table 4.
Table 4
In the table, column 3 gives the position in the reference sequence of each entry of the first column, and column 4 is the first column of the reference sequence matrix. When a nucleic acid sequence fragment is subsequently matched, its position in the reference sequence can be read off directly, which makes the database easier to use and improves the efficiency of subsequent fragment matching.
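The position column can be sketched as the start offset of each sorted rotation. The function name `position_column` is hypothetical, and 1-based positions are an assumption chosen to agree with the "first and fourth positions" wording used later in the description.

```python
def position_column(reference: str, marker: str = "$") -> list[int]:
    """For each row of the sorted rotation matrix, return the 1-based
    position in reference + marker at which that row starts."""
    text = reference + marker
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    return [i + 1 for i in order]


positions = position_column("ACCACCTG")
print(positions)
```

Rows 2 and 3 of the sorted matrix, which both begin with ACC, carry positions 1 and 4 in this column.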
The reference sequence transformation unit in the second embodiment processes the database so that it is easier to use and nucleic acid sequence fragment matching is more efficient; compared with prior-art databases, it also greatly saves storage space. As a whole, this technical solution overcomes the slow sequence matching and large storage requirements of conventional techniques.
The marking unit in the second embodiment is described in detail below. The marking unit 3 is used to mark the matching reference sequence in the database at intervals following an arithmetic progression. Under this scheme the reference sequence or the matching reference sequence is marked so that the position of a nucleic acid sequence fragment can be determined when it is matched. The marking scheme is described specifically below.
Table 5
The marks in Table 5 follow an arithmetic progression. The common difference of the progression is not limited; in this embodiment it is 256. Because the marks are placed at intervals, the storage space occupied by the database is greatly reduced. In addition, when the reference sequence or matching reference sequence is long, the Int type is preferably used for the marks (the markable reference sequence length is 2^31); when it is short, the Byte type is preferably used. Compared with marking with the LongInt type, this further reduces the storage space occupied by the database.
The marking unit is further described below. The marking unit is also used to mark the matching reference sequence in the database further, using another arithmetic progression within each interval of the first progression. For clarity, this further function of the marking unit is illustrated on the basis of Table 4; see Table 6.
Table 6
With the technical solution of this embodiment, once a nucleic acid sequence fragment has been matched on the reference sequence or matching reference sequence, its exact position on the reference sequence can be obtained more quickly. For example, suppose the matched start position is position 274 of the reference sequence. With only one level of marks, the position is found by stepping 18 positions forward from mark 256. With the additional level of marks, 256 + 16 = 272 is known directly, and only two further steps are needed to reach the start position, which greatly improves the efficiency of the matching unit. The data type of the marks in this scheme is not specially restricted; the Byte type is preferred because, compared with other data types, it greatly reduces the storage space occupied by the database.
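The arithmetic of the two-level marks can be sketched as follows. The function name `locate_with_marks` is hypothetical, and the second-level step of 16 is an assumption inferred from the "256 + 16 = 272" example; the description does not state the second-level common difference explicitly.

```python
def locate_with_marks(position: int, coarse_step: int = 256, fine_step: int = 16):
    """Split a position into (coarse mark, fine offset, remaining steps)."""
    coarse = (position // coarse_step) * coarse_step
    # The fine offset restarts from 0 inside every coarse interval.
    fine = ((position - coarse) // fine_step) * fine_step
    steps = position - coarse - fine
    return coarse, fine, steps


coarse, fine, steps = locate_with_marks(274)
print(coarse, fine, steps)

# With only the coarse marks, the walk from 256 to 274 takes 18 steps.
single_level_steps = 274 - coarse
print(single_level_steps)
```

Under these assumptions every fine offset is below 256, so it fits in a single byte, which is consistent with the preference for the Byte type stated above.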
It should be noted that the above example applies only two levels of marking to the reference sequence or matching reference sequence. If more levels are required, they can be added in the same manner as in the above example, which is not repeated here. The marking scheme of the present invention is not limited to the examples given above.
The matching unit 4 in the second embodiment is further described below. The matching unit 4 is used to reverse a nucleic acid sequence fragment to form a reverse nucleic acid sequence fragment, or to reverse-complement it to form a reverse complementary nucleic acid sequence fragment, and to perform consistency matching of the nucleic acid sequence fragment, the reverse nucleic acid sequence fragment, or the reverse complementary nucleic acid sequence fragment against the matching reference sequence in the database.
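The two fragment transformations named above can be sketched as follows. The function names are hypothetical; the complement table is the standard DNA base pairing (A with T, C with G).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse(fragment: str) -> str:
    """Return the fragment read from its last base to its first."""
    return fragment[::-1]


def reverse_complement(fragment: str) -> str:
    """Complement every base, then reverse the result."""
    return "".join(COMPLEMENT[base] for base in reversed(fragment))


print(reverse("ACC"))
print(reverse_complement("ACC"))
```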
Specific implementations are given below for consistency matching of nucleic acid sequence fragments, reverse nucleic acid sequence fragments, and reverse complementary nucleic acid sequence fragments against the reference sequence or matching reference sequence in the database.
The following is assumed: the nucleic acid sequence fragment is ACC; the reference sequence in the database is G$CCAACTC; and the corresponding matching reference sequence information in the database is shown in Table 7.
Table 7
(1) Consistency matching of the forward nucleic acid sequence fragment. The procedure is shown in Figure 9.
For ease of explanation, the columns are labeled ①②③④⑤⑥ underneath. Columns ① to ⑤ are the matching reference sequence information in the database, and column ⑥ is the nucleic acid sequence fragment. Columns ① and ② are the first and last columns of the BWT reference sequence matrix, respectively; column ③ records the position in the reference sequence of each base in column ①; column ④ marks positions of the reference sequence (interval marking may be used); and column ⑤ is the reference sequence. In this scheme the last matched position of ACC is 3. Position 3 is located in column ③; starting from there, the position of base C is found in column ①, and the longest common prefix is then obtained in the BWT reference sequence matrix. If the longest common prefix is at least 3, all positions of the nucleic acid sequence fragment on the reference sequence can be obtained from column ③. With this technical solution, the fragment needs to be matched only once against identical stretches of the reference sequence to obtain all of its start positions on the reference sequence, which greatly improves the efficiency of consistency matching.
(2) Consistency matching of the reverse nucleic acid sequence fragment.
To aid understanding, the technical solution is described below starting from the BWT reference sequence matrix, as shown in Figure 10.
The scheme is explained in detail as follows. In the first row, C, CC and ACC are the first one, first two and first three bases of the reverse nucleic acid sequence fragment, and consistency matching proceeds one base at a time starting from the first. The upper and lower positions indicated by the arrows are the start and end of the matched range. Each base of the reverse fragment is matched in the manner shown in the table, which gives the matching of the first three bases. The result shows two matched positions. From the corresponding matching reference sequence information in the database, the matched start positions are the first and fourth positions of the reference sequence. In this technical solution, the fragment needs to be matched only once against identical stretches of the reference sequence to obtain all of its start positions on the reference sequence, which greatly improves the efficiency of consistency matching.
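The base-by-base narrowing of a row range (C, then CC, then ACC) resembles the standard FM-index backward search over the first and last columns. The sketch below implements that standard technique as an illustration; it is not claimed to reproduce the exact procedure drawn in Figure 10. All names are hypothetical, the occurrence counts are computed naively for readability, and positions are reported 1-based.

```python
def backward_search(fragment: str, reference: str, marker: str = "$") -> list[int]:
    """Return the 1-based start positions of fragment in reference."""
    text = reference + marker
    n = len(text)
    # order[r] is the 0-based start of the rotation in sorted row r.
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in order)  # last column
    # smaller[c] = number of characters in the text that sort before c.
    smaller = {c: sum(1 for x in text if x < c) for c in set(text)}

    lo, hi = 0, n  # current half-open row range
    for base in reversed(fragment):  # extend the match one base to the left
        if base not in smaller:
            return []
        lo = smaller[base] + last[:lo].count(base)
        hi = smaller[base] + last[:hi].count(base)
        if lo >= hi:
            return []
    return sorted(order[r] + 1 for r in range(lo, hi))


print(backward_search("ACC", "ACCACCTG"))
```

For ACC the row range shrinks from four rows (those beginning with C) to two (CC) and then to the two rows beginning with ACC, whose positions are 1 and 4.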
(3) Consistency matching of the reverse complementary nucleic acid sequence fragment.
An example is given below. Part A of Figure 11 shows the reverse complementary nucleic acid sequence fragment obtained by transforming the nucleic acid sequence fragment; step (3) continues from (2). Specifically, the reverse complement of ACC is GGT, and its matching procedure is shown in part B of Figure 9. In this technical solution the reverse complementary fragment is matched starting from the back, and the position of the corresponding reference sequence is found in column ③. In this specific embodiment the last matched position of GGT is 6. Position 6 is located in column ③; starting from there, the last occurrence of base C, the complement of base G, is found in column ①, and the longest common prefix is then obtained in the BWT reference sequence matrix. If the longest common prefix is at least 3, all positions of the nucleic acid sequence fragment on the reference sequence can be obtained from column ③. With this technical solution, the fragment needs to be matched only once against identical stretches of the reference sequence to obtain all of its start positions on the reference sequence, which greatly improves the efficiency of consistency matching.
The consistency matching described in the embodiments covers both exact matching and, when N mismatches are allowed per nucleic acid sequence fragment, matching in which at most N bases fail to match. A specific scheme for the mismatch case is given below; see Figure 5. In the figure, when matching reaches a certain base C that cannot be matched, the base "C" immediately before it is substituted and matching is retried. If matching still fails after that base "C" has been replaced in turn by "T", "G" and "A", the base one position further back is substituted. Once the base "C" at that earlier position has been replaced by "T", matching succeeds and the remaining bases are matched. This technical solution thus covers the case where mismatches are allowed: when one mismatch is allowed, at most one base of the fragment may fail to match in the database; when N mismatches are allowed, at most N bases may fail to match, that is, the fragment can be matched in the database after modifying any N of its positions. This technical solution enables detection of gene mutations while retaining fast matching. In addition, base substitution (that is, correction of base-calling errors) resolves the problem of nucleic acid sequences that cannot be matched because of base-calling errors, which helps improve the accuracy of nucleic acid sequencing.
For any of the above technical solutions, the present invention proposes a third embodiment, in which the system for matching nucleic acid sequence information includes an information receiving unit, a database, a reference sequence transformation unit, a marking unit and a matching unit, as shown in Figure 6. The database, reference sequence transformation unit, marking unit and matching unit are not described again in this embodiment; for their details, refer to any of the above technical solutions. Only the information receiving unit is described further below. The information receiving unit is used to receive nucleic acid sequence fragment information and reference sequence information. The system may include a computer, which may include a USB interface, an optical disc drive interface, or an Internet network interface. Preferably, the information receiving unit obtains the nucleic acid sequence fragments and the reference sequence through the USB interface, the optical disc drive interface, or the Internet. The information receiving unit stores the received information, with nucleic acid sequence fragments and reference sequences stored separately. The matching unit can retrieve nucleic acid sequence fragments from the database that stores them and perform consistency matching against the matching reference sequence in the database that stores the reference sequence to obtain the matching result. The matching result can be output in a readable form, including, for example, the length of each nucleic acid sequence fragment, the number of unmatched bases in each fragment, and the matched position of each fragment. The output is a matter of form only and is not elaborated here.
Based on the first embodiment, the present invention proposes a fourth embodiment. The method for matching nucleic acid sequence information includes the following steps.
Step S1: perform a BWT transformation on the reference sequence in the database to obtain a matching reference sequence, and store the matching reference sequence in the database.
The reference sequence stored in the database is a reference sequence stored inside the computer or in a memory outside the computer. The reference sequence is a base sequence, that is, nucleic acid sequence information. The reference sequence and the nucleic acid sequence fragments are nucleic acid sequence information of the same species. For example, if the nucleic acid sequence fragments were obtained by sequencing the nucleic acid of Paramecium, the corresponding reference sequence is the nucleic acid sequence information of Paramecium. The reference sequence and nucleic acid sequence fragments may also be obtained from artificial sequences. There is no particular restriction on the reference sequence or the nucleic acid sequence fragments, except that the reference sequence is a known base sequence.
The BWT (Burrows-Wheeler transform) is a transformation method that Mike Burrows refined, based on the idea proposed by David Wheeler, and successfully applied to practical data compression; it is currently a research focus in the field of lossless compression. BWT is a reversible data transformation that operates on blocks of data, and its core idea is to sort and transform the character matrix obtained by rotating a string. After the BWT transformation is applied to the reference sequence in the database, the resulting matching reference sequence is stored in the database.
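The reversibility mentioned above can be illustrated with the textbook naive inversion, which rebuilds the sorted rotation matrix one column at a time from the last column alone. This is a sketch for short strings only; the function names are hypothetical and the quadratic-or-worse cost makes it unsuitable for real reference sequences.

```python
def bwt(text: str) -> str:
    """Last column of the sorted rotation matrix; text must end with a unique marker."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rows)


def inverse_bwt(last_column: str, marker: str = "$") -> str:
    """Recover the original text (including the marker) from the last column."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then sort again.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the row that ends with the marker.
    return next(row for row in table if row.endswith(marker))


transformed = bwt("ACCACCTG$")
print(transformed)
print(inverse_bwt(transformed))
```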
Step S2: mark the matching reference sequence in the database at intervals.
The manner of interval marking of the matching reference sequence in the database is not limited; an arithmetic progression or another sequence may be used to place marks at regular intervals. The data type of the marks can be chosen as needed, for example Int or Byte.
Step S3: reverse-complement the nucleic acid sequence fragment to form a reverse complementary nucleic acid sequence fragment, then perform consistency matching of the reverse complementary fragment against the matching reference sequence in the database to obtain the matched nucleic acid sequence.
Consistency matching covers both the case where mismatches are allowed and the case where they are not. When N mismatches are allowed, a nucleic acid sequence fragment is consistently matched if at most N of its bases differ from the matching reference sequence in the database; when no mismatches are allowed, it is consistently matched only if it is identical to the matching reference sequence in the database. N is a positive integer.
The nucleic acid sequence fragments are stored within the system or in a memory outside the system. Either the whole nucleic acid sequence fragment is matched directly against the matching reference sequence in the database, or the whole fragment is matched from its head and tail simultaneously against the matching reference sequence in the database. Consistency matching here means that, when N mismatches are allowed, a whole fragment with at most N bases that cannot be matched to the matching reference sequence is considered matched and yields a matched nucleic acid sequence fragment; otherwise the fragment is considered unmatched and is discarded. All other nucleic acid sequence fragments are matched in the database in the same way, giving the matched nucleic acid sequences. The matched nucleic acid sequences can be output in a readable form or stored in the database. When they are output, the information may include the start and end positions of each fragment on the reference sequence, and the positions and number of mismatches in each fragment.
In the technical solution of this embodiment, whole nucleic acid sequence fragments are matched directly in the database, and identical stretches of the matching reference sequence are matched only once, which improves matching efficiency. At the same time, the reference sequence stored in the database does not need to be split into segments to build a reference sequence index (if the index entries have length K, then for two adjacent entries the last K-1 bases of the earlier entry are identical to the first K-1 bases of the later entry), and it is marked only at intervals; compared with the prior art, this greatly reduces storage space.
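The overlap described in the parenthetical remark can be made concrete with a small sketch of a segment index of length K. The function name `kmer_index` and the choice K = 4 are hypothetical; the point is only that adjacent entries share K-1 bases, so the index stores far more characters than the reference itself.

```python
def kmer_index(reference: str, k: int) -> list[str]:
    """Return every length-k segment of the reference, in order."""
    return [reference[i:i + k] for i in range(len(reference) - k + 1)]


kmers = kmer_index("ACCACCTG", 4)
# Each entry's last k-1 bases equal the next entry's first k-1 bases.
overlaps = all(a[1:] == b[:-1] for a, b in zip(kmers, kmers[1:]))
stored_characters = sum(len(entry) for entry in kmers)
print(kmers, overlaps, stored_characters)
```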
Step S1 above includes: S11, adding an identifier to the end or front of the reference sequence in the database and cyclically shifting the reference sequence to obtain the reference sequence matrix; S12, sorting the reference sequence matrix in lexicographic order to obtain the BWT reference sequence matrix.
An example is given for this technical solution. Suppose the nucleic acid sequence fragment to be matched is CCACC and the BWT reference sequence matrix is as follows.
Row 1 $ACCACCTG
Row 2 ACCACCTG$
Row 3 ACCTG$ACC
Row 4 CACCTG$AC
Row 5 CCACCTG$A
Row 6 CCTG$ACCA
Row 7 CTG$ACCAC
Row 8 G$ACCACCT
Row 9 TG$ACCACC
During alignment, since the first base of the nucleic acid sequence fragment is C, alignment only needs to start from the fourth row of the BWT matrix. The second base of the fragment is compared with the second position of the fourth row; if they match, the third base of the fragment is compared with the third position of the fourth row, and so on. If the second position does not match, alignment moves to the fifth row and the same procedure is repeated, continuing up to the seventh row. Note that if the fragment has only M bases, the n-th base of the fragment is compared with the n-th position of the row being aligned, and the comparison stops at position M. In this technical solution the lexicographically sorted BWT reference sequence matrix is the matching reference sequence. When a fragment is aligned against it and the first base of the fragment is A, alignment is needed only in the rows whose first column is A; likewise, when the first base is G, C or T, alignment is needed only in the rows whose first column is G, C or T, respectively. This greatly increases the alignment speed.
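The row-restricted comparison described above can be sketched directly on the nine-row matrix listed earlier. The function name `match_rows` is hypothetical; row numbers are 1-based to match the listing.

```python
BWT_ROWS = [
    "$ACCACCTG", "ACCACCTG$", "ACCTG$ACC",
    "CACCTG$AC", "CCACCTG$A", "CCTG$ACCA",
    "CTG$ACCAC", "G$ACCACCT", "TG$ACCACC",
]


def match_rows(fragment: str, rows: list[str]) -> tuple[list[int], list[int]]:
    """Return (candidate rows, matching rows) as 1-based row numbers."""
    # Only rows whose first column equals the fragment's first base are examined.
    candidates = [r for r, row in enumerate(rows, start=1) if row[0] == fragment[0]]
    m = len(fragment)
    # Compare position by position, but only up to the fragment length M.
    matches = [r for r in candidates if rows[r - 1][:m] == fragment]
    return candidates, matches


candidates, matches = match_rows("CCACC", BWT_ROWS)
print(candidates, matches)
```

For CCACC only rows 4 to 7 are examined, and row 5 is the single match.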
After step S12, the method may further include: S13, extracting the first column and the last column of the BWT reference sequence matrix to obtain the matching reference sequence, and storing it in the database. The flowchart of step S1 is shown in Figure 7. First, an identifier is added at the end or front of the reference sequence. The identifier may be any character other than A, G, C and T and is added to distinguish the beginning and end of the reference sequence; a reference sequence of length X has length X+1 after the character is added. Next, the reference sequence with the identifier is cyclically shifted, giving an (X+1) × (X+1) reference sequence matrix. The reference sequence matrix is then sorted in lexicographic order to obtain the BWT reference sequence matrix, where lexicographic order means the order A, B, C, D, ... of Chinese Pinyin. Finally, the first and last columns are extracted and stored. Preferably, the added identifier is treated as the largest or the smallest character.
In the above technical solution, a nucleic acid sequence fragment needs to be matched only once against identical stretches of the reference sequence. The reason is that, after the BWT transformation, adjacent rows of the BWT reference sequence matrix in the database share their longest common prefixes. During matching, if a nucleic acid sequence fragment of length m matches row r, then every row whose longest common prefix with row r is at least m also matches. It suffices to determine the longest common prefix; no further consistency matching of the fragment is needed, so the fragment is matched only once. For example, if the fragment is ACC (length 3), the second and third rows of the BWT reference sequence matrix have a longest common prefix of length 3, namely ACC, so comparing against the common prefix completes the matching of the fragment in the database. This technical solution addresses both the slow fragment matching and the large storage requirements of conventional techniques: fragments are matched quickly and the matching reference sequence occupies little space.
In this embodiment, step S2 marks the matching reference sequence in the database at intervals, so that the start position of a match can be obtained quickly when a nucleic acid sequence fragment is matched. Preferably, in step S2 the matching reference sequence in the database is marked at intervals following an arithmetic progression, which greatly reduces the storage space of the database. More preferably, in step S2 a further arithmetic progression is used within each interval of the first progression to mark the matching reference sequence further; this not only allows the matched position of a fragment to be obtained quickly but also further reduces the storage space of the database. In this scheme the further marks may be renumbered, so that a smaller data type can be used than for the first-level marks; for example, the preferred scheme uses the Int type for its marks, and the more preferred scheme may use the Byte type for the further marks.
In this embodiment, step S3 above includes reverse-complementing the nucleic acid sequence fragment to form a reverse-complementary fragment, and performing consistency matching between the reverse-complementary fragment and the matching reference sequence in the database to obtain the matching nucleic acid sequence. Further, in step S3, a backtracking method is used to substitute bases, one position at a time, at positions preceding the position where the reverse-complementary fragment fails to match, and matching against the database then continues from the substituted position.
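Forming the reverse-complementary fragment can be sketched in a few lines. The function name `reverse_complement` is illustrative; leaving characters other than A, C, G, and T unchanged (for example N) is an assumption, since the source does not say how such characters are handled.

```python
# Translation table for complementing DNA bases in either letter case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(fragment: str) -> str:
    """Complement every base, then reverse the fragment."""
    return fragment.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AACG"))  # CGTT
```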
Based on the fourth embodiment, the present invention proposes a fifth embodiment; FIG. 8 shows the flowchart of the method for matching a nucleic acid sequence fragment. First, a nucleic acid sequence fragment is matched against the matching reference sequence in the database. If the match succeeds, the matching ends. If it fails, it is determined whether mismatches are allowed: if they are not, the matching ends; if they are, a base substitution is made starting from the position immediately before the position that could not be matched, and matching is then attempted again.
An example is given below, as shown in FIG. 12. In this example, after a base in the nucleic acid sequence fragment has been substituted, consistency matching against the matching reference sequence in the database continues. Here, once the base A at the third position of the nucleic acid sequence is replaced with T, the fragment matches the matching reference sequence in the database completely, and the matching of this fragment is complete.
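The match, substitute, and retry flow can be sketched as a small backtracking search. This is an illustration under stated assumptions, not the procedure of FIG. 8 or FIG. 12: the plain substring test `candidate in reference` stands in for a lookup in the indexed database, and the function name, the sample reference `GGACTGAC`, and the sample fragment `ACAG` are hypothetical.

```python
BASES = "ACGT"


def match_with_substitutions(reference: str, fragment: str, max_mismatches: int):
    """Return the first variant of `fragment`, with at most `max_mismatches`
    substituted bases, that occurs in `reference`; return None otherwise."""

    def extend(prefix: str, pos: int, budget: int):
        if pos == len(fragment):
            return prefix  # every position has been matched
        # Try the original base first, then substitutions if budget remains.
        candidates = [fragment[pos]]
        if budget > 0:
            candidates += [b for b in BASES if b != fragment[pos]]
        for base in candidates:
            candidate = prefix + base
            if candidate in reference:  # stand-in for an index lookup
                cost = 0 if base == fragment[pos] else 1
                found = extend(candidate, pos + 1, budget - cost)
                if found is not None:
                    return found
        return None  # dead end: the caller backtracks to an earlier position

    return extend("", 0, max_mismatches)


if __name__ == "__main__":
    # The third base A is replaced with T, after which the fragment matches.
    print(match_with_substitutions("GGACTGAC", "ACAG", 1))  # ACTG
    print(match_with_substitutions("GGACTGAC", "ACAG", 0))  # None
```

When an extension fails, the recursion returns to the previous position and tries a substitution there, which corresponds to substituting at a position before the one that could not be matched.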
In this embodiment, the number of allowed mismatches is not specially limited and is determined by the length of the nucleic acid sequence fragment: longer fragments may be allowed more mismatches and shorter fragments fewer. For example, a 50 bp fragment is allowed 4 mismatches, a 30 bp fragment 2 mismatches, and a 10 bp fragment 0 mismatches. In this embodiment, the fragment is matched against the reference sequence. When no mismatch is allowed, a fragment that cannot be matched completely is considered unmatched. When M mismatches are allowed, base substitutions may be made at up to M positions of the fragment; if the fragment still cannot be matched after substitutions at M positions, it is considered unmatched, and otherwise it is considered matched.
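The length-dependent mismatch allowance can be sketched as a lookup. The source gives only three example points (50 bp, 30 bp, and 10 bp); treating them as thresholds for the lengths in between is an assumption made here for illustration, and the function names are hypothetical.

```python
def allowed_mismatches(length: int) -> int:
    """Return the mismatch allowance M for a fragment of the given length.

    Only the values at 50, 30, and 10 bp come from the text; the behaviour
    between and beyond those points is an assumed step function.
    """
    if length >= 50:
        return 4
    if length >= 30:
        return 2
    return 0


def is_match(fragment: str, window: str) -> bool:
    """Accept an equal-length reference window that differs from the
    fragment at no more than M positions."""
    if len(fragment) != len(window):
        return False
    mismatches = sum(a != b for a, b in zip(fragment, window))
    return mismatches <= allowed_mismatches(len(fragment))
```

Under this assumption a 10 bp fragment must match exactly, while a 30 bp fragment is still accepted when two of its positions differ from the reference.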
It should be noted that a typical application of the present invention is, without limitation, the matching of nucleic acid sequence fragments in the field of biochemical sequencing; the method described in the present invention can also be applied in other similar information processing fields.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510482636.9A CN105069325B (en) | 2012-07-28 | 2012-07-28 | It is a kind of that matched method is carried out to nucleic acid sequence information |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210263634.7A CN102841988B (en) | 2012-07-28 | 2012-07-28 | A kind of system and method that nucleic acid sequence information is mated |
CN201510482636.9A CN105069325B (en) | 2012-07-28 | 2012-07-28 | It is a kind of that matched method is carried out to nucleic acid sequence information |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210263634.7A Division CN102841988B (en) | 2012-07-28 | 2012-07-28 | A kind of system and method that nucleic acid sequence information is mated |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105069325A CN105069325A (en) | 2015-11-18 |
CN105069325B true CN105069325B (en) | 2018-10-09 |
Family
ID=47369343
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210263634.7A Active CN102841988B (en) | 2012-07-28 | 2012-07-28 | A kind of system and method that nucleic acid sequence information is mated |
CN201510482636.9A Expired - Fee Related CN105069325B (en) | 2012-07-28 | 2012-07-28 | It is a kind of that matched method is carried out to nucleic acid sequence information |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN102841988B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN105069325A (en) | 2015-11-18 |
CN102841988A (en) | 2012-12-26 |
CN102841988B (en) | 2015-10-21 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20181128. Patentee after: SHENZHEN MAIYA ACCELERATOR TECHNOLOGY Co.,Ltd. (address: 518117 Pingshan Street, Pingshan District, Shenzhen City, Guangdong Province, Xinhe Fourth Road Merchants Garden, 8 buildings, 3 floors and 3 rooms). Patentee before: Sheng Sichong (address: 518057 South Mountain High-tech Zone, Shenzhen City, Guangdong Province). |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20181009 |