CN105260395A

CN105260395A - Reverse index structure based STR data storage and paternity test sorting comparison method

Info

Publication number: CN105260395A
Application number: CN201510590067.XA
Authority: CN
Inventors: 刘健; 李宝娟; 高东怀; 许卫中; 孙茂; 许浩; 靳豪杰; 张军超
Original assignee: Fourth Military Medical University FMMU
Current assignee: Fourth Military Medical University FMMU
Priority date: 2015-09-16
Filing date: 2015-09-16
Publication date: 2016-01-20
Anticipated expiration: 2035-09-16
Also published as: CN105260395B

Abstract

The invention discloses a method for storing STR data based on an inverted index structure and sorting and comparing methods for paternity identification, belonging to the technical field of data storage and processing. The STR data storage based on the inverted index structure and the paternity identification sorting comparison method of the present invention mainly include two aspects: one is the STR data storage method based on the inverted index structure. In different data domains, STR data is stored in an inverted index structure in the data domain; the second is the paternity identification sorting and comparison method, which is based on the inverted index structure of the divided domains, and calculates the genetic relationship between the family search samples and the samples in the database , to achieve fast, stable and reliable online family search.

Description

STR data storage and paternity identification sorting comparison method based on inverted index structure

技术领域technical field

本发明属于数据存储及处理技术领域，具体涉及一种基于倒排索引结构的STR数据存储及亲子鉴定排序比对方法。The invention belongs to the technical field of data storage and processing, and in particular relates to an inverted index structure-based STR data storage and paternity identification sorting and comparison method.

背景技术Background technique

据不完全统计，目前全国共有约50万寻亲人员，其中由于历史、自然灾害、社会问题等原因造成的战乱孤儿(日本遗孤)、自然灾害孤儿和被拐卖人口等构成了寻亲人员的主体。近年来，随着生物技术的不断发展，通过基因技术进行寻亲变得越来越可行。According to incomplete statistics, there are currently about 500,000 family-seekers across the country. Among them, war orphans (Japanese orphans), natural disaster orphans, and trafficked people caused by history, natural disasters, and social problems constitute the main body of family-seekers. . In recent years, with the continuous development of biotechnology, family tracing through genetic technology has become more and more feasible.

基于基因技术的寻亲是主要是通过使用亲子鉴定技术对人类遗传标记进行检测，并根据遗传规律分析对疑似父母与子女血缘关系的鉴定。DNA是人类遗传信息的基本载体，人类的染色体主要由DNA构成的，每个人体细胞有22对常染色体染色体和1对性染色体，共计46条，分别来自父亲和母亲。父母双方各自为子代提供一半染色体，在受精后相互配对，构成子代的染色体。由于人体约有30亿个核苷酸构成整个染色体系统，而且在生殖细胞形成前的互换和组合是随机的，所以除同卵双胞胎以外，没有任何两个人具有完全相同的核苷酸序列，这就是人的遗传多态性。尽管存在遗传的多态性，但每一个人的染色体必然也只能来自其父母，这就是DNA亲子鉴定的理论基础。目前，在亲子鉴定过程中应用得较为广泛的是基于短串联重复序列(shorttandemrepeat,STR)的鉴定技术，由于其高度敏感化、高度个性化、完全数字化等特征，该技术已经成为世界通用的鉴别技术。一个典型的常染色体STR基因座数据如下所示：Family tracing based on gene technology mainly uses paternity identification technology to detect human genetic markers, and analyzes the blood relationship between suspected parents and children according to genetic laws. DNA is the basic carrier of human genetic information. Human chromosomes are mainly composed of DNA. Each human cell has 22 pairs of autosomal chromosomes and 1 pair of sex chromosomes, a total of 46, which come from the father and mother respectively. Both parents provide their offspring with half of their chromosomes, which are paired with each other after fertilization to form the offspring's chromosomes. Since there are about 3 billion nucleotides in the human body to form the entire chromosome system, and the exchange and combination before the formation of germ cells is random, so no two people have the exact same nucleotide sequence except for identical twins. This is human genetic polymorphism. Although there are genetic polymorphisms, each person's chromosomes must only come from their parents, which is the theoretical basis of DNA paternity testing. At present, the identification technology based on short tandem repeat (STR) is widely used in the process of paternity identification. Due to its characteristics of high sensitivity, high personalization, and complete digitalization, this technology has become a common identification technology in the world. technology. A typical autosomal STR locus data is shown below:

位点site STRSTR D8S1179D8S1179 13/1413/14 D21S11D21S11 31/3231/32 D7S820D7S820 11/1211/12

CSF1POCSF1PO 10/1310/13 D3S1358D3S1358 15/1615/16 D5S818D5S818 11/1311/13 D13S317D13S317 8/128/12 D16S539D16S539 9/129/12 D2S1338D2S1338 17/2317/23 D19S433D19S433 14/1414/14 vWAwxya 16/1816/18 D12S391D12S391 18/2118/21 D18S51D18S51 13/1313/13 AMELAMEL X/YX/Y D6S1043D6S1043 12/1912/19 FGAFGA 22/2322/23

。.

目前，亲子鉴定问题的解决多依托于关系型数据库来存储和比对STR数据，以实现样本供体亲子关系的判定。对于一个位点而言，其STR数据主要有两个数字组成，其中一个来自父亲，另一个则来源于母亲。在检测过程中，假设每个样品检测16个位点(其中包括一个性别位点)。每个样品的相同位点都会有两个等位基因的数值。具有生物学亲子关系的两个被检人的15个STR位点中，每一个位点的数据都要求至少有一个数值是相同的。对于这一问题而言，要判断两个个体之间是否具有亲子关系，最多需要在每个位点上比较4次，15个位点则最多需要比较60次。当系统中存储的样本量逐渐增加时，其比对计算量也将逐渐增加。因此，尽管使用关系型数据库的存储和比对方式在一定程度上解决了寻亲数据库的存储和检索问题，但由于人体STR基因座数据自身的特点，其并不适合使用“表格”的关系型数据库来存储，并在很大程度上影响了STR数据的比对效率。此外，基因信息还存在变异的可能性，一旦STR数据发生突变，将进一步增大使用STR数据的进行寻亲和亲子鉴定难度。At present, the solution to the paternity test problem mostly relies on relational databases to store and compare STR data, so as to realize the determination of the parent-child relationship of sample donors. For a locus, its STR data mainly consists of two numbers, one from the father and the other from the mother. In the detection process, it is assumed that each sample detects 16 loci (including one sex locus). Each sample will have values for both alleles at the same locus. Among the 15 STR loci of two subjects with biological parent-child relationship, the data of each locus requires at least one value to be the same. For this problem, to determine whether there is a parent-child relationship between two individuals, at most 4 comparisons at each locus are required, and at most 60 comparisons are required for 15 loci. When the amount of samples stored in the system gradually increases, the comparison calculation amount will also gradually increase. Therefore, although the storage and comparison methods of relational databases can solve the problem of storage and retrieval of relative databases to a certain extent, due to the characteristics of human STR locus data, it is not suitable to use "table" relational databases. It is stored in the database and affects the comparison efficiency of STR data to a large extent. In addition, there is a possibility of variation in genetic information. Once the STR data is mutated, it will further increase the difficulty of using STR data for family finding and paternity testing.

发明内容Contents of the invention

为了克服上述现有技术存在的缺陷，本发明的目的在于提供一种基于倒排索引结构的STR数据存储及亲子鉴定排序比对方法，该方法能够有效提高寻亲数据库的鲁棒性和比对效率，同时能够有效保证寻亲结果的可靠性。In order to overcome the above-mentioned defects in the prior art, the object of the present invention is to provide a method for storing STR data based on an inverted index structure and sorting and comparing methods for paternity identification, which can effectively improve the robustness and comparison of relative databases Efficiency, and at the same time can effectively guarantee the reliability of the family search results.

本发明是通过以下技术方案来实现：The present invention is achieved through the following technical solutions:

本发明公开的基于倒排索引结构的STR数据存储及亲子鉴定排序比对方法，包括以下步骤：The method for storing STR data based on an inverted index structure and sorting and comparing methods for paternity testing disclosed by the present invention comprises the following steps:

1)基于倒排索引结构的STR数据存储1) STR data storage based on inverted index structure

首先，将所有STR数据进行预处理，将每个样本的STR数据集整理为标准格式；然后，将每一个位点作为一个数据域，每个数据域中将存储各自的STR数据；最后，将STR数据以倒排索引的方式存储；First, preprocess all STR data, organize the STR data set of each sample into a standard format; then, use each site as a data domain, and each data domain will store its own STR data; finally, the STR data is stored as an inverted index;

2)基于以倒排索引的方式存储的STR数据的亲子鉴定排序比对2) Sorting comparison of paternity based on STR data stored in the form of inverted index

首先，将待寻亲STR数据进行预处理，将每个样本的STR数据集整理为标准格式；然后，将每个位点的STR数据在各自的数据域中进行比对，并形成最终的亲子关系指数；最后，判定样本之间是否存在亲子关系，如果亲子关系指数高于特定的值，则认为候选样本的供体与待寻亲样本的供体具有亲子关系，反之则认为两者之间不存在亲子关系。First, preprocess the STR data of the relatives to be searched, and organize the STR data sets of each sample into a standard format; then, compare the STR data of each site in their respective data fields to form the final parent-child Relationship index; finally, determine whether there is a parent-child relationship between the samples. If the parent-child relationship index is higher than a specific value, it is considered that the donor of the candidate sample has a parent-child relationship with the donor of the sample to be found, otherwise, it is considered that there is a parent-child relationship between the two. No parent-child relationship exists.

步骤1)中，对STR数据进行预处理，将每个样本的STR数据集整理为标准格式，具体如下：In step 1), the STR data is preprocessed, and the STR data set of each sample is organized into a standard format, as follows:

将样本数据集记为X＝{x₁,x₂,…，x_n}；Denote the sample data set as X={x ₁ ,x ₂ ,...,x _n };

其中，x_i表示第i个个体的STR数据，其中，表示第j个STR基因座的名称，v_jk表示第k个染色体上基因座j上STR的特征值。Among them, x _i represents the STR data of the i-th individual, in, Indicates the name of the jth STR locus, and _vjk indicates the eigenvalue of the STR at locus j on the kth chromosome.

步骤1)中数据域的建立如下：The establishment of the data field in step 1) is as follows:

遍历所有样本的STR数据，建立STR基因座名称的集合STR_N＝{str₁,str₂,…,str_m}，针对STR_N中的每个str_i，建立不同的数据域，记为d_i；i＝1,2……m。Traverse the STR data of all samples, establish a set of STR locus names STR _N = {str ₁ ,str ₂ ,...,str _m }, and establish a different data field for each str _i in STR _N , denoted as d _i ;i=1,2...m.

步骤1)中将STR数据以倒排索引的方式存储，遍历样本数据集X，对任意x_i，遍历:(v_j1/v_j2)，如果对应的数据域d_m中存在v_j1的索引，则将x_i添加到该索引中；如果不存在v_j1的倒排索引，则建立该索引，并将x_i添加到索引中；对于v_j2采用相同的方式进行处理。In step 1), store the STR data in the form of an inverted index, traverse the sample data set X, and traverse any x _i :(v _j1 /v _j2 ), if If there is an index of v _j1 in the corresponding data field d _m , then add _xi to the index; if there is no inverted index of v _j1 , build the index and add _xi to the index; for v _j2 Proceed in the same way.

步骤2)中将待寻亲STR数据进行预处理，具体如下：In step 2), the STR data to be found are preprocessed, as follows:

将寻亲样本整理为如下格式：y＝{str_j:(v_j1/v_j2)}，其中str_j表示第j个STR基因座的名称，v_jk表示第k个染色体上基因座j上STR的特征值。Organize the family searching samples into the following format: y={str _j :(v _j1 /v _j2 )}, where str _j represents the name of the jth STR locus, and v _jk represents the STR at locus j on the kth chromosome eigenvalues of .

步骤2)中亲子关系指数的计算如下：The calculation of the parent-child relationship index in step 2) is as follows:

对于样本y，遍历str_j:(v_j1/v_j2)，如果存在str_j对应的域d_m，则获得v_j1和v_j2索引所对应的样本集合，分别记为X_j1和X_j2；For sample y, traverse str _j : (v _j1 /v _j2 ), if there is a field d _m corresponding to str _j , then obtain the sample sets corresponding to v _j1 and v _j2 indexes, which are recorded as X _j1 and X _j2 respectively;

取X_j1和X_j2的并集，记为X_j＝X_j1∪X_j2；Take the union of X _j1 and X _j2 , denoted as X _j = X _j1 ∪ X _j2 ;

在获得每个str_j所对应的X_j后，计算X_j的并集X中的每个元素均为候选样本；After obtaining the X _j corresponding to each str _j , calculate the union of X _j Each element in X is a candidate sample;

对X中的每一个元素x_i，计算其中：For each element x _i in X, compute in:

则q_i为候选样本x_i的亲子关系指数。Then q _i is the parent-child relationship index of the candidate sample x _i .

步骤2)中判断是否存在亲子关系，具体如下：In step 2), it is judged whether there is a parent-child relationship, as follows:

依据q_i对候选样本x_i进行降序排序，如果q_i≥θ，则认为待寻亲样本y的供体与候选样本x_i的供体具有亲子关系；反之，则认为两者之间不存在亲子关系；其中，θ为系统事先设定的阈值，为基因座的数量减1。According to q _i , the candidate samples x _i are sorted in descending order. If q _i ≥ θ, it is considered that the donor of the sample y to be searched for relatives has a parent-child relationship with the donor of the candidate sample x _i ; otherwise, it is considered that there is no relationship between the two. Parent-child relationship; among them, θ is the threshold value set in advance by the system, which is the number of loci minus 1.

与现有技术相比，本发明具有以下有益的技术效果：Compared with the prior art, the present invention has the following beneficial technical effects:

1、更高的查询效率1. Higher query efficiency

传统的寻亲数据库往往使用关系型数据库，并通过垂直分割、建立视图、建立统计信息等手段来优化、提高系统的查询效率。但是这些方法查询算法相对复杂，且不易于为缺少相关背景知识运维人员所理解。本发明根据基因座点位的不同而设置不同的数据存储域，并在不同的数据域中使用倒排索引结构来存储成对的STR基因座数据，极大地提高了系统的查询效率。Traditional family search databases often use relational databases, and optimize and improve the query efficiency of the system by means of vertical segmentation, establishment of views, and establishment of statistical information. However, the query algorithms of these methods are relatively complicated, and are not easy to be understood by operation and maintenance personnel who lack relevant background knowledge. The present invention sets different data storage domains according to different locus points, and uses inverted index structures in different data domains to store paired STR locus data, which greatly improves the query efficiency of the system.

2、更高的可扩展性2. Higher scalability

传统的基于关系型数据库的寻亲数据库，往往要求寻亲者必须使用特定的基因位点来进行检测，极大地限制了寻亲数据库的使用范围，给广大寻亲用户带来极大的不便。本发明对于寻亲基因点位没有特定的要求，如果需要对系统的点位扩展，只需要增加相应的数据域，而无需对基础数据结构进行修正，极大地提高了系统的适用范围。Traditional relative databases based on relational databases often require relative searchers to use specific genetic loci for detection, which greatly limits the scope of use of relative databases and brings great inconvenience to the majority of relative search users. The present invention has no specific requirements for the gene points of family searching. If the points of the system need to be expanded, only the corresponding data field needs to be added without modifying the basic data structure, which greatly improves the scope of application of the system.

3、避免基因突变对亲子鉴定效果的影响3. Avoid the influence of gene mutation on the effect of paternity test

基因突变是限制寻亲数据库中亲子鉴定的准确性的巨大障碍之一。由于基因变异的存在，使得亲子两代之间的STR基因座数据并不一定完全重合，因此当使用关系型数据库来存储STR基因座数据时，STR基因座数据之间的完全匹配变得难于操作，SQL语句中的WHERE条件难于精确匹配，极大地限制了亲子鉴定的效果。本发明利用不同数据域中的倒排索引结构来存储STR基因座数据，在查询时，只需要根据不同数据域中的比对得分计算出样本之间的亲子关系指数即可获得样本供体之间具有血缘关系的可能性，并依此进行排序，极大地避免了基因突变对亲子鉴定效果的影响，本发明能够有效提高寻亲数据库的鲁棒性和比对效率，同时能够有效保证寻亲结果的可靠性。Genetic mutations are one of the formidable barriers that limit the accuracy of paternity tests in family tracing databases. Due to the existence of genetic variation, the STR locus data between the two generations of parents and children may not necessarily coincide completely. Therefore, when a relational database is used to store STR locus data, the complete matching between STR locus data becomes difficult to operate. , the WHERE condition in the SQL statement is difficult to match precisely, which greatly limits the effect of paternity testing. The present invention utilizes the inverted index structure in different data domains to store STR locus data. When inquiring, it only needs to calculate the parent-child relationship index between samples according to the comparison scores in different data domains to obtain the parent-child relationship index of the sample donor. The possibility of having a blood relationship between them is sorted accordingly, which greatly avoids the influence of gene mutation on the effect of paternity identification. The present invention can effectively improve the robustness and comparison efficiency of the relative search database, and can effectively ensure the reliability of the results.

附图说明Description of drawings

图1为系统总体框架示意图；Figure 1 is a schematic diagram of the overall framework of the system;

图2为本发明所使用的基于倒排索引结构的STR基因座数据存储；Fig. 2 is the STR locus data storage based on the inverted index structure used in the present invention;

图3为基于倒排索引结构的STR数据存储的算法流程图；Fig. 3 is the algorithm flowchart of the STR data storage based on the inverted index structure;

图4基于以倒排索引的方式存储的STR数据的亲子鉴定排序比对的算法流程图；Fig. 4 is based on the algorithm flow chart of the parent-child identification sorting comparison of the STR data stored in the manner of inverted index;

图5和图6分别为未存储00002号样本时的数据结构以及在D8S1179域中存储00002号样本后的数据结构；Figure 5 and Figure 6 are respectively the data structure when the sample number 00002 is not stored and the data structure after storing the sample number 00002 in the D8S1179 domain;

图7和图8为依据本发明原理设计实现的寻亲系统原型，图7为基于倒排索引结构存储的STR数据，图8为寻亲算法得到的寻亲结果。Figure 7 and Figure 8 are the prototypes of the family searching system designed and implemented according to the principles of the present invention, Figure 7 is the STR data stored based on the inverted index structure, and Figure 8 is the family finding result obtained by the family searching algorithm.

具体实施方式detailed description

下面结合具体的实施例对本发明做进一步的详细说明，所述是对本发明的解释而不是限定。The present invention will be further described in detail below in conjunction with specific embodiments, which are explanations of the present invention rather than limitations.

参见图1，本发明基于倒排索引结构的STR数据存储及亲子鉴定排序比对方法，主要包括两个方面：一是基于倒排索引结构的STR数据存储方法，参见图3，该方法会依据样本所选取STR基因座，建立不同的数据域，在数据域中将STR数据以倒排索引结构存储；二是亲子鉴定排序比对方法，参见图4，该方法基于划分域的倒排索引结构，计算寻亲样本与数据库中样本的亲缘关系，实现快速、稳定、可靠的在线寻亲。Referring to Fig. 1, the STR data storage and paternity identification sorting method based on the inverted index structure of the present invention mainly includes two aspects: one is the STR data storage method based on the inverted index structure, see Fig. 3, the method will be based on The STR locus selected by the sample establishes different data domains, and stores the STR data in the data domain with an inverted index structure; the second is the paternity identification sorting and comparison method, see Figure 4, this method is based on the inverted index structure of the divided domains , to calculate the genetic relationship between the family search samples and the samples in the database, and realize fast, stable and reliable online family search.

1.基于倒排索引结构的STR数据存储方法1. STR data storage method based on inverted index structure

基于倒排索引结构的STR数据存储方法，包括以下步骤：首先，将所有STR数据进行预处理，将每个样本的STR数据集整理为标准格式；然后，将每一个位点作为一个数据域，每个数据域中将存储各自的STR数据。最后，将STR数据倒排索引的方式存储。具体过程如图3所示，具体而言：The STR data storage method based on the inverted index structure includes the following steps: firstly, all STR data are preprocessed, and the STR data set of each sample is organized into a standard format; then, each site is used as a data domain, Each data field will store its own STR data. Finally, store the STR data in an inverted index. The specific process is shown in Figure 3, specifically:

步骤一：数据预处理。待存储的数据整理为如下形式：样本数据集记为X＝{x₁,x₂,…，x_n}，其中x_i表示第i个个体的STR数据，可表示为其中表示第j个STR基因座的名称，v_jk表示第k个染色体上基因座j上STR的特征值。Step 1: Data preprocessing. The data to be stored is organized into the following form: the sample data set is recorded as X={x ₁ ,x ₂ ,…,x _n }, where x _i represents the STR data of the i-th individual, which can be expressed as in Indicates the name of the jth STR locus, and _vjk indicates the eigenvalue of the STR at locus j on the kth chromosome.

步骤二：建立数据域。遍历所有的样本的STR数据，建立STR基因座名称的集合STR_N＝{str₁,str₂,…,str_m}，针对STR_N中的每个str_i，建立不同的数据域，记为d_i。Step 2: Create a data domain. Traverse the STR data of all samples, establish a set of STR locus names STR _N = {str ₁ ,str ₂ ,...,str _m }, and establish a different data domain for each str _i in STR _N , denoted as d _i .

步骤三：数据存储。遍历样本数据集X，对任意x_i，遍历:(v_j1/v_j2)，如果对应的数据域d_m中存在v_j1的索引，则将x_i添加到该索引中；如果不存在v_j1的倒排索引，则建立该索引，并将x_i添加到索引中。对于v_j2采用相同的方式进行处理。Step 3: Data storage. Traversing the sample data set X, for any x _i , traversing :(v _j1 /v _j2 ), if If there is an index of v _j1 in the corresponding data field d _m , then add _xi to the index; if there is no inverted index of v _j1 , build the index and add _xi to the index. The same method is adopted for v _j2 .

使用上述方法存储后的STR基因座数据如图2所示。其中，上方所示的D8S1179、D21S11等为STR基因座所对应的数据域；下方左侧为数据对应的数据键，其中的数字表示STR数值；下方右侧为数据键所对应的倒排索引列表，每个数值表示样本供体的ID号。例如STR数值12所对应的列表包含1、5、7、13、22等数字，表示样本1、5、7、13、22的某条染色体在位点D3S1358上的数值为12。The STR locus data stored using the above method is shown in Figure 2. Among them, D8S1179, D21S11, etc. shown above are the data fields corresponding to the STR loci; the left side of the bottom is the data key corresponding to the data, and the numbers in it represent the STR value; the right side of the bottom is the inverted index list corresponding to the data key , each numerical value represents the ID number of the sample donor. For example, the list corresponding to the STR value 12 contains numbers such as 1, 5, 7, 13, 22, etc., indicating that a certain chromosome of samples 1, 5, 7, 13, and 22 has a value of 12 at the site D3S1358.

2.基于倒排索引结构的亲子鉴定排序比对方法2. Paternity identification sorting and comparison method based on inverted index structure

在基于倒排索引结构的STR数据存储方法的基础上，得到如图2所示STR基因座数据存储结构。进行寻亲时，将使用基于倒排索引结构的亲子鉴定排序比对方法，该方法主要包括以下步骤：首先，将待寻亲STR数据进行预处理，将每个样本的STR数据集整理为标准格式；然后，将每个位点的STR数据在各自的数据域中进行比对，并形成最终的亲子关系指数；最后，判定样本之间是否存在亲子关系，如果亲子关系指数高于特定的值，则认为候选样本的供体与待寻亲样本的供体具有亲子关系，反之则认为两者之间不存在亲子关系。On the basis of the STR data storage method based on the inverted index structure, the data storage structure of the STR loci is obtained as shown in FIG. 2 . When searching for relatives, the paternity ranking and comparison method based on the inverted index structure will be used. This method mainly includes the following steps: First, preprocess the STR data to be searched, and organize the STR data sets of each sample into a standard format; then, compare the STR data of each site in their respective data fields, and form the final parent-child relationship index; finally, determine whether there is a parent-child relationship between the samples, if the parent-child relationship index is higher than a specific value , it is considered that the donor of the candidate sample has a parent-child relationship with the donor of the sample to be traced, otherwise it is considered that there is no parent-child relationship between the two.

步骤一：数据预处理。将寻亲样本整理为如下格式：y＝{str_j:(v_j1/v_j2)}，其中str_j表示第j个STR基因座的名称，v_jk表示第k个染色体上基因座j上STR的特征值。Step 1: Data preprocessing. Organize the family searching samples into the following format: y={str _j :(v _j1 /v _j2 )}, where str _j represents the name of the jth STR locus, and v _jk represents the STR at locus j on the kth chromosome eigenvalues of .

步骤二：计算亲子关系指数。对于样本y，遍历str_j:(v_j1/v_j2)，如果存在str_j对应的域d_m，则获得v_j1和v_j2索引所对应的样本集合，分别记为X_j1和X_j2。取X_j1和X_j2的并集，记为在获得每个str_j所对应的X_j后，计算X_j的并集X中的每个元素均为候选样本。对X中的每一个元素x_i，计算其中，Step 2: Calculate the parent-child relationship index. For sample y, traverse str _j :(v _j1 /v _j2 ), if there is a domain d _m corresponding to str _j , then obtain the sample sets corresponding to the indices of v _j1 and v _j2 , denoted as X _j1 and X _j2 respectively. Take the union of X _j1 and X _j2 , denoted as After obtaining the X _j corresponding to each str _j , calculate the union of X _j Each element in X is a candidate sample. For each element x _i in X, compute in,

步骤三：判断是否存在亲子关系。依据q_i对候选样本x_i进行降序排序，如果q_i≥θ则认为待寻亲样本y的供体与候选样本x_i的供体具有亲子关系；反之则认为两者之间不存在亲子关系。其中θ为系统事先设定的阈值，一般可考虑设置为基因座的数量减1。Step 3: Determine whether there is a parent-child relationship. According to q _i , the candidate samples x _i are sorted in descending order. If q _i ≥ θ, it is considered that the donor of the sample y to be searched for relatives has a parent-child relationship with the donor of the candidate sample x _i ; otherwise, there is no parent-child relationship between the two . Among them, θ is the threshold value set in advance by the system, which can generally be considered to be set as the number of loci minus 1.

具体实例如下：Specific examples are as follows:

需要存储的寻亲样本如表1所示，待寻亲样本如表2所示。The samples that need to be stored are shown in Table 1, and the samples to be found are shown in Table 2.

表1寻亲数据库中要存储的样本示例Table 1 Example of samples to be stored in the family tracing database

样本IDSample ID 0000100001 0000200002 0000300003 0000400004 0000500005 ……... D8S1179D8S1179 14/1514/15 13/1513/15 10/1310/13 13/1513/15 13/1413/14 ……... D21S11D21S11 30.2/3130.2/31 29/32.229/32.2 30/31.230/31.2 29/32.229/32.2 29/3029/30 ……... D7S820D7S820 10/1110/11 8/98/9 11/1111/11 11/1111/11 10/1210/12 ……... CSF1POCSF1PO 10/1110/11 12/1412/14 10/1310/13 11/1211/12 10/1010/10 ……... D3S1358D3S1358 15/1615/16 16/1616/16 16/1716/17 15/1515/15 15/1615/16 ……... D5S818D5S818 10/1110/11 11/1211/12 11/1311/13 10/1110/11 10/1310/13 ……... D13S317D13S317 12/1212/12 11/1111/11 11/1211/12 10/1110/11 11/1111/11 ……... D16S539D16S539 10/1310/13 9/119/11 11/1211/12 11/1211/12 10/1210/12 ……... D2S1338D2S1338 20/2320/23 21/2321/23 18/2418/24 20/2320/23 18/1918/19 ……... D19S433D19S433 13/1413/14 13/1313/13 14/1514/15 13/15.213/15.2 13/1413/14 ……... vWAwxya 17/2017/20 14/1614/16 16/1716/17 13/1413/14 17/1917/19 ……... D12S391D12S391 18/2118/21 19/2019/20 17/17.317/17.3 18/1918/19 18/1818/18 ……... D18S51D18S51 13/1413/14 13/1513/15 14/1514/15 13/1613/16 14/1714/17 ……... AMELAMEL X/XX/X X/YX/Y X/YX/Y X/YX/Y X/YX/Y ……... D6S1043D6S1043 14/21.314/21.3 19/2019/20 13/1413/14 10/1910/19 17/1817/18 ……... FGAFGA 19/2219/22 19/2419/24 23/2423/24 24/2624/26 23/2323/23 ……...

表2待寻亲样本示例Table 2 Examples of Samples to be Searched for

位点site STRSTR D8S1179D8S1179 13/1513/15 D21S11D21S11 29/3129/31 D7S820D7S820 11/1111/11 CSF1POCSF1PO 11/1111/11 D3S1358D3S1358 15/1515/15 D5S818D5S818 10/1210/12 D13S317D13S317 10/1010/10 D16S539D16S539 9/119/11 D2S1338D2S1338 18/2318/23 D19S433D19S433 13/1413/14

vWAwxya 14/1414/14 D12S391D12S391 18/1918/19 D18S51D18S51 13/1513/15 AMELAMEL X/YX/Y D6S1043D6S1043 10/1810/18 FGAFGA 23/2423/24

1、基于倒排索引结构的STR数据存储1. STR data storage based on inverted index structure

步骤一：数据预处理。即将所有待存储的样本整理为标准格式，以00001号样本为例，其整理后的结果为：Step 1: Data preprocessing. That is to organize all the samples to be stored into a standard format, taking sample No. 00001 as an example, the result after sorting is:

x₁＝{D8S1179:(14/15),D21S11:(30.2/31),D7S820:(10/11),CSF1PO:(10/11),D3S1358:(15/16),D5S818:(10/11),D13S317:(12/12),D16S539:(10/13),D2S1338:(20/23),D19S433:(13/14),vWA:(17/20),D12S391:(18/21),D18S51:(13/14),AMEL:(X/X),D6S1043:(14/21.3),FGA:(19/22)}x ₁ ＝{D8S1179:(14/15),D21S11:(30.2/31),D7S820:(10/11),CSF1PO:(10/11),D3S1358:(15/16),D5S818:(10/11 ),D13S317:(12/12),D16S539:(10/13),D2S1338:(20/23),D19S433:(13/14),vWA:(17/20),D12S391:(18/21), D18S51:(13/14),AMEL:(X/X),D6S1043:(14/21.3),FGA:(19/22)}

步骤二：建立数据域。在本例中，所有样本数据的基因座名称完全一致，因此建立的数据域共有16个：Step 2: Create a data field. In this example, the locus names of all sample data are exactly the same, so there are 16 data domains created:

步骤三：数据存储。数据存储以ID为00002的样本为例，此时数据库中已经存储了ID为00001的样本，如图5所示。首先获得第一组数据D8S1179:(13/15)，此时在数据库中存在数据域D8S1179，且存在索引15而不存在索引13，因此需要新建13的索引，并将00002加入到13和15的索引中，如图6所示，之后按照上述方式依次遍历00002号样本的每个数据域。Step 3: Data storage. Data storage takes the sample whose ID is 00002 as an example. At this time, the sample whose ID is 00001 has been stored in the database, as shown in Figure 5. First, get the first set of data D8S1179: (13/15). At this time, there is a data field D8S1179 in the database, and there is index 15 but not index 13. Therefore, it is necessary to create a new index for 13, and add 00002 to 13 and 15 In the index, as shown in Figure 6, each data field of sample 00002 is traversed sequentially in the above manner.

2、基于倒排索引结构的亲子鉴定排序比对方法2. Paternity identification sorting and comparison method based on inverted index structure

步骤一：数据预处理。Step 1: Data preprocessing.

将表2中的寻亲样本整理为如下格式：Organize the family search samples in Table 2 into the following format:

y＝{D8S1179:(13/15),D21S11:(29/31),D7S820:(11/11),CSF1PO:(11/11),D3S1358:(15/15),D5S818:(10/12),D13S317:(10/10),D16S539:(9/11),D2S1338:(18/23),D19S433:(13/14),vWA:(14/14),D12S391:(18/19),D18S51:(13/15),AMEL:(X/Y),D6S1043:(10/18),FGA:(23/24)}y＝{D8S1179:(13/15),D21S11:(29/31),D7S820:(11/11),CSF1PO:(11/11),D3S1358:(15/15),D5S818:(10/12) ,D13S317:(10/10),D16S539:(9/11),D2S1338:(18/23),D19S433:(13/14),vWA:(14/14),D12S391:(18/19),D18S51 :(13/15),AMEL:(X/Y),D6S1043:(10/18),FGA:(23/24)}

步骤二：计算亲子关系指数。Step 2: Calculate the parent-child relationship index.

对于样本y，首先对于第一个数据域D8S1179，其存在13和15的索引，取其样本集合的并集X_j＝{00001,00002,00003,00004,00005,...}，在此基础上计算所有域的并集X＝{00001,00002,00003,00004,00005,...}。对X中的每一个元素x_i，计算其得分，如表3所示：For the sample y, firstly, for the first data field D8S1179, which has indexes of 13 and 15, take the union of its sample sets X _j ={00001,00002,00003,00004,00005,...}, based on Compute the union of all domains X = {00001, 00002, 00003, 00004, 00005, . . . }. For each element x _i in X, calculate its score, as shown in Table 3:

表3样本得分Table 3 Sample Scores

样本IDSample ID 0000100001 0000200002 0000300003 0000400004 0000500005 ……... D8S1179D8S1179 11 11 11 11 11 ……... D21S11D21S11 00 11 00 11 11 ……... D7S820D7S820 11 00 11 11 00 ……... CSF1POCSF1PO 11 00 00 11 00 ……... D3S1358D3S1358 11 00 00 11 11 ……... D5S818D5S818 11 11 00 11 11 ……... D13S317D13S317 00 00 00 11 00 ……... D16S539D16S539 00 11 11 11 00 ……... D2S1338D2S1338 11 11 11 11 11 ……... D19S433D19S433 11 11 11 11 11 ……... vWAwxya 00 11 00 11 00 ……... D12S391D12S391 11 11 00 11 11 ……... D18S51D18S51 11 11 11 11 00 ……... AMELAMEL 11 11 11 11 11 ……... D6S1043D6S1043 00 00 00 11 11 ……... FGAFGA 00 11 11 11 11 ……... q_i q _i 1010 1111 88 1616 1010

步骤三：判断是否存在亲子关系。Step 3: Determine whether there is a parent-child relationship.

依据得分对候选样本进行降序排序，令θ＝15，则可判断样本00004与y具有亲子关系。The candidate samples are sorted in descending order according to the scores, and if θ=15, it can be judged that sample 00004 has a parent-child relationship with y.

依据本系统设计的数据库系统原型如图7和图8所示，其中图7展示了本专利描述的基于倒排索引结构存储的STR数据，图8展示了寻亲的结果。The prototype of the database system designed according to this system is shown in Figure 7 and Figure 8, where Figure 7 shows the STR data stored based on the inverted index structure described in this patent, and Figure 8 shows the result of family search.

综上所述，本发明提出的基于倒排索引结构的STR数据存储及亲子鉴定排序比对方法。该方法通过建立数据域、建立STR数据值的索引等方式，将将成对的STR基因座数据以倒排索引的形式进行存储；在此基础上，通过基于倒排索引结构的亲子鉴定排序比对方法，计算样本之间的亲子关系指数，实现了亲子鉴定的快速比对，提高了检索查询的效率，降低了基因突变对亲子鉴定比对效率的影响；并且由于数据域的使用，极大地提高了该方法的适用范围。To sum up, the present invention proposes an inverted index structure-based method for storing STR data and sorting and comparing paternity tests. The method stores the paired STR locus data in the form of an inverted index by establishing a data field and an index of STR data values; method, calculate the parent-child relationship index between samples, realize the rapid comparison of paternity identification, improve the efficiency of retrieval query, reduce the impact of gene mutation on the comparison efficiency of parent-child identification; and due to the use of data fields, greatly improve the scope of application of this method.

Claims

1. The STR data storage and paternity test sequencing comparison method based on the inverted index structure is characterized by comprising the following steps of:

1) STR data storage based on inverted index structure

Firstly, preprocessing all STR data, and arranging an STR data set of each sample into a standard format; then, each locus is used as a data field, and each data field stores respective STR data; finally, storing the STR data in a reverse index mode;

2) paternity test ordering comparison based on STR data stored in inverted index mode

Firstly, preprocessing STR data to be searched, and sorting the STR data set of each sample into a standard format; then, comparing the STR data of each locus in respective data fields, and forming a final paternity index; and finally, judging whether the samples have the parent-child relationship, if the parent-child relationship index is higher than a specific value, considering that the donor of the candidate sample and the donor of the sample to be searched have the parent-child relationship, otherwise, considering that the parent-child relationship does not exist between the candidate sample and the candidate sample.

2. The method for storing STR data and comparing paternity test and rank order based on the inverted index structure of claim 1, wherein in step 1), the STR data is preprocessed, and the STR data set of each sample is arranged into a standard format, specifically as follows:

denote the sample data set as X ═ X₁,x₂,…,x_n}；

Wherein x is_iRepresents the STR data of the i-th individual,wherein,denotes the name of the jth STR locus, v_jkRepresents the characteristic value of STR at locus j on the kth chromosome.

3. The STR data storage and paternity test sequencing comparison method based on the inverted index structure of claim 2, wherein the data fields in step 1) are established as follows:

traversing STR data of all samples, and establishing a set STR of STR locus names_N＝{str₁,str₂,…,str_mFor STR_NEach str in (1)_iEstablishing different data fields, denoted as d_i；i＝1,2……m。

4. The method for STR data storage and paternity test ranking comparison based on inverted index structure of claim 2, wherein in step 1) STR data is stored in inverted index mode, sample data set X is traversed, and any X is compared_iGo throughIf it is notCorresponding data field d_mIn the presence of v_j1Index of (1), then x_iAdding to the index; if v is not present_j1The index is established, and x is added_iAdding into the index; for v_j2The treatment was carried out in the same manner.

5. The STR data storage and paternity test sorting comparison method according to claim 1, wherein the STR data to be searched for is preprocessed in step 2), specifically as follows:

the seeking parent sample is arranged into the following format: y ═ str_j:(v_j1/v_j2) In which str_jDenotes the name of the jth STR locus, v_jkRepresents the characteristic value of STR at locus j on the kth chromosome.

6. The method for STR data storage and paternity test sequencing comparison based on inverted index structure of claim 5, wherein the paternity index in step 2) is calculated as follows:

for sample y, go through str_j:(v_j1/v_j2) If str is present_jCorresponding field d_mThen obtain v_j1And v_j2The sample sets corresponding to the indexes are respectively marked as X_j1And X_j2；

Taking X_j1And X_j2Is a union of (1), denoted X_j＝X_j1∪X_j2；

After obtaining each str_jCorresponding X_jThen, X is calculated_jX ═ X₁∪X₂∪…∪X_JEach element in X is a candidate sample;

q is then_iIs a candidate sample x_iThe parentage relations index of (1).

7. The STR data storage and paternity test ranking comparison method according to claim 6, wherein in step 2), it is determined whether there is a paternity relationship, specifically as follows:

according to q_iFor candidate sample x_iSorting in descending order, if q_iIf the value is more than or equal to theta, the donor of the sample y to be searched and the candidate sample x are considered_iThe donor of (a) has a parent-child relationship; otherwise, the parent-child relationship does not exist between the two; where θ is a threshold value preset by the system, and is the number of loci minus 1.