CN117095747A

CN117095747A - Method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model

Info

Publication number: CN117095747A
Application number: CN202311095082.8A
Authority: CN
Inventors: 王健; 胡海飞; 赵均良; 聂帅; 马雅美; 董景芳; 杨梯丰; 杨武; 周炼; 陈建松
Original assignee: Rice Research Institute Guangdong Academy Of Agricultural Sciences
Current assignee: Rice Research Institute Guangdong Academy Of Agricultural Sciences
Priority date: 2023-08-29
Filing date: 2023-08-29
Publication date: 2023-11-21
Anticipated expiration: 2043-08-29
Also published as: CN117095747B

Abstract

The invention discloses a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model. The method comprises: mapping high-depth population sequencing data onto the linear pan-genome; for each sample in the population, using the coverage of its second-generation sequencing reads to determine how fully each window is covered, and recording the window position and the number of reads that completely cover the window; constructing an artificial intelligence model and training it on simulated data to judge whether a region covered by consecutive windows contains an inversion or transposon endpoint; scanning all regions on one chromosome with the model to obtain all inversion or transposon endpoints on that chromosome, scanning the remaining chromosomes in turn, and combining the endpoint information from all chromosomes for each sample; and assembling and filtering a population-level inversion or transposon endpoint genotype matrix from the different samples. The invention thereby detects transposon and inversion endpoint genotypes using second-generation sequencing data.

Description

Method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model

Technical field

The present invention relates to the technical field of genotype detection in natural populations, and more specifically to a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model.

Background

In population genetics, obtaining population genotypes is the cornerstone of population research. Traditional population genotype analysis includes single nucleotide polymorphism (SNP) analysis and insertion/deletion (InDel) analysis. Studies have shown that most phenotypic variation is caused by structural variation, and structural variation (SV) analysis has made some progress in recent years. Structural variations include inversions, translocations, duplications, and large insertions or deletions. Third-generation sequencing is still relatively expensive, which greatly limits the discovery of structural variations. The price of second-generation sequencing continues to fall, but its short read length makes structural variation analysis very difficult, especially for inversions and for the translocations and duplications caused by transposon jumping.

When a segment of the genome is inverted, the genome still contains that sequence, so approaches that map second-generation short reads to a reference genome have difficulty detecting the inversion. At present, inversion maps are mainly constructed by aligning multiple assembled high-quality genomes against each other to locate inversions, but obtaining high-quality genomes is very costly, so the resulting inversion maps are limited to certain well-studied species and a limited number of individuals. Transposons can jump actively within the genome and may be present as multiple copies, so second-generation short reads from within a transposon may map to several locations at once, even though a given sample may contain only one copy.

Massive public sequencing datasets already exist for many species, but because of the short read length of second-generation sequencing, research on population transposons and inversions has progressed slowly. Therefore, how to provide a method for detecting population transposons and inversions, and thereby derive genotypes for subsequent research, is a problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of this, the present invention provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, to solve the problems described in the background above and to enable detection of transposon and inversion endpoint genotypes from second-generation sequencing data.

To achieve the above object, the present invention adopts the following technical solution:

The present invention provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, comprising the following steps:

Step 1: Use alignment software to map the population's high-depth sequencing data onto the linear pan-genome;

Step 2: Based on the coverage of each sample's second-generation sequencing reads, use a sliding-window approach to determine how fully each window is covered by reads, record the window position and the number of reads that completely cover the window, and generate a data frame file;

Step 3: Construct an artificial intelligence model (an artificial neural network) and train it on simulated data so that, given the window positions and read counts for a fixed number of consecutive windows obtained in Step 2, it can determine whether that region (the area covered by the consecutive windows) contains an inversion or transposon endpoint;

Step 4: Use the model from Step 3 to scan all regions on one chromosome to obtain the positions of all inversion or transposon endpoints on that chromosome, scan the remaining chromosomes in turn, and combine the results into the inversion or transposon endpoint information for all chromosomes of one sample;

Step 5: Based on the inversion or transposon endpoint information of the different samples, assemble a population-level inversion or transposon endpoint genotype matrix.

As a preferred embodiment of the above technical solution, in Step 1 the sequencing depth of the population high-depth sequencing data should be greater than 20×.

As a preferred embodiment of the above technical solution, in Step 2 a sliding-window approach is used to determine whether each window is completely covered by sequencing reads. A window is a 39 bp region of the genome. Each window is named by combining its chromosome with the position of the window's midpoint (chromosome positions are treated as points on a number line, smaller to the left and larger to the right), and the windows advance in steps of 20 bp.
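As an illustration only, the window layout described above can be sketched in Python. The function name `iter_windows`, the 1-based inclusive coordinates, and the `chromosome:midpoint` naming format are assumptions made for this sketch; the text fixes only the 39 bp size, the 20 bp step, and naming by chromosome plus midpoint.

```python
def iter_windows(chrom, chrom_length, size=39, step=20):
    """Yield (name, left, right) for each complete window on one chromosome.

    Coordinates are assumed to be 1-based and inclusive. The "chrom:mid"
    name format is an illustrative choice, not taken from the patent text.
    """
    left = 1
    while left + size - 1 <= chrom_length:
        right = left + size - 1
        mid = left + size // 2  # centre base of the window
        yield f"{chrom}:{mid}", left, right
        left += step


# Hypothetical 100 bp chromosome, used only to show the layout.
windows = list(iter_windows("Chr1", 100))
```

Because the step (20 bp) is smaller than the window (39 bp), adjacent windows in this sketch overlap by 19 bp.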

As a preferred embodiment of the above technical solution, in Step 2 "completely covered" means that the leftmost position of the read lies to the left of the window (the read's left position is smaller than the window's left position), and that the read's left position, plus the number of bases matched to the pan-genome, plus the number of detected deleted bases, is greater than the window's right position.
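The full-coverage rule can be written as a small predicate. This is a minimal sketch: the function names and the tuple layout `(left, matched, deleted)` are hypothetical. In SAM/BAM terms the matched and deleted base counts would plausibly come from the CIGAR `M` and `D` operations, but that mapping is our reading and is not stated in the text.

```python
def fully_covers(read_left, matched_bases, deleted_bases, win_left, win_right):
    """Return True if the read spans the whole window under the stated rule."""
    read_right = read_left + matched_bases + deleted_bases
    return read_left < win_left and read_right > win_right


def count_full_cover(reads, win_left, win_right):
    """Count reads, given as (left, matched, deleted) tuples, that span the window."""
    return sum(
        fully_covers(left, matched, deleted, win_left, win_right)
        for left, matched, deleted in reads
    )


# Four made-up reads tested against a window covering positions 100..138.
reads = [
    (90, 150, 0),   # starts left of the window and extends past it
    (100, 150, 0),  # starts exactly at the window's left edge: not strictly left
    (90, 40, 0),    # ends at 130, short of the window's right edge
    (90, 40, 10),   # the 10 deleted bases extend the reference span to 140
]
n_full = count_full_cover(reads, win_left=100, win_right=138)
```

Both comparisons are strict, as in the text, so a read that starts exactly on the window's left edge is not counted.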

As a preferred embodiment of the above technical solution, in Step 2, recording the window position and the number of reads that completely cover each window yields a data frame with two columns: the first column is the window position and the second is the number of reads completely covering that window. The data frame is saved as a csv file. Because all samples are aligned to the same linear pan-genome, and the windows are defined on that pan-genome with the same size and step, the csv files of all samples have the same number of rows and identical first columns; only the read counts in the second column differ.

As a preferred embodiment of the above technical solution, the artificial neural network in Step 3 is a network of four fully connected layers; the activation function of the last layer is the sigmoid function and the others are ReLU functions. The model input is an array of length 25 taken from the Step 2 csv file: a sliding window of size 25 and step 22 is applied to the second column of the csv file, and each such window corresponds to a genomic span of 39 bp + (25 - 1) × 20 bp = 519 bp. The model output is a single value used to determine whether that 519 bp region contains an inversion or transposon endpoint.
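The span arithmetic can be checked directly. The sketch below uses only the four numbers given in the text (39 bp window, 20 bp step, 25 windows per model input, stride of 22 windows); the variable and function names are ours. It also derives two quantities the text does not state: the offset between consecutive 519 bp regions and the size of their overlap.

```python
WINDOW_BP, STEP_BP = 39, 20            # small-window size and step (Step 2)
REGION_WINDOWS, REGION_STRIDE = 25, 22  # model input length and stride (Step 3)


def span_bp(n_windows):
    """Genomic span covered by n consecutive small windows."""
    return WINDOW_BP + (n_windows - 1) * STEP_BP


region_bp = span_bp(REGION_WINDOWS)               # 39 + 24 * 20 = 519
region_offset_bp = REGION_STRIDE * STEP_BP        # 22 * 20 = 440
shared_windows = REGION_WINDOWS - REGION_STRIDE   # 3 small windows shared
overlap_bp = span_bp(shared_windows)              # 39 + 2 * 20 = 79
```

Under these numbers, consecutive model inputs share three small windows, that is, 79 bp, which is how a stride of 22 rather than 25 keeps neighbouring regions overlapping.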

As a preferred embodiment of the above technical solution, the Step 2 csv file is preprocessed before being fed to the model: values in the second column that are less than or equal to 3 are set to 0, meaning the window is not completely covered, and values greater than 3 are set to 1, meaning the window is completely covered.
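The preprocessing step amounts to a strict threshold on each count. A minimal sketch, in which the function name `binarize` and the sample counts are illustrative:

```python
def binarize(counts, threshold=3):
    """Map full-coverage read counts to 0/1 flags: 1 only if count > threshold."""
    return [1 if count > threshold else 0 for count in counts]


# Made-up counts for six consecutive windows.
flags = binarize([27, 31, 3, 0, 4, 25])
```

A count equal to the threshold maps to 0, matching the "less than or equal to 3" wording.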

As a preferred embodiment of the above technical solution, in Step 3 the model is trained on simulated data. The simulated data are generated artificially and include cases in which an inversion or transposon endpoint is clearly present (for example, ten consecutive 1s, a 0 in position 11, and 1s in positions 12 to 25) and cases in which it is clearly absent (for example, fifteen 0s followed by ten 1s). These are assembled into a data set of feature values paired with labels for model training.

As a preferred embodiment of the above technical solution, in Step 4 chromosomes and samples are each processed separately; because the computations are independent of one another, they can be run in parallel.

From the above technical solution, the technical effects achieved compared with the prior art are as follows:

1) The present invention proposes using the number of sequencing reads that completely cover a region as a measure of coverage. Previously, the coverage of a region was determined by first counting the reads covering each single-base site and then averaging those per-base counts. That approach cannot detect inversion or transposon endpoints, because an endpoint is essentially a single site. After sequencing data are mapped to the reference genome, the fault-tolerance parameters that alignment software necessarily includes allow mapped reads to extend 1-10 bp across the endpoint, so the endpoint and nearby bases are all covered by multiple reads. The average coverage of a region containing the endpoint is therefore non-zero and resembles that of continuously covered regions without an endpoint, so earlier methods cannot detect it. The present invention instead counts reads that completely cover a 39 bp window, sliding the window in 20 bp steps, which removes the effect of spurious partial matches. If an endpoint exists, the 39 bp window containing it is detected as having zero completely covering reads.
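The contrast between mean per-base depth and the full-coverage count can be illustrated with a toy simulation. Apart from the 39 bp window and the 1-10 bp overhang range, every number here is invented for the sketch: a breakpoint at position 500, 150 bp reads starting every 10 bp, and a 5 bp overhang. The simulation also assumes that a read crossing the breakpoint is truncated on its shorter side, and it uses inclusive `(start, end)` intervals rather than the matched-plus-deleted rule.

```python
def simulate_reads(bp=500, overhang=5, read_len=150, spacing=10):
    """Return aligned (start, end) intervals around a hypothetical breakpoint."""
    reads = []
    for start in range(bp - 300, bp + 300, spacing):
        end = start + read_len - 1
        if start <= bp <= end:  # the read crosses the breakpoint
            if bp - start >= end - bp:
                end = min(end, bp + overhang)      # keep the left part
            else:
                start = max(start, bp - overhang)  # keep the right part
        reads.append((start, end))
    return reads


def mean_depth(reads, left, right):
    """Average per-base depth over the inclusive interval [left, right]."""
    covered = sum(max(0, min(e, right) - max(s, left) + 1) for s, e in reads)
    return covered / (right - left + 1)


def full_cover_count(reads, left, right):
    """Number of reads that start before and end after the window."""
    return sum(1 for s, e in reads if s < left and e > right)


reads = simulate_reads()
bp_mean = mean_depth(reads, 481, 519)        # window centred on the breakpoint
bp_full = full_cover_count(reads, 481, 519)
flank_full = full_cover_count(reads, 381, 419)  # window 100 bp to the left
```

In this toy setting the window containing the breakpoint keeps a non-zero mean depth while no read spans it, whereas the flanking window is spanned by several reads.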

2) The present invention studies inversions and transposons at the population level, using second-generation sequencing data together with a linear pan-genome to locate inversion or transposon endpoints (also called breakpoints in the present invention), and can thereby obtain richer endpoint position information. Previously, genomes were assembled from third-generation sequencing and high-quality genomes were compared with each other to locate inversions or transposons, at great cost; because the number of high-quality genomes is limited, so is the number of inversions or transposons found. The present invention can exploit the large volume of publicly available second-generation sequencing data, which is a clear advantage.

3) The present invention applies an artificial intelligence model to identify inversion or transposon endpoints (breakpoints). Such a model can extract abstract features, handle complex cases, and identify endpoints accurately. When sequencing data from material containing an endpoint are mapped to the linear pan-genome, the alignment has a distinctive signature: the endpoint (the 39 bp interval containing it) is not completely covered by any read, while both sides of the endpoint are completely covered by multiple reads. Traditional methods can readily judge a single site or a single region (one value), but calling an endpoint requires combining multiple values whose order affects the result, which traditional methods handle poorly. Artificial intelligence models are well suited to such non-linear, feature-extraction problems.

4) The computations for each window are independent of one another, so multi-threaded parallel processing can be used to reduce running time.

Description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of the method of the present invention for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model;

Figure 2 is a schematic diagram of reads completely covering a region;

Figure 3 is a schematic diagram of the sliding-window method for measuring genome-level coverage (complete coverage);

Figure 4 shows the generation of the training data set and the structure of the artificial intelligence model;

Figure 5 is a schematic diagram of the sliding-window method that converts the data structure for use by the artificial intelligence model.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Example 1

Referring to Figures 1-4, this example provides a method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, comprising:

A. Genome-wide coverage statistics for second-generation sequencing data

A1. Extract DNA from the population samples.

A2. Randomly fragment the DNA by sonication, build a library, and perform high-depth sequencing (at least 20×) using second-generation DNA sequencing technology. Second-generation sequencing data currently have a read length of 150 bp.

A3. Use alignment software to align the high-depth sequencing data of each sample to the linear pan-genome. The alignment software may be BWA or Samtools; neither has a Chinese name and both are referred to directly in English in the field. They are used to extract the target sequences.

A4. Taking 39 bp as one window, count the number of reads that completely cover the window. Here, "reads" are the sequences produced by second-generation sequencing, each a 150 bp base sequence; the English term is used directly in the field.

As shown in Figure 2, if the leftmost end of a read lies to the left of the window and its rightmost end lies to the right of the window, the read completely covers (spans) the window, and the window's count of completely covering reads is incremented by 1. In practice, the rightmost end of a read is computed as its leftmost position plus the number of bases matched to the linear pan-genome plus the number of detected deleted bases.

A5. Slide a window along the linear pan-genome with a window size of 39 bp and a step of 20 bp, and determine for each 39 bp window whether it is completely covered by reads. Combine the window positions and the counts of completely covering reads into a two-column data frame. To reduce subsequent computation, compare each count with a threshold (3 in this example; it can be raised if the sequencing depth is greater): if the count is greater than 3, record 1 to indicate that the 39 bp window is completely covered; if it is less than or equal to 3, record 0 to indicate that it is not. Combine the window positions with these 0/1 values to form a two-column data frame (Figure 3).

B. Training an artificial intelligence model to identify inversion or transposon endpoints (breakpoints)

B1. Create the training set. Read coverage at an inversion or transposon endpoint (breakpoint) has a distinctive signature: the region before the breakpoint is completely covered by many reads, as is the region after it, but the 39 bp window containing the breakpoint cannot be completely covered by any read. Therefore, in the data frame produced in A5, if a region (519 bp, comprising 25 small windows) shows a run of consecutive 1s, then a single 0, then another run of 1s, the region contains a breakpoint. Visualization of real third-generation genome sequencing data shows that most breakpoints occupy a single base position, but a breakpoint may also span several consecutive bases or a short sequence. A run of 1s followed by 1 to 10 zeros and then all 1s therefore also indicates that the region contains a breakpoint. With a sliding window, the zero or zeros may fall at different positions within the window, which makes the situation complex and hard for traditional algorithms to judge.

First create all data that contain a breakpoint (Figure 4): the number of windows is set to 25, so each feature vector is an array of 25 numbers. Simulate the case of one 0, which may appear at any position in the array; simulate the case of two 0s, which may appear at any position in the interior of the array; and so on. Then create all data that do not contain a breakpoint: when one or more 0s appear at the far left or far right of the array, the window is the boundary of a large insertion or deletion; if the array contains more than ten 0s, the window contains an insertion or deletion. The 25 consecutive 0/1 values are the feature values and the presence or absence of a breakpoint (1 or 0) is the label; pairing feature values with labels forms the data set.
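One way to enumerate such a training set is sketched below. This is not the patent's a1_create_training_data.py. It assumes a single contiguous run of 0s per vector, treats a run of 1 to 10 zeros as a breakpoint only when at least one 1 flanks it on both sides, and labels edge-touching runs, runs longer than 10, and the all-1 vector as negatives. The function names are hypothetical.

```python
def zero_run(n, start, gap):
    """Return n ones with `gap` consecutive zeros beginning at index `start`."""
    x = [1] * n
    x[start:start + gap] = [0] * gap
    return x


def make_training_set(n=25, max_gap=10):
    """Enumerate (features, label) pairs under the assumptions stated above."""
    data = []
    # Positives: an interior run of 1..max_gap zeros flanked by 1s.
    for gap in range(1, max_gap + 1):
        for start in range(1, n - gap):
            data.append((zero_run(n, start, gap), 1))
    # Negative: continuous coverage, no zeros at all.
    data.append(([1] * n, 0))
    # Negatives: zero run touching the left edge (includes the all-zero vector).
    for gap in range(1, n + 1):
        data.append((zero_run(n, 0, gap), 0))
    # Negatives: zero run touching the right edge.
    for gap in range(1, n):
        data.append((zero_run(n, n - gap, gap), 0))
    # Negatives: interior run longer than max_gap (treated as an indel).
    for gap in range(max_gap + 1, n - 1):
        for start in range(1, n - gap):
            data.append((zero_run(n, start, gap), 0))
    return data


data = make_training_set()
n_positive = sum(label for _, label in data)
```

The two examples given in the summary (ten 1s, one 0, then 1s; and fifteen 0s followed by ten 1s) both appear in this enumeration, labelled 1 and 0 respectively.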

B2. Create the artificial intelligence model: build an artificial neural network with three hidden layers and one output layer. The hidden layers have 64, 16, and 4 nodes respectively and use the ReLU activation function; the output layer uses the sigmoid function to output a single value. The model input is an array of 25 feature values, matching the training set.
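To make the layer shapes concrete, the following pure-Python sketch runs a forward pass through a 25-64-16-4-1 network. It is not the patent's implementation: the weights are random and untrained, the initialization range and all function names are arbitrary choices for illustration, and the sketch only shows how the dimensions fit together.

```python
import math
import random

LAYER_SIZES = [25, 64, 16, 4, 1]  # input, three hidden layers, output


def init_params(seed=0):
    """Create untrained (weights, biases) pairs for each fully connected layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """ReLU on the hidden layers, sigmoid on the final single-node layer."""
    activations = list(x)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if i == last:
            activations = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activations = [max(0.0, v) for v in z]
    return activations[0]


def n_parameters(params):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


params = init_params()
score = forward(params, [1] * 10 + [0] + [1] * 14)
total = n_parameters(params)
```

With these layer sizes the network has 2,777 trainable parameters (1,664 + 1,040 + 68 + 5), and the sigmoid output always lies strictly between 0 and 1.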

B3. Train the model on the training set. The data set is fed to the model, the feature values are combined with the model parameters, and the result is propagated through the layers to produce an output. The output is compared with the true label, and the model parameters are adjusted by gradient descent to steadily reduce the gap between the output and the true label. Multiple sets of data are input and training is repeated for 4,000 iterations, after which the model identifies the presence of a breakpoint with high accuracy (99.9%).
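The loop described in B3 (forward pass, comparison with the label, gradient descent update, repeat) can be shown on a much smaller stand-in. The sketch below deliberately swaps the four-layer network for a single sigmoid unit and the breakpoint data for a toy task, namely predicting whether the middle of five binary values is 0, so that it runs quickly in plain Python. The learning rate, epoch count, and cross-entropy loss are assumptions of the sketch, not values from the patent.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, lr=0.5):
    """Full-batch gradient descent on mean binary cross-entropy."""
    n_features = len(samples[0][0])
    w, b = [0.0] * n_features, 0.0
    losses = []
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * n_features, 0.0, 0.0
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # forward pass
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)   # compare to label
            err = p - y                                           # dLoss/dz
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        m = len(samples)
        w = [wi - lr * gi / m for wi, gi in zip(w, grad_w)]       # update step
        b -= lr * grad_b / m
        losses.append(loss / m)
    return w, b, losses


def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 else 0


# Toy data: every binary vector of length 5, labelled 1 when the middle value is 0.
toy = [(list(x), 1 - x[2]) for x in product([0, 1], repeat=5)]
w, b, losses = train(toy)
accuracy = sum(predict(w, b, x) == y for x, y in toy) / len(toy)
```

The recorded losses fall as training proceeds, which is the "steadily reduce the gap" behaviour the step describes; the real model applies the same idea through backpropagation across all four layers.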

C. Identifying population inversion or transposon endpoints (breakpoints) with the artificial intelligence model

C1. For each sample, step A5 produces a data matrix representing genome-wide coverage (the first column is the window position and the second indicates whether the window is completely covered). Apply a sliding window to this matrix, taking 25 values per window with a step of 22 (Figure 4). The matrix is thereby converted into one with 25 feature values per row (25 columns, with each row renamed after the interval spanned by its 25 values).

When sliding the window, first ensure that all 25 small windows lie on the same chromosome. If the last window on a chromosome cannot hold 25 small windows (for example, only 10 values remain at the end of chromosome 1), that window is discarded.
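A minimal sketch of this reshaping for one chromosome follows. The function name `regions_for_chromosome` and the `first-last` row label format are hypothetical; calling the function once per chromosome is what keeps all 25 small windows on the same chromosome.

```python
def regions_for_chromosome(names, flags, size=25, step=22):
    """Slice one chromosome's 0/1 flags into rows of `size` values.

    `names` and `flags` are parallel lists for a single chromosome. A trailing
    group with fewer than `size` values is dropped, as described above.
    """
    rows = []
    for i in range(0, len(flags) - size + 1, step):
        label = f"{names[i]}-{names[i + size - 1]}"  # illustrative row name
        rows.append((label, flags[i:i + size]))
    return rows


# Hypothetical chromosome with 60 small windows named by midpoint.
names = [f"Chr1:{20 + 20 * i}" for i in range(60)]
flags = [1] * 60
rows = regions_for_chromosome(names, flags)
```

With 60 small windows, rows start at indices 0 and 22; a third row would need indices 44 to 68, so the 13 values left after index 46 are discarded.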

C2. Feed the 25 values in each row of the data matrix to the model, which outputs whether that region (the interval spanned by the 25 small windows) contains a breakpoint.

C3. Organize the model output so that each position is paired with its breakpoint call, arrange the results by chromosome, and write them to a file, one result file per sample.

D. Merging and filtering breakpoint data

D1. Because all samples use the same linear pan-genome as the reference, the first column of the A5 output file, which holds the window positions on the chromosome, is identical across samples. After the sliding-window step in C1, the output files of all samples still have the same number of rows and the same row names.

D2. Merge the breakpoint files of all samples column-wise, so that each sample contributes one column.
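Because every sample shares the same row names, the merge reduces to placing the per-sample call columns side by side. The sketch below assumes each sample's calls have already been loaded as a list of `(region, call)` pairs; the function name, the input layout, and the region labels are hypothetical.

```python
def merge_samples(per_sample):
    """Combine per-sample breakpoint calls into a regions x samples matrix.

    `per_sample` maps sample name -> list of (region, call) pairs, where all
    samples are expected to list the same regions in the same order.
    """
    samples = sorted(per_sample)
    regions = [region for region, _ in per_sample[samples[0]]]
    for sample in samples:
        if [region for region, _ in per_sample[sample]] != regions:
            raise ValueError(f"row names of {sample} do not match")
    matrix = [
        [per_sample[sample][i][1] for sample in samples]
        for i in range(len(regions))
    ]
    return regions, samples, matrix


# Two made-up samples with two regions each.
per_sample = {
    "S1": [("Chr1:1-519", 0), ("Chr1:441-959", 1)],
    "S2": [("Chr1:1-519", 0), ("Chr1:441-959", 0)],
}
regions, samples, matrix = merge_samples(per_sample)
```

The explicit row-name check guards the assumption stated in D1; if a file was produced against a different reference, the merge fails instead of silently misaligning rows.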

D3. The merged file contains a matrix whose row names are the breakpoint positions on the chromosome (each a 519 bp region) and whose columns are the sample names. This matrix is the preliminary breakpoint genotype matrix. Filter each row by minor allele frequency (MAF), removing rows with MAF below 0.05. The result is the filtered breakpoint genotype matrix, which is written to a file.
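A sketch of the MAF filter on a 0/1 matrix follows. It assumes each sample contributes one presence/absence call per region, so the allele frequency is simply the fraction of samples carrying the breakpoint; the patent does not spell out how MAF is computed, and the function names and sample data are illustrative.

```python
def minor_allele_frequency(calls):
    """Frequency of the rarer state among 0/1 calls for one region."""
    freq = sum(calls) / len(calls)
    return min(freq, 1.0 - freq)


def filter_by_maf(regions, matrix, min_maf=0.05):
    """Keep rows whose MAF is at least `min_maf` (rows below it are removed)."""
    kept = [
        (region, row)
        for region, row in zip(regions, matrix)
        if minor_allele_frequency(row) >= min_maf
    ]
    return [region for region, _ in kept], [row for _, row in kept]


# Three made-up regions across 100 samples.
regions = ["r1", "r2", "r3"]
matrix = [
    [1] * 3 + [0] * 97,   # breakpoint in 3% of samples: MAF 0.03
    [1] * 10 + [0] * 90,  # breakpoint in 10% of samples: MAF 0.10
    [1] * 100,            # breakpoint in every sample: MAF 0
]
kept_regions, kept_matrix = filter_by_maf(regions, matrix)
```

Taking the minimum of the frequency and its complement means that breakpoints present in nearly all samples are filtered just like those present in almost none.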

The application of the present invention is described in detail below with reference to the accompanying drawings:

Example 2: Obtaining genome-level read coverage for a single sample from high-depth sequencing

DNA was extracted from leaves of the rice population material using the detergent-based CTAB (cetyltrimethylammonium bromide) method, and high-depth sequencing was performed on the Illumina NovaSeq 6000 platform, yielding about 20 Gb of data per sample. With a rice genome size of 400 Mb, this corresponds to a sample depth of 50×. This example uses sample IRRI2K_91, whose compressed paired-end sequencing files are 6.8 Gb (R1.fq.gz) and 7.0 Gb (R2.fq.gz).
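The stated depth follows from dividing the sequencing yield by the genome size. The short check below assumes that "20 Gb" means 20 gigabases of sequence and "400 Mb" means 400 megabases; the function name is ours.

```python
def sequencing_depth(total_bases, genome_size_bp):
    """Average depth = sequenced bases / genome size."""
    return total_bases / genome_size_bp


depth = sequencing_depth(20e9, 400e6)  # 20 Gb over a 400 Mb genome
```

At 50×, the sample is comfortably above the 20× minimum required in Step 1.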

The paired-end sequencing data were aligned to the rice linear reference genome with BWA to obtain an alignment file in bam format. Picard was then used to sort the bam file, mark duplicates, and build an index; the resulting bam file is about 8.1 Gb.

The in-house program a0_read_one_bam_absolute_depth1.09.py was run with the full-coverage window size set to 39 bp and the step set to 20 bp. The program takes the bam file as input and outputs a csv file with two comma-separated columns: the first records the position of each window on the chromosome and the second records the number of reads that completely cover that window. The output file, IRRI2K_91absolute_depth1.09.csv, is 347M in size and has 21,771,459 rows.

实施例3训练人工智能模型判别倒位或转座子端点(断点)Example 3: Training an artificial intelligence model to identify inversions or transposon endpoints (breakpoints)

运行自编程序a1_create_training_data.py，设定窗口数量参数为25，设定断点最大包含10个连续的0，程序运行过程会产生训练集，包含多行特征值和对应标签值，以csv文件输出。Run the self-written program a1_create_training_data.py, set the window number parameter to 25, and set the breakpoint to contain a maximum of 10 consecutive 0s. The program will generate a training set, including multi-line feature values and corresponding label values, and output it as a csv file. .

The self-written program uses Python to call the tensorflow module and defines the network structure of the artificial intelligence model: three hidden layers with 64, 16, and 4 nodes respectively, ReLU as the activation function, and an output layer that uses the sigmoid function to output a single value. The model input is an array of 25 feature values, matching the training set generated by the program. The program feeds the training set into the model and sets the number of training iterations; after 4,000 iterations, the prediction accuracy of the model for breakpoints in the training set exceeds 99.9%, and the trained model is saved.
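To make the layer shapes concrete, the following plain-Python sketch runs a forward pass through a network with the sizes described above (25 inputs, hidden layers of 64, 16, and 4 nodes, one sigmoid output). It deliberately does not use tensorflow and performs no training; the weights are random, and the initialisation range and function names are assumptions chosen only for illustration.

```python
import math
import random

LAYER_SIZES = [25, 64, 16, 4, 1]  # input, three hidden layers, output

def init_params(seed=0):
    """Random, untrained weights; one (weights, biases) pair per layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def forward(x, params):
    """ReLU on the hidden layers, sigmoid on the single output node."""
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in z]
        else:
            x = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return x[0]

def count_params(params):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)

params = init_params()
score = forward([1] * 25, params)  # a value between 0 and 1
```

A network of this size has only a few thousand parameters, which is consistent with training it on simulated data in a few thousand iterations. In tensorflow the same stack would typically be written as a sequence of fully connected (dense) layers.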

Example 4: Detecting inversion or transposon endpoints (breakpoints) with the artificial intelligence model

Use the self-written program a2_depth_shape_cast.py to process the IRRI2K_91absolute_depth1.09.csv file with the minimum coverage threshold set to 3: values in the second column of IRRI2K_91absolute_depth1.09.csv that are less than or equal to 3 are converted to 0, and values greater than 3 are converted to 1. With the window size parameter set to 25, the program applies a sliding window to the two columns of data and converts them into matrix data with 25 feature values per row (Figure 4), which is output to the file IRRI2K_91absolute_depth_shape25.csv. The self-written program a3_detecting_breakpoint.py then loads the trained model and performs breakpoint detection on each row of IRRI2K_91absolute_depth_shape25.csv; the position information and the prediction results correspond one to one and are output to the file IRRI2K_91_breakpoint.csv.
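The binarization and sliding-window conversion can be sketched as follows. This is an illustration under stated assumptions rather than the a2_depth_shape_cast.py program: the two csv columns are passed in as Python lists, the sliding step is left as a parameter (the default of 1 is an assumption; claim 6 states a step of 22), and each output row keeps the first and last window position it covers. Function names are hypothetical.

```python
MIN_COVERAGE = 3  # counts <= 3 become 0, counts > 3 become 1
N_FEATURES = 25   # number of consecutive windows per model input

def binarize(counts, threshold=MIN_COVERAGE):
    """Convert full-coverage read counts to 0/1 coverage flags."""
    return [1 if c > threshold else 0 for c in counts]

def to_feature_rows(positions, counts, size=N_FEATURES, step=1):
    """Slide over the binarized column and emit rows of `size` features.

    Returns (first_position, last_position, features) tuples.
    step=1 is an assumption made for this sketch.
    """
    bits = binarize(counts)
    rows = []
    for i in range(0, len(bits) - size + 1, step):
        rows.append((positions[i], positions[i + size - 1], bits[i:i + size]))
    return rows

def span_bp(size=N_FEATURES, window=39, step=20):
    """Genomic length covered by `size` consecutive windows."""
    return window + (size - 1) * step

# Toy input: 30 windows with a five-window coverage gap.
positions = [19 + 20 * i for i in range(30)]
counts = [5] * 10 + [0] * 5 + [5] * 15
feature_rows = to_feature_rows(positions, counts)
print(span_bp())  # 39 + (25 - 1) * 20 = 519
```

Each feature row therefore describes a 519 bp stretch of the reference, which is the unit the model classifies.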

The same steps are performed for each sample in the population to obtain an IRRI2K_**_breakpoint.csv file per sample (** stands for the other sample numbers in the population). Because the processing of one sample does not depend on any other sample, this step can be run with multi-process parallelism to speed up the analysis. Because all samples use the same linear pan-genome as the reference genome, the position information in the resulting IRRI2K_**_breakpoint.csv files is consistent. Using the pandas module in Python, all IRRI2K_**_breakpoint.csv files are read and merged into a single genotype matrix, in which rows are positions and columns are sample names, and the matrix is output to a file. The self-written program a4_filter_population_breakpoint.py then filters the genotype matrix by minor allele frequency (MAF) to obtain the filtered population inversion or transposon endpoint genotypes.
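The merge and MAF filter can be illustrated with a small standard-library sketch. It stands in for the pandas-based merge and for a4_filter_population_breakpoint.py, neither of which is reproduced here. Assumptions: each sample's calls are already loaded as a position-to-0/1 dictionary, each sample contributes one call per position, a position missing from a sample is treated as 0, and the MAF threshold of 0.05 is a placeholder (the source does not state one). Sample and function names are hypothetical.

```python
def build_matrix(per_sample):
    """per_sample: {sample_name: {position: 0 or 1}}.

    Returns (samples, matrix) where matrix maps each position (row)
    to a list of calls ordered like `samples` (columns).
    """
    samples = sorted(per_sample)
    positions = sorted(set().union(*(calls.keys() for calls in per_sample.values())))
    matrix = {pos: [per_sample[s].get(pos, 0) for s in samples]
              for pos in positions}
    return samples, matrix

def minor_allele_frequency(calls):
    """Frequency of the rarer of the two states (0 or 1) across samples."""
    freq = sum(calls) / len(calls)
    return min(freq, 1.0 - freq)

def filter_by_maf(matrix, min_maf=0.05):
    """Keep positions whose MAF reaches the (assumed) threshold."""
    return {pos: calls for pos, calls in matrix.items()
            if minor_allele_frequency(calls) >= min_maf}

# Toy input with two hypothetical samples.
per_sample = {
    "sample_A": {"Chr1_19": 1, "Chr1_39": 0, "Chr1_59": 1},
    "sample_B": {"Chr1_19": 1, "Chr1_39": 0, "Chr1_59": 0},
}
samples, matrix = build_matrix(per_sample)
kept = filter_by_maf(matrix)
```

Positions where every sample has the same call carry no information about variation in the population, so they have a MAF of 0 and are removed.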

Thirty positions at which breakpoints were detected were randomly selected from the filtered genotypes, and these 30 regions were inspected manually using the IGV visualization software. All 30 positions were fully consistent with the characteristics of inversion or transposon endpoints, showing that the present invention can replace manual inspection for detecting breakpoints.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be practiced in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model, characterized by comprising the following steps:

Step 1: Use alignment software to align the population's high-depth sequencing data to the linear pan-genome;

Step 2: Based on the coverage of the second-generation sequencing reads of each sample in the population, use the sliding-window method to detect whether each window is completely covered by sequencing reads, record the window position information and the number of reads that completely cover the window, and generate a data frame file;

Step 3: Construct an artificial intelligence model and train the model on simulated data so that it can determine, from the positions and read counts of a fixed number of consecutive windows obtained in step 2, whether the region covered by those consecutive windows contains an inversion or transposon endpoint;

Step 4: Use the model from step 3 to scan all regions on a chromosome to obtain the position information of all inversion or transposon endpoints on that chromosome, scan multiple chromosomes in sequence, and summarize the inversion or transposon endpoint information on all chromosomes of a sample;

Step 5: Based on the inversion or transposon endpoint information of different samples, organize and screen the population-level inversion or transposon endpoint genotype matrix.

2. The method according to claim 1, characterized in that in step 1, the sequencing depth of the population high-depth sequencing data is greater than 20×.

3. The method according to claim 1, characterized in that: the window in step 2 refers to a 39 bp region on the genome, and the position of the window is named by combining the chromosome and the middle position of the window; the step size of the sliding window is 20 bp.

4. The method according to claim 3, characterized in that: complete coverage in step 2 means that the leftmost position of the sequencing read is to the left of the window, and that the leftmost position of the read plus the number of bases matched to the pan-genome plus the number of detected deleted bases is greater than the rightmost position of the window.

5. The method according to claim 4, characterized in that: in step 2, the window position information and the number of reads that completely cover the window are recorded to obtain a data frame containing two columns of data, in which the first column is the window position and the second column is the number of reads that completely cover the window, and the data frame is saved in a csv file.

6. The method according to claim 5, characterized in that: the artificial intelligence model in step 3 is an artificial neural network model consisting of four fully connected layers, in which the activation function of the last layer is a sigmoid function and the other activation functions are ReLU functions; the input of the model is an array of length 25, derived from the data frame csv file in step 2;

a windowing operation is performed on the second column of the csv file with a window size of 25 and a step size of 22; the length of the genome sequence corresponding to each window is 39 bp + (25-1) × 20 bp = 519 bp; the output of the model is a single value used to determine whether the 519 bp region contains an inversion or transposon endpoint.

7. The method according to claim 6, characterized in that: the data frame csv file from step 2 is preprocessed before being input into the model, such that a number in the second column of the csv file that is less than or equal to 3 is defined as 0, indicating no complete read coverage, and a number greater than 3 is defined as 1, indicating complete read coverage.

8. The method according to claim 6, characterized in that the model is trained on simulated data in step 3 as follows:

the simulated data are generated artificially, covering cases in which inversion or transposon endpoints clearly exist and cases in which they clearly do not exist, and are assembled into a data set of feature values with corresponding labels for training the model.

9. The method according to claim 1, characterized in that in step 4, chromosomes are processed separately and samples are processed separately, using independent parallel operations.