CN113963746B

CN113963746B - A deep learning-based genome structural variation detection system and method

Info

Publication number: CN113963746B
Application number: CN202111156180.9A
Authority: CN
Inventors: 叶凯; 蔺佳栋; 王松渤; 杨晓飞
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2023-09-19
Anticipated expiration: 2041-09-29
Also published as: CN113963746A

Abstract

The invention provides a deep learning-based genomic structural variation detection system and method. The method comprises four main steps: (1) extracting structural variation feature sequences using existing sequence alignment techniques; (2) encoding structural variation feature sequence similarity images as RGB images; (3) predicting the structural variations contained in those similarity images with a multi-target recognition framework; and (4) systematically representing complex structural variation types with a graph data structure. The invention detects simple and complex structural variations simultaneously from sequence-difference encoded images without relying on any structural variation model.

Description

A deep learning-based genome structural variation detection system and method

Technical field

The invention belongs to the technical field of precision medicine and relates to a deep learning-based genomic structural variation detection system and method.

Background art

Over the past decade, large international collaborative projects based on second-generation sequencing data, such as TCGA, ICGC and HGSVC, have continually revealed how genomic structural variation differs among populations and how closely it is linked to the onset of genetic diseases, tumors and other diseases. In the past five years, as third-generation long-read sequencing has developed and become more widely available, the number of known structural variations in the human germline has reached 2.5 times the number detected by second-generation sequencing, and these variations provide an important foundation for subsequent research on evolution and related diseases. More importantly, a growing number of variations once regarded as simple have been shown by further analysis to be complex structural variations; for example, complex structural variation in the genome was first comprehensively described in Nature in 2015. Complex structural variations are distinctive first of all because they form in ways entirely different from simple structural variations, and as a largely unexplored part of the genome they give researchers new evidence for studying genome damage repair mechanisms.

Complex structural variations are also strongly associated with genetic and developmental diseases, and related studies have greatly enriched researchers' understanding of them. For example, a study published in Genome Biology in 2017 identified 16 different types of complex variation and analyzed in depth their role in the development of autism. However, complex structural variations arising in the germline often cannot be detected by existing conventional clinical methods, and because of the complexity of the variations themselves and the limitations of existing detection methods, methods based on third-generation sequencing cannot detect these complex events accurately either. Compared with germline structural variation, tumor genomes have undergone repeated and rapid selection and therefore contain more large-scale complex structural variations, such as chromothripsis and chromoplexy. These complex structural variations are thought to form rapidly within a short period during tumor development and to promote tumor progression to a great extent.

Overall, with the introduction of the overarching concept of "big health" and the increasingly prominent problem of population aging in China, the incidence of genetic diseases, developmental diseases and cancer keeps rising. As the price of third-generation sequencing continues to fall, whole-genome testing based on third-generation sequencing technology will therefore inevitably become a trend in clinical diagnosis. Against this background, large amounts of sequencing data will be generated, and interpreting these data, especially data related to clinical disease, will become the key constraint on the development of the entire industry.

At present, whole-genome structural variation detection for third-generation sequencing still follows the detection theory developed for second-generation sequencing data, which comprises three main steps: (1) building models of known genomic structural variations; (2) inferring the abnormal alignment features that each model would produce in the alignment of sequencing data; and (3) matching the alignments of sequencing reads against the abnormal alignment feature models built for the different structural variation types to obtain the final detection result. Tools developed along these lines, such as PBSV, Sniffles, SVIM, NanoVar and CuteSV, have been widely used to detect germline structural variations and to analyze a small number of disease and tumor samples. To detect complex structural variations, most tools resort to patching, that is, adding the abnormal alignment model of each new structural variation type to the existing tool. The most representative example is Sniffles, the first tool to detect two types of complex structural variation by adding extra abnormal alignment models. Yet even after years of progress in sequencing technology, what researchers know about genomic structural variation is still only the tip of the iceberg. Detecting structural variations by patching treats the symptoms rather than the cause, and it still cannot uncover unknown types of structural variation present in the genome.

In addition, because tools built on this modeling approach need dedicated code for every variation type, their code is especially complex and hard to read, which directly leads to low computational efficiency and difficult maintenance. This is mainly because the complex abnormal alignment features of complex structural variations make detection sensitivity vary enormously across size ranges and variation types. For example, as shown in Figure 1 for a simple deletion and a complex deletion-inversion, existing tools report the complex structural variation as a separate deletion or inversion, and some tools miss the event altogether. In the past two or three years, as more and more complex structural variations have been discovered through laborious manual analysis, biomedical researchers have gradually recognized that complex structural variations play an important role in certain diseases that cannot be definitively diagnosed. A new detection system that delivers better, more comprehensive structural variation detection is therefore a key technology for advancing future clinical testing.

Beyond the limitations of models, repetitive sequences have long been a key factor affecting structural variation detection, and no effective solution exists to date. Furthermore, there has never been a unified way to represent complex structural variations: most studies describe a complex structural variation type as a combination of simple variations accompanied by a detailed textual explanation. The biggest problem with this practice is that it makes it difficult to compare the complex structural variations detected in different studies.

In summary, over nearly ten years of development, researchers have used genome sequencing data to detect simple types of variation and have applied that information to the study of human evolution, population migration and admixture, and disease mechanisms and treatments, which has greatly advanced biomedicine. However, structural variation detection theory based on modeling and patching can no longer meet the future variation detection needs of research institutions, hospitals and genetic testing service providers, and in particular it cannot support the shift from targeted testing to whole-genome testing.

Summary of the invention

To address the problems of existing whole-genome structural variation detection techniques, the present invention proposes a deep learning-based genomic structural variation detection system and method that detects simple and complex structural variations simultaneously from sequence-difference encoded images without relying on any structural variation model.

To achieve the above object, the technical solution of the present invention is as follows:

A deep learning-based genomic structural variation detection method, comprising:

Step 1, structural variation feature sequence extraction: align the sample sequence to the reference genome sequence to obtain a global alignment result, and extract structural variation feature sequences from the global alignment result; the matched fragments in a structural variation feature sequence are called main fragments; according to the alignment features of the structural variation feature sequence, perform local k-mer realignment of the unmatched fragments of the feature sequence against the reference genome sequence; the matched fragments obtained by local k-mer realignment are called secondary fragments;

Step 2, structural variation feature sequence similarity image encoding: using three-channel RGB image encoding and combining the main and secondary fragments, encode the structural variation feature sequence against the reference genome sequence to obtain a similarity RGB image of the reference genome sequence and the sample sequence; also encode the reference genome sequence against itself to obtain a self-similarity image of the reference genome sequence; subtract the two images to obtain the structural variation feature sequence similarity image;

Step 3, structural variation feature sequence similarity image segmentation: following the order of the main fragments on the reference genome sequence, combine each two adjacent main fragments in the structural variation feature sequence similarity image to obtain sub-images that each contain a single structural variation; following the order of the main and secondary fragments on the reference genome sequence, combine adjacent main and secondary fragments pairwise, in order, within each sub-image to obtain fragments of interest;

Step 4, recognition of structural variation feature sequence similarity images and structural variation characterization: use a pre-trained structural variation detection CNN model to recognize all fragments of interest in each sub-image containing a single structural variation, obtaining complex structural variation fragments; systematically characterize and classify the complex structural variation fragments using a graph data structure.

Preferably, in step 1, performing local k-mer realignment of the unmatched fragments of the structural variation feature sequence against the reference genome sequence according to the alignment features of the feature sequence specifically comprises: extracting, from the CIGAR string of the structural variation feature sequence, the fragments that do not match the reference genome sequence, and performing local k-mer realignment of those unmatched fragments against the reference genome sequence to obtain the secondary fragments.
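The CIGAR-based extraction and local k-mer realignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes SAM-style CIGAR strings, treats long insertions (`I`) and soft clips (`S`) as the unmatched fragments, and uses exact k-mer matching on the forward strand only. The function names, the `min_len` threshold and the value of `k` are hypothetical.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def unmatched_read_segments(cigar, min_len=50):
    """Return (start, end) read intervals of long insertions or soft clips.

    min_len is a hypothetical cutoff, not a value taken from the source.
    """
    segments = []
    read_pos = 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "IS" and length >= min_len:
            segments.append((read_pos, read_pos + length))
        if op in "MIS=X":  # operations that consume read bases
            read_pos += length
    return segments

def kmer_index(ref, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def kmer_realign(segment, ref, k=5):
    """Return maximal exact diagonal matches as (seg_start, ref_start, length)."""
    index = kmer_index(ref, k)
    hits = set()
    for j in range(len(segment) - k + 1):
        for i in index.get(segment[j:j + k], ()):
            hits.add((j, i))
    matches = []
    for j, i in sorted(hits):
        if (j - 1, i - 1) in hits:
            continue  # this hit continues a run that started earlier
        n = 0
        while (j + n, i + n) in hits:
            n += 1
        matches.append((j, i, n + k - 1))
    return matches

print(unmatched_read_segments("100M60I40M"))                  # [(100, 160)]
print(kmer_realign("GGTTCAGG", "ACGTACGGTTCAGGCTA", k=5))     # [(0, 6, 8)]
```

Each tuple returned by `kmer_realign` would play the role of a secondary fragment; detecting inverted matches would additionally require indexing the reverse complement, which this sketch omits.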

Preferably, step 2 specifically comprises:

1) RGB three-channel sequence similarity encoding: encode the structural variation feature sequence and the reference genome sequence into a sequence-match channel (255, 0, 0), a sequence-repeat channel (0, 0, 255) and a sequence-inversion channel (0, 255, 0), and output the similarity RGB image of the reference genome sequence and the sample sequence; encode the reference genome sequence into the same sequence-match channel (255, 0, 0), sequence-repeat channel (0, 0, 255) and sequence-inversion channel (0, 255, 0) to obtain the self-similarity image of the reference genome sequence, recording the position of every matched fragment;

2) Removal of repetitive fragments of the reference genome sequence: using the coordinates, relative to the reference genome sequence, of the matched fragments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, search the self-similarity image for fragments corresponding to fragments in the similarity RGB image; if a corresponding fragment is found, remove it from the similarity RGB image of the reference genome sequence and the sample sequence, obtaining the structural variation feature sequence similarity image.
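As a rough illustration of the two sub-steps above, the sketch below draws matched fragments into an RGB canvas using the three channel colours given in the text and then removes reference background by subtraction. Several things here are assumptions rather than details from the source: fragments are drawn as one-pixel diagonal lines, both images share one canvas size, the trivial main diagonal of the reference self-comparison is left out, and pixel-wise subtraction stands in for the coordinate-based fragment matching that the text describes. All function names are hypothetical.

```python
MATCH = (255, 0, 0)      # sequence-match channel
INVERSION = (0, 255, 0)  # sequence-inversion channel
REPEAT = (0, 0, 255)     # sequence-repeat channel

def encode_segments(segments, ref_len, seq_len):
    """Draw fragments into a seq_len x ref_len RGB image (rows = sequence).

    Each fragment is (ref_start, seq_start, length, colour). Inverted
    fragments are drawn anti-diagonally, all others diagonally.
    """
    image = [[(0, 0, 0)] * ref_len for _ in range(seq_len)]
    for ref_start, seq_start, length, colour in segments:
        for offset in range(length):
            x = ref_start + offset
            if colour == INVERSION:
                y = seq_start + length - 1 - offset
            else:
                y = seq_start + offset
            image[y][x] = colour
    return image

def subtract_images(ref_to_alt, ref_to_ref):
    """Pixel-wise subtraction, clamped at zero, of two equally sized images."""
    return [
        [tuple(max(a - b, 0) for a, b in zip(p, q)) for p, q in zip(row_a, row_r)]
        for row_a, row_r in zip(ref_to_alt, ref_to_ref)
    ]

# Toy case: a 2 bp deletion at reference positions 4-5, plus a short
# reference-internal repeat that also shows up in the REF-to-ALT image.
background = [(8, 1, 2, REPEAT)]
ref_to_alt = encode_segments([(0, 0, 4, MATCH), (6, 4, 4, MATCH)] + background, 10, 10)
ref_to_ref = encode_segments(background, 10, 10)
similarity = subtract_images(ref_to_alt, ref_to_ref)
```

In the toy case the repeat pixels at rows 1 and 2 are cleared while the two match diagonals flanking the deletion survive, which is the effect the background-removal step is meant to achieve.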

Preferably, step 3 specifically comprises:

1) Single structural variation segmentation: following the order of the main fragments on the reference genome sequence, combine each two adjacent main fragments in the structural variation feature sequence similarity image to obtain sub-images that each contain a single structural variation;

2) Multi-target segmentation of the structural variation image: sort the main and secondary fragments by their coordinates on the structural variation feature sequence, combine the fragments in the sub-image pairwise according to the sorted order, and take the combinations of a main fragment with a secondary fragment as fragments of interest.
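The two segmentation steps can be sketched on fragment coordinates alone. The example below is illustrative only: the `Fragment` record and both function names are hypothetical, only adjacent fragments are paired (as the system description of this module states), and pairs of two main or two secondary fragments are dropped so that only main/secondary combinations remain.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    seq_start: int  # start on the structural variation feature sequence
    seq_end: int
    ref_start: int  # start on the reference genome sequence
    kind: str       # "main" or "secondary"

def split_single_variants(main_fragments):
    """Pair adjacent main fragments in reference order; each pair bounds one sub-image."""
    ordered = sorted(main_fragments, key=lambda f: f.ref_start)
    return list(zip(ordered, ordered[1:]))

def fragments_of_interest(fragments):
    """Pair adjacent fragments in sequence order, keeping main/secondary pairs only."""
    ordered = sorted(fragments, key=lambda f: f.seq_start)
    pairs = zip(ordered, ordered[1:])
    return [(a, b) for a, b in pairs if {a.kind, b.kind} == {"main", "secondary"}]

left = Fragment(0, 100, 0, "main")
inserted = Fragment(100, 160, 500, "secondary")
right = Fragment(160, 300, 100, "main")

sub_images = split_single_variants([right, left])
sois = fragments_of_interest([left, right, inserted])
```

Here `sub_images` holds the single pair `(left, right)`, and `sois` holds the two pairs `(left, inserted)` and `(inserted, right)` that would be passed to the classifier.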

Preferably, the training method of the structural variation detection CNN model in step 4 specifically comprises:

1) Constructing the structural variation training data set: the real data are the structural variation feature sequences of 2,500 samples from the 1000 Genomes Project, and the simulated data are noise-free training samples simulated with VISOR; together they form the training data set;

2) Encoding the training data set: encode the training data in the training data set by the method of step 2 to obtain the fragments of interest of the training data set;

3) Model training: feed the training data set into a convolutional neural network and train it; the structural variation detection CNN model is obtained when training is complete.

Preferably, in step 4, using the pre-trained structural variation detection CNN model to recognize all fragments of interest in each sub-image containing a single structural variation to obtain complex structural variation fragments, and systematically characterizing and classifying those fragments using a graph data structure, specifically comprises:

The structural variation detection CNN model recognizes the fragments of interest in the sub-image to obtain complex structural variation fragments; a structural variation representation graph is built from the complex structural variation fragments, and whether different structural variations belong to the same type is computed from the topology of their representation graphs. Each node of the structural variation representation graph is a fragment contained in the fragments of interest of the sub-image, and each edge connects two fragments that are consecutive on the sample sequence.
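One simple way to compare the topology of two such graphs is to relabel the nodes by their rank on the reference and compare the resulting edge lists. The sketch below takes that approach purely as an assumption; the source does not specify how the topological comparison is computed, and the path encoding `(ref_start, strand)` and both function names are hypothetical.

```python
def variant_graph(path):
    """Build a position-independent signature of a structural variation graph.

    path lists the fragments in sample-sequence order as (ref_start, strand).
    Nodes are the distinct fragments, relabelled by their rank on the
    reference; edges join fragments that are consecutive on the sample.
    """
    nodes = sorted(set(path))
    rank = {node: i for i, node in enumerate(nodes)}
    edges = tuple((rank[a], rank[b]) for a, b in zip(path, path[1:]))
    strands = tuple(strand for _, strand in nodes)
    return strands, edges

def same_type(path_a, path_b):
    """Two variations share a type when their graph signatures are identical."""
    return variant_graph(path_a) == variant_graph(path_b)

# Two deletion-inversion events at different loci, and one tandem duplication.
del_inv_1 = [(1000, "+"), (1600, "-"), (2000, "+")]
del_inv_2 = [(50000, "+"), (50900, "-"), (51500, "+")]
tandem_dup = [(1000, "+"), (1500, "+"), (1500, "+"), (2000, "+")]

print(same_type(del_inv_1, del_inv_2))   # True
print(same_type(del_inv_1, tandem_dup))  # False
```

Because the signature discards absolute coordinates, events at different loci with the same arrangement of fragments compare as equal, which matches the stated goal of a naming scheme that does not depend on an individual researcher's description.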

A deep learning-based genomic structural variation detection system, comprising:

A structural variation feature sequence extraction module, configured to align the sample sequence to the reference genome sequence to obtain a global alignment result and to extract structural variation feature sequences from the global alignment result; the matched fragments in a structural variation feature sequence are called main fragments; according to the alignment features of the structural variation feature sequence, the unmatched fragments of the feature sequence are locally realigned against the reference genome sequence by k-mer, and the matched fragments obtained by local k-mer realignment are called secondary fragments;

A structural variation feature sequence encoding module, configured to encode the structural variation feature sequence against the reference genome sequence using three-channel RGB image encoding to obtain a similarity RGB image of the reference genome sequence and the sample sequence, to encode the reference genome sequence against itself to obtain a self-similarity image of the reference genome sequence, and to subtract the two images to obtain the structural variation feature sequence similarity image;

A structural variation feature sequence similarity image segmentation module, configured to combine, following the order of the main fragments on the reference genome sequence, each two adjacent main fragments in the structural variation feature sequence similarity image to obtain sub-images that each contain a single structural variation, and to combine, following the order of the main and secondary fragments on the reference genome sequence, adjacent main and secondary fragments within each sub-image to obtain fragments of interest;

A structural variation recognition and characterization module, configured to use a pre-trained structural variation detection CNN model to recognize all fragments of interest in each sub-image containing a single structural variation, obtaining complex structural variation fragments, and to systematically characterize and classify the complex structural variation fragments using a graph data structure.

Preferably, the structural variation feature sequence encoding module comprises:

An RGB three-channel sequence similarity encoding module, which encodes the structural variation feature sequence and the reference genome sequence into a sequence-match channel (255, 0, 0), a sequence-repeat channel (0, 0, 255) and a sequence-inversion channel (0, 255, 0) and outputs the similarity RGB image of the reference genome sequence and the sample sequence, and which encodes the reference genome sequence into the same sequence-match channel (255, 0, 0), sequence-repeat channel (0, 0, 255) and sequence-inversion channel (0, 255, 0) to obtain the self-similarity image of the reference genome sequence, recording the position of every matched fragment;

A reference genome repetitive fragment removal module, which uses the coordinates, relative to the reference genome sequence, of the matched fragments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence to search the self-similarity image for fragments corresponding to fragments in the similarity RGB image; if a corresponding fragment is found, it is removed from the similarity RGB image of the reference genome sequence and the sample sequence, yielding the structural variation feature sequence similarity image.

Preferably, the structural variation feature sequence similarity image segmentation module comprises:

A single structural variation segmentation module, configured to combine, following the order of the main fragments on the reference genome sequence, each two adjacent main fragments in the structural variation feature sequence similarity image to obtain sub-images that each contain a single structural variation;

A structural variation image segmentation module, configured to sort the main and secondary fragments by their coordinates on the structural variation feature sequence, to combine each two adjacent fragments in the sorted order into a combined fragment, to filter out combined fragments formed from two secondary fragments or from two linear fragments, and to take the combined fragments formed from a main fragment and a secondary fragment as fragments of interest.

Preferably, the structural variation recognition and characterization module specifically comprises:

A structural variation training data set construction module, configured to combine real data and simulated data into a training data set, the real data being the structural variations of 2,500 samples from the 1000 Genomes Project and the simulated data being noise-free training data simulated with VISOR;

A training data set encoding module, which encodes all training data in the training data set by means of the structural variation feature sequence encoding module to obtain the fragments of interest of the training data set;

A convolutional neural network training module, configured to feed the training data set into a convolutional neural network and train it, the structural variation detection CNN model being obtained when training is complete;

A structural variation graph characterization module, configured to recognize the fragments of interest in the sub-image with the structural variation detection CNN model to obtain complex structural variation fragments, to build a structural variation representation graph from the complex structural variation fragments, and to compute from the topology of the representation graphs whether different structural variations belong to the same type; each node of the structural variation representation graph is a fragment contained in the fragments of interest of the sub-image, and each edge connects two fragments that are consecutive on the sample sequence.

Compared with the prior art, the present invention has the following beneficial technical effects:

The present invention is the first to propose a model-independent genomic structural variation detection method for long-read sequencing. The method comprises four main steps (Figure 2): (1) extracting structural variation feature sequences using existing sequence alignment techniques; (2) encoding structural variation feature sequence similarity images as RGB images; (3) predicting the structural variations contained in those similarity images with a multi-target recognition framework; and (4) systematically representing complex structural variation types with a graph data structure. The invention optimizes existing image encoding techniques and multi-target recognition frameworks: in step (2) the optimized image encoding encodes the similarity between the reference genome sequence and the sample sequence, and in step (3) a multi-target recognition framework based on a convolutional neural network (CNN) detects structural variations of differing complexity carried by the sample sequence from the structural variation feature sequence similarity images.

First, a three-channel (RGB) encoding scheme encodes the similarity of the reference genome sequence (REF) and the sample sequence (ALT) as an RGB image, defined as the REF-to-ALT image. Second, to remove the effect of background noise in the reference genome sequence (such as short tandem repeats) on structural variation detection, the self-similarity image of the reference genome sequence (the REF-to-REF image) is encoded as well, and image subtraction then removes the background noise, which in particular avoids its effect on the detection of complex structural variations. Within the CNN-based multi-target recognition framework, the invention proposes a two-step image segmentation method. The first step splits the multiple structural variations contained in one structural variation feature sequence similarity image into sub-images that each contain a single structural variation; such a sub-image is defined as a Similarity Image (SI). The second step segments each similarity image into fragments of interest (Segments of Interest, SOI). Finally, a pre-trained CNN model recognizes all SOIs contained in the SI and yields the final prediction.

The image encoding and multi-target recognition framework used in the invention are readily extensible, so they can be applied to a variety of long-read sequencing data, in particular data produced by the PacBio and Oxford Nanopore sequencing platforms; the invention can also be applied to contig sequences assembled from long-read sequencing. Beyond detection, previous studies have named the same complex structural variation type in different ways, often depending on the researcher's own subjective understanding; the lack of a systematic, unified nomenclature has also hindered research on complex structural variation. Another major innovation of the invention is that it is the first to use a graph data structure to represent complex structural variations and to classify different types of complex structural variation by computing the topological similarity of the graphs. With these features, the invention can substantially improve the detection rate and precision for structural variations of differing complexity, further advance disease diagnosis and early screening based on third-generation sequencing data, and provide an efficient and reliable detection system for related fields. In addition, because it works by recognizing structural variation feature sequences, the invention is better suited to detecting structural variations from different clones within a tumor, providing a new detection system for studying the role of structural variation in tumor evolution.

Furthermore, because the detection rate of complex structural variations has been low in the past, their characterization and comparison have relied on ad hoc definitions by individual researchers. While improving the detection rate of complex structural variations, the present invention proposes a graph-based representation of complex structural variation. In essence, it constructs the connection relationships of the different SOIs relative to the reference genome sequence: the nodes of the graph are the segments contained in the SOIs, and each edge connects two segments that are consecutive on the sample sequence.

The present invention is a genome structural variation detection system based on multi-target recognition. With the theory of model-independent structural variation detection at its core, it uses a structural variation feature sequence extraction module, a structural variation feature sequence similarity image encoding module, a structural variation feature sequence similarity image segmentation module, and a structural variation identification and characterization module to achieve structural variation detection for long-read sequencing without relying on any variation model. RGB-image-based sequence similarity encoding and multi-target recognition capture two fundamental characteristics of structural variation detection: first, a structural variation manifests as a difference between the reference genome sequence and the sample sequence; second, a complex structural variation manifests as a combination of several simple structural variations. The present invention then detects structural variations of different levels of complexity by identifying the SOIs in the similarity image. Because it does not rely on any variation model, its sensitivity for known simple structural variations is higher than that of existing model-based detection methods, and it substantially increases the detection rate of complex structural variations while keeping the false positive rate low.

In summary, the genome structural variation detection system of the present invention is a core technology for precise diagnosis. Seizing the major opportunity that third-generation sequencing brings to precision medicine, it proposes a model-independent theory of structural variation detection and, based on that theory, designs and implements a new structural variation detection system. While maintaining a high detection rate and accuracy for simple structural variations, the system substantially increases the detection rate of complex structural variations and provides important technical support for the clinical application of third-generation sequencing. The present invention addresses major national needs and studies a core problem of the national strategic emerging industry of "precision medicine". It helps China break its dependence on others for key core technologies in the strategically contested field of genome structural variation detection, and it helps open up new directions for industries related to "precision medicine" and cultivate new drivers of economic growth.

Description of the drawings

Figure 1 shows the models of simple and complex structural variations constructed for long-read sequencing;

Figure 2 shows the workflow of the deep learning-based genome structural variation detection system for long reads and assembled sequences;

Figure 3 compares simple structural variation detection results across different long-read sequencing platforms;

Figure 4 compares detection results for simulated complex structural variations;

Figure 5 compares detection results for structural variations of different sizes on high-fidelity (HiFi) sequencing data;

Figure 6 compares the detection results of the present invention on sequencing reads and on assembled contigs;

Figure 7 compares structural variation detection results for phased genomes.

Detailed description of the embodiments

The present invention is described in further detail below with reference to specific embodiments, which explain the invention and do not limit it.

The present invention proposes a new, model-independent theory of genome structural variation detection and designs a genome structural variation detection system for long-read sequencing based on that theory.

The deep learning-based genome structural variation detection system and method of the present invention rest on the following observation: for every long-read sequencing technology, the fundamental signature of a structural variation is a difference between the sample sequence (sequencing reads or assembled sequences) and the reference genome sequence. The present invention therefore encodes the differences between the sample sequence and the reference genome sequence as an image, uses a multi-target recognition framework to detect structural variations of different levels of complexity in the similarity image of the reference genome sequence and the sample sequence, and then uses a graph data structure to systematically characterize and classify the different types of complex structural variation, thereby achieving both detection and characterization. The core of the algorithm designed from this theory comprises: (1) structural variation feature sequence extraction; (2) structural variation feature sequence similarity image encoding; (3) structural variation feature sequence similarity image segmentation; (4) identification of structural variation feature sequences and characterization of structural variations.

The core method of the present invention mainly includes the following steps:

Step 1. Structural variation feature sequence extraction: from the sequencing data of a sample, locate the regions where potential structural variations exist and extract the structural variation feature sequences that support them. Specifically:

Using existing techniques, the sample sequences are aligned to the reference genome sequence genome-wide. From the alignment results, abnormally aligned reads are extracted, and from these the read sequences spanning the structural variation site are extracted; the structural variation feature sequence is obtained either by assembling these reads or by using the read sequences directly. Existing aligners generally provide two main signatures of abnormal alignment: (1) insertions and deletions recorded in the CIGAR string of the alignment; (2) supplementary alignments retained in the alignment results.

K-mer-based re-alignment of the feature sequence: the CIGAR string of a structural variation feature sequence obtained in step 1 with existing techniques records the matched and unmatched segments between the sample sequence and the reference genome sequence; this result is regarded as the global alignment. First, the sequences of the unmatched segments are extracted from the global alignment. Next, each unmatched segment is re-aligned to the reference genome sequence and the new alignment is recorded, together with whether the segment is repeated and its alignment direction (5' to 3' or 3' to 5'). The matched segments obtained from the CIGAR string are then defined as primary segments, and the segments obtained by re-alignment are defined as secondary segments. Primary and secondary segments are collectively called the matching segments between the sample sequence and the reference genome sequence; each segment carries three features: alignment position, alignment direction, and whether it is a repeat.
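The k-mer re-alignment described above can be illustrated with a minimal sketch. It is not the patented implementation: the exact-match hash index, the function names, and the rule that a k-mer counts as a repeat when it occurs more than once in the reference are all assumptions made for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def realign_segment(segment: str, index: dict, k: int) -> list:
    """Look up each k-mer of an unmatched segment on both strands.

    Returns (offset_in_segment, ref_position, strand, is_repeat) tuples.
    Assumption: is_repeat is True when the k-mer has more than one
    occurrence in the reference on the strand being queried.
    """
    hits = []
    for offset in range(len(segment) - k + 1):
        kmer = segment[offset:offset + k]
        for strand, query in (("+", kmer), ("-", reverse_complement(kmer))):
            positions = index.get(query, [])
            for pos in positions:
                hits.append((offset, pos, strand, len(positions) > 1))
    return hits
```

Consecutive hits on the same diagonal and strand would then be merged into secondary segments; that merging step is left out here to keep the sketch short.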

Step 2. Structural variation feature sequence similarity image encoding: for the structural variation feature sequences obtained in step 1, the alignment results of step 1 and the k-mer re-alignment results are used to encode the similarity between each feature sequence and the reference sequence as an RGB image. The same encoding scheme is used to encode the self-similarity of the reference genome sequence as the REF-to-REF image, and the denoised REF-to-ALT image is then obtained by image subtraction.

Step 2 specifically includes:

1) Encoding the alignment results in three RGB channels: every segment in the alignment results of step 1 is encoded, according to its three features, into the sequence match channel (255, 0, 0), the sequence repeat channel (0, 0, 255) and the sequence inversion channel (0, 255, 0). Each channel can encode sequences of different lengths by setting a minimum segment threshold, and each pixel can reach single-base resolution, that is, one pixel represents one pair of identical bases between the reference genome sequence and the sample sequence. With this three-channel encoding, the step outputs the similarity RGB image of the reference genome sequence and the sample sequence, together with the position of each segment in the image. In addition, the re-alignment method of step 1 is applied again to align the reference genome sequence with itself; the k-mer is shifted one base at a time to obtain a single-base-resolution alignment, and the same three-channel encoding is then used to obtain the REF-to-REF image.

2) Denoising the similarity RGB image of the reference genome sequence and the sample sequence: given the similarity RGB image and the REF-to-REF image obtained in 1), the start and end coordinates of every segment relative to the reference genome sequence are first extracted from both images, and the segments from each image are sorted by reference coordinate. Each segment from the similarity RGB image is then checked for overlap with the segments of the REF-to-REF image; because the segments are sorted, each check can be completed in O(log n) time. Based on the overlap, a preset threshold decides whether the segment should be removed from the similarity RGB image. This completes the subtraction of the two images and yields the structural variation feature sequence similarity image.
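The overlap test in the denoising step can be sketched with a binary search over the sorted REF-to-REF segments. This is an illustrative sketch only: the half-open `(start, end)` interval representation, the `max_overlap` threshold of 0.5, and the assumption that background segments do not overlap one another are not taken from the source.

```python
from bisect import bisect_right

def overlap_fraction(segment, background, background_starts):
    """Fraction of `segment` covered by the sorted background intervals.

    Intervals are half-open (start, end) pairs in reference coordinates.
    Assumption: background intervals are sorted and mutually non-overlapping,
    so the binary search finds the only interval that can start before
    `segment` and still overlap it.
    """
    start, end = segment
    i = max(bisect_right(background_starts, start) - 1, 0)
    covered = 0
    while i < len(background) and background[i][0] < end:
        b_start, b_end = background[i]
        covered += max(0, min(end, b_end) - max(start, b_start))
        i += 1
    return covered / (end - start)

def denoise(segments, background, max_overlap=0.5):
    """Drop REF-to-ALT segments mostly explained by REF-to-REF segments."""
    background = sorted(background)
    starts = [b[0] for b in background]
    return [s for s in segments
            if overlap_fraction(s, background, starts) < max_overlap]
```

Each lookup costs O(log n) for the search plus the number of background intervals that actually overlap the segment, which is usually small.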

Step 3. Structural variation feature sequence similarity image segmentation: the structural variation feature sequence similarity image obtained in step 2 is first segmented, according to the ordered combination of the primary segments in the image, into similarity images (SI) that each carry a single structural variation. Each SI is then segmented, according to the combination of primary and secondary segments, into segments of interest (SOI);

Step 3 specifically includes:

1) Single structural variation segmentation: separates the multiple structural variations that appear in one structural variation feature sequence similarity image, so that consecutive simple structural variations are not predicted and characterized as one complex structural variation. The primary segments in the image are first arranged in order and then combined pairwise in that order; using the coordinates of the primary segments, the image region containing each pair is extracted from the original image, yielding similarity images that each contain a single structural variation;

2) Multi-target segmentation of the similarity image: the similarity image is segmented into the multiple SOIs to be identified. First, the primary and secondary segments in the similarity image are sorted by their order on the reference genome sequence. Next, each pair of adjacent segments in this order is combined, and the sub-image containing the two segments is extracted as an SOI. During extraction, SOIs formed by two secondary segments are filtered out first. Then the two segments are tested for whether they form a single linear segment, by checking whether the difference in their slopes in the two-dimensional image lies within a fixed range; this range is derived from the minimum structural variation size the user wants to detect and, because similarity images differ in dimension, it differs from image to image. The SOIs that remain in each similarity image after these two filters are used for subsequent identification;
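The pairing and filtering of segments into SOIs can be sketched as follows. The source does not give the slope formula, so the collinearity test here is an assumption: two forward segments are treated as one linear segment when the slope of the line from the start of the first to the end of the second deviates from 1 by less than `min_sv_size / image_width`. The `Segment` fields are likewise hypothetical, and cropping the sub-image is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    ref_start: int
    ref_end: int
    alt_start: int
    alt_end: int
    primary: bool         # True: from the CIGAR string; False: re-aligned
    forward: bool = True  # False for inverted segments

def is_linear(a: Segment, b: Segment, min_sv_size: int, image_width: int) -> bool:
    """Assumed collinearity test for two adjacent segments."""
    if not (a.forward and b.forward):
        return False
    dx = b.ref_end - a.ref_start
    dy = b.alt_end - a.alt_start
    if dx <= 0:
        return False
    # The tolerance shrinks as the image grows, so it differs per image.
    return abs(1.0 - dy / dx) < min_sv_size / image_width

def extract_sois(segments, min_sv_size: int, image_width: int) -> list:
    """Pair reference-adjacent segments and apply the two filters."""
    ordered = sorted(segments, key=lambda s: s.ref_start)
    sois = []
    for a, b in zip(ordered, ordered[1:]):
        if not a.primary and not b.primary:
            continue  # filter 1: two secondary segments
        if is_linear(a, b, min_sv_size, image_width):
            continue  # filter 2: the pair is one linear match
        sois.append((a, b))
    return sois
```

With a 1000-pixel image and a 50-base minimum size, a 200-base gap on the reference changes the joint slope from 1 to 0.8 and the pair is kept, while a 10-base gap changes it to 0.99 and the pair is discarded.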

Step 4. Identification of structural variation feature sequences and characterization of structural variations: for the multiple targets (that is, SOIs) in a similarity image obtained in step 3, a pre-trained convolutional neural network (CNN) identifies structural variations of different levels of complexity in the similarity image; after identification, graphs are used to systematically characterize and classify the different types of complex structural variation.

Step 4 specifically includes:

1) Building and encoding the structural variation training data set: the structural variations used for training comprise four simple types: deletion (DEL), insertion (INS), inversion (INV) and duplication (DUP). The real data consist of the structural variations of the 2,500 samples in the 1000 Genomes Project, and the simulated data are noise-free training samples generated with VISOR; together they form the training data set. The training data set is then encoded as described in steps 2 and 3;

2) Convolutional neural network (CNN) training: the training set encoded in 1) is fed to a convolutional neural network, and an AlexNet network is trained with existing techniques. Training uses transfer learning: the best-performing parameters from the ImageNet competition initialize the model, and cross-validation selects the best model as the final structural variation detection model.

3) Building the complex structural variation representation graph: the structural variation detection CNN model identifies the multiple SOIs in a sub-image. A complex structural variation representation graph is first built from the identification results; the topology of the representation graphs is then used to decide whether different complex structural variations belong to the same type.

Building the representation graph: first, the matching segments are collected from all SOIs in the similarity image. Each segment carries its start and end coordinates, its direction and, for repeated segments, the origin of the repeat. Every segment becomes a node of the representation graph, and every node has a direction. Second, each edge of the graph connects two segments that are adjacent on the sample sequence; in addition, for a repeated segment whose origin is known, a repeat edge connects the segment to its origin.

Computing the topological similarity of representation graphs: first, check whether the two graphs have the same numbers of nodes and edges. Second, for two graphs with the same numbers of nodes and edges, derive the 5'-to-3' path and the 3'-to-5' path of each graph from the edge connections and node directions; if the two representation graphs have symmetric paths, the two structural variations are considered to belong to the same type.
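The graph construction and comparison above can be sketched as below. The source does not define "symmetric paths" formally, so this sketch adopts one possible reading: two graphs match when the 5'-to-3' path of one equals either the 5'-to-3' path or the mirrored (3'-to-5') path of the other. The `Node` fields, the `ref_rank` labelling of fragments in reference order, and the mirroring rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    ref_rank: int                     # rank of the fragment in reference order (assumed labelling)
    strand: str                       # "+" or "-"
    dup_source: Optional[int] = None  # ref_rank of the origin, for known repeats

def build_graph(sample_order) -> dict:
    """Build a graph from nodes listed in their order on the sample sequence."""
    nodes = list(sample_order)
    edges = [(i, i + 1) for i in range(len(nodes) - 1)]  # sample adjacency
    dup_edges = [(i, n.dup_source) for i, n in enumerate(nodes)
                 if n.dup_source is not None]            # repeat -> origin
    return {"nodes": nodes, "edges": edges, "dup_edges": dup_edges}

def forward_path(graph: dict) -> list:
    """5'-to-3' path: nodes in sample order."""
    return [(n.ref_rank, n.strand) for n in graph["nodes"]]

def mirrored_path(graph: dict) -> list:
    """3'-to-5' path: reversed sample order with ranks counted from the other end."""
    top = max(n.ref_rank for n in graph["nodes"])
    return [(top - n.ref_rank, n.strand) for n in reversed(graph["nodes"])]

def same_type(g1: dict, g2: dict) -> bool:
    """Compare node and edge counts, then look for a symmetric path."""
    if len(g1["nodes"]) != len(g2["nodes"]):
        return False
    if (len(g1["edges"]) + len(g1["dup_edges"])
            != len(g2["edges"]) + len(g2["dup_edges"])):
        return False
    return forward_path(g1) in (forward_path(g2), mirrored_path(g2))
```

Under this reading, a deletion followed by an inversion and an inversion followed by a deletion over four reference fragments are mirror images of each other and fall into the same type, whereas a plain inversion does not.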

The present invention is implemented as follows; the workflow is shown in Figure 2:

Step 1. Using existing techniques, the long-read sequencing data of the sample (the sample sequences) are aligned to the reference genome sequence to determine the coordinates of each read on the reference genome. The parts of the sample sequences that do not match the reference genome sequence are extracted from the alignment results, giving reads with abnormal alignment signatures; the structural variation feature sequences are obtained either by assembling these reads or by using the read sequences directly.

Long-read alignment, used in step 1, has matured over several years of development and is usually performed with a seed-and-extend strategy combined with dynamic programming. Alignment consists of seed generation and extension, with extension implemented mainly by dynamic programming. Representative mainstream tools include minimap2, NGMLR and pbmm2.

Once the alignments are available, the key to extracting structural variation feature sequences is locating abnormal alignment sites. If the sample sequence carries no structural variation at a site, the reads at that site do not differ from the reference genome sequence and align in the normal orientation; otherwise, reads aligned to the site may carry variation signatures. The signatures considered in the present invention are mainly: (1) Unmatched sequence. If, relative to the reference genome sequence, the sample sequence has a short insertion or deletion at a site, the reads aligned to that site carry an unmatched-sequence signature, which appears mainly as insertions (I) and deletions (D) in the alignment. Here the present invention considers only unmatched sequences longer than 50 bp. For these unmatched sequences, a k-mer-based local hash alignment method refines the origin of the inserted segment and also reveals variation signatures other than insertions (I) and deletions (D). (2) Supplementary alignment. When a read spans a site that carries a structural variation, the aligner breaks the long read into several subsequences and aligns each to the reference genome sequence, producing supplementary alignments. Because the reference genome itself is incomplete, by default the present invention does not process reads that produce four or more supplementary alignments; this parameter can be adjusted to the user's needs.
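Extracting the unmatched-sequence signature from a CIGAR string can be sketched as follows. The operation semantics follow the SAM format (M, =, X consume both sequences; I and S consume only the read; D and N consume only the reference); the function name, the 0-based `ref_start`, and the returned tuple layout are assumptions made for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def large_indels(cigar: str, ref_start: int, min_size: int = 50) -> list:
    """Return (op, ref_pos, read_pos, length) for I/D operations longer than min_size.

    ref_pos and read_pos are the coordinates at which the event begins.
    """
    ref_pos, read_pos = ref_start, 0
    events = []
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "ID" and length > min_size:
            events.append((op, ref_pos, read_pos, length))
        if op in "MDN=X":   # operations that consume the reference
            ref_pos += length
        if op in "MIS=X":   # operations that consume the read
            read_pos += length
    return events
```

For example, `large_indels("100M60I200M10D50M80D20M", 1000)` reports the 60-base insertion and the 80-base deletion and skips the 10-base deletion, which is below the threshold.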

Step 2. Encode the similarity between the structural variation feature sequence and the reference genome sequence as a three-channel RGB image:

After the structural variation feature sequences have been obtained from the signatures above, the present invention converts each of them, according to its alignment, into segment features (matched and unmatched segments) and clusters segment features with the same attributes. For the segment features in each cluster, the CIGAR string of the structural variation feature sequence from step 1, the k-mer alignment results and the reference genome sequence are encoded into the sequence match channel (255, 0, 0), the sequence repeat channel (0, 0, 255) and the sequence inversion channel (0, 255, 0); each channel can encode different sequences by setting a minimum segment threshold. With this three-channel encoding, the step outputs the RGB encoding matrix of the sample sequence and the reference genome sequence, together with the position of every matching segment. The similarity RGB image of the reference genome sequence and the sample sequence is built from this RGB encoding matrix. The reference genome sequence is likewise encoded against itself into the sequence match channel (255, 0, 0), the sequence repeat channel (0, 0, 255) and the sequence inversion channel (0, 255, 0) to obtain the REF-to-REF image. Subtracting the two images removes the effect of repeated segments in the reference genome sequence on subsequent structural variation detection and yields the structural variation feature sequence similarity image.
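The three-channel encoding can be sketched with nested lists standing in for an image array. The channel colours are those given in the text; everything else is an assumption for illustration, in particular the segment tuple layout and the choice to set the match channel for every segment and add the inversion or repeat channel on top of it (the source may instead draw each segment in a single channel).

```python
MATCH, INVERSION, REPEAT = 0, 1, 2  # indices of the R, G and B channels

def encode_rgb(segments, ref_len: int, alt_len: int, min_len: int = 1) -> list:
    """Draw matching segments as diagonals in an alt_len x ref_len RGB image.

    Each segment is (ref_start, alt_start, length, forward, is_repeat).
    One pixel corresponds to one pair of matching bases.
    """
    image = [[[0, 0, 0] for _ in range(ref_len)] for _ in range(alt_len)]
    for ref_start, alt_start, length, forward, is_repeat in segments:
        if length < min_len:
            continue  # minimum segment threshold
        for i in range(length):
            row = alt_start + i
            # Inverted segments run along the anti-diagonal.
            col = ref_start + i if forward else ref_start + length - 1 - i
            pixel = image[row][col]
            pixel[MATCH] = 255
            if not forward:
                pixel[INVERSION] = 255
            if is_repeat:
                pixel[REPEAT] = 255
    return image
```

Calling the same function with the reference aligned against itself would give the REF-to-REF image used for subtraction.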

Step 3. Using the multi-target recognition framework, segment the structural variation feature sequence similarity image from step 2 into similarity images that each contain a single structural variation, with the SOIs as the multiple targets:

As sequencing read lengths keep increasing, a single read may well span more than one structural variation. The present invention therefore first performs single structural variation segmentation on the structural variation feature sequence similarity image, to separate the multiple structural variations in the image and avoid predicting and characterizing consecutive simple structural variations as one complex structural variation. Specifically, the primary segments are arranged in order and combined to form similarity images: within a structural variation feature sequence similarity image, the sub-image formed by every two adjacent primary segments is considered to contain one structural variation, so combining two adjacent primary segments gives a similarity image containing a single structural variation. The primary and secondary segments that are adjacent in order within the similarity image are then combined to obtain the SOIs.

Step 4. Identification of structural variation feature sequences and characterization of structural variations: the multiple targets in the similarity image obtained in step 3 are predicted and classified with a pre-trained CNN model, and graphs are finally used to characterize the different structural variation types:

The present invention first builds a structural variation training data set. The structural variations used for training comprise four simple types: deletion (DEL), insertion (INS), inversion (INV) and duplication (DUP). The real data consist of the structural variations of the 2,500 samples in the 1000 Genomes Project, and the simulated data are noise-free training samples generated with VISOR; together they form the training data set. The structural variations in the training data set are encoded as described in step 2. The present invention uses the existing AlexNet network and trains it on the encoded structural variations. Training uses transfer learning: the best-performing parameters from the ImageNet competition initialize the model, and cross-validation selects the best model as the final structural variation detection CNN model. This model identifies the targets in each sub-image, so that structural variations of different levels of complexity are identified without relying on any structural variation model. In addition, to handle the multiple breakpoints of complex structural variations, the present invention uses a graph data structure to characterize the different types of complex structural variation. To build the graph, all segments contained in the SOIs of the structural variation feature sequence similarity image are first taken as nodes; each edge then connects two segments that are adjacent on the sample sequence. Based on this graph representation, the present invention classifies complex structural variations by the topological similarity of their graphs: 1) check whether the two representation graphs have the same numbers of nodes and edges; 2) for representation graphs with the same numbers of edges and nodes, determine whether the two graphs have symmetric paths. If both conditions are met, the two complex structural variations being compared are considered to belong to the same type.

The structural variation feature sequence extraction module specifically includes:

An extraction module, used to align the sample sequencing data to the reference genome sequence to obtain an alignment result, and to extract structural variation feature sequences according to the alignment result;

A low-quality variation signal removal module, used to remove structural variation feature sequences that fail the quality requirements, mainly those with a mapping quality below 20 and those that produce more than 3 supplementary alignments;
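The quality filter above amounts to two threshold checks per aligned read. The sketch below assumes a hypothetical `Alignment` record that already carries the mapping quality and the number of supplementary alignments; how those values are read from an alignment file is outside the example.

```python
from dataclasses import dataclass

MIN_MAPQ = 20            # reads below this mapping quality are removed
MAX_SUPPLEMENTARY = 3    # reads with more supplementary alignments are removed


@dataclass
class Alignment:
    read_name: str
    mapq: int
    n_supplementary: int


def keep_alignment(aln):
    """True if the read passes both quality thresholds."""
    return aln.mapq >= MIN_MAPQ and aln.n_supplementary <= MAX_SUPPLEMENTARY


def filter_alignments(alignments):
    """Return only the alignments that pass the quality filter."""
    return [aln for aln in alignments if keep_alignment(aln)]
```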

A feature clustering and filtering module, used to cluster structural variation feature sequences by feature similarity to obtain feature clusters, filter out feature clusters whose feature value is below a predetermined threshold, and retain the structural variation feature sequences in the remaining clusters together with their CIGAR strings. That is, signals that potentially support the same structural variation are grouped by hierarchical clustering according to a similarity measure; clusters smaller than the minimum cluster size are then removed, and the clusters that pass this filter are used to locate potential variation regions;
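As a rough illustration of the clustering and size filter, the sketch below groups breakpoint positions on one chromosome. The similarity measure is not specified in the text, so the example assumes plain positional distance with single-linkage clustering, which in one dimension reduces to splitting the sorted positions wherever the gap exceeds a cutoff. The `max_gap` and `min_size` defaults are illustrative values only.

```python
def cluster_signals(positions, max_gap=500, min_size=5):
    """Group sorted positions into clusters and drop clusters that are too small.

    Single-linkage clustering cut at distance `max_gap`: a new cluster starts
    whenever the distance to the previous position exceeds `max_gap`.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # Keep only clusters with enough supporting signals.
    return [cluster for cluster in clusters if len(cluster) >= min_size]
```

Each surviving cluster spans a candidate variation region from its first to its last position.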

A Kmer-based local alignment module, used to realign, according to the CIGAR string of each structural variation feature sequence, the unmatched fragments specific to that sequence, and to record the Kmer alignment results.
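The sketch below shows one way the CIGAR string could be split into matched and unmatched fragments, followed by a naive exact Kmer lookup for an unmatched fragment. The function names, the `min_unmatched` cutoff, and the use of exact Kmer matches as seeds are assumptions for illustration; the CIGAR operator semantics follow the SAM convention.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def split_cigar(cigar, ref_start, min_unmatched=50):
    """Return (matched, unmatched) fragments described by a CIGAR string.

    matched:   list of (ref_pos, read_pos, length) for M/=/X operations
    unmatched: list of (read_pos, length) for long insertions and soft clips
    """
    ref_pos, read_pos = ref_start, 0
    matched, unmatched = [], []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":          # consumes reference and read
            matched.append((ref_pos, read_pos, n))
            ref_pos += n
            read_pos += n
        elif op in "IS":         # consumes read only
            if n >= min_unmatched:
                unmatched.append((read_pos, n))
            read_pos += n
        elif op in "DN":         # consumes reference only
            ref_pos += n
        # H and P consume neither sequence
    return matched, unmatched


def kmer_hits(query, reference, k=5):
    """Return (ref_pos, query_pos) pairs where a k-mer of the query occurs in the reference."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((i, j))
    return hits
```

Runs of consecutive hits on the same diagonal would then be merged into secondary fragments; that merging step is omitted here.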

The structural variation feature sequence encoding module includes:

An RGB three-channel sequence similarity encoding module, which encodes the structural variation feature sequence and the reference genome sequence into a sequence match channel (255, 0, 0), a sequence repeat channel (0, 0, 255), and a sequence inversion channel (0, 255, 0). Each channel can encode different sequences by setting a minimum fragment threshold. With this three-channel encoding the module outputs an RGB image of the similarity between the reference genome sequence and the sample sequence. The reference genome sequence is likewise encoded into the sequence match channel (255, 0, 0), the sequence repeat channel (0, 0, 255), and the sequence inversion channel (0, 255, 0) to obtain the self-similarity image of the reference genome sequence, and the position of each matching fragment is recorded. The encoding process first encodes the main fragments from the CIGAR string of the structural variation feature sequence; then, for unmatched sequences in the alignment, it applies Kmer-based realignment encoding. In this process the different sequence matching features are encoded into the match channel, the repeat channel, and the inversion channel respectively.
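A minimal sketch of the three-channel encoding follows. The channel colours are taken from the text; everything else is assumed for illustration: the image is a nested list with the reference on the x axis and the sample sequence on the y axis, one pixel per base, and fragments arrive as hypothetical `(ref_start, read_start, length, kind)` tuples.

```python
COLORS = {
    "match": (255, 0, 0),      # sequence match channel
    "repeat": (0, 0, 255),     # sequence repeat channel
    "inversion": (0, 255, 0),  # sequence inversion channel
}


def encode_segments(segments, ref_len, read_len, min_len=1):
    """Draw each fragment as a diagonal line in a read_len x ref_len RGB image."""
    image = [[(0, 0, 0)] * ref_len for _ in range(read_len)]
    for ref_start, read_start, length, kind in segments:
        if length < min_len:   # minimum fragment threshold
            continue
        for i in range(length):
            x = ref_start + i
            # Inverted fragments run along the anti-diagonal.
            offset = length - 1 - i if kind == "inversion" else i
            y = read_start + offset
            if 0 <= x < ref_len and 0 <= y < read_len:
                image[y][x] = COLORS[kind]
    return image
```

Encoding the reference against itself with the same function would give the self-similarity image used in the repeat-removal step.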

A reference genome repeat removal module. Using the coordinates, relative to the reference genome sequence, of the matching fragments in the reference-versus-sample similarity RGB image and in the reference self-similarity image, the module finds, for each fragment of the reference-versus-sample similarity RGB image, the corresponding fragment in the reference self-similarity image; once found, the fragment is removed from the reference-versus-sample similarity RGB image, which yields the structural variation feature sequence similarity image. This addresses the low detection rate and low accuracy for structural variations caused by repetitive sequences in the reference genome. The module is the first to propose subtracting the RGB encoding of the reference genome against itself from the RGB encoding matrix of the reference genome against the sample sequence, thereby removing the influence of repetitive sequences on structural variation detection and restoring the true information of the sample sequence in the region. The method effectively removes the impact of repeated fragments on structural variation detection, especially for complex structural variations, which lowers the false positive rate and improves detection accuracy.
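The subtraction can be illustrated at the fragment level rather than the pixel level. The text does not define when two fragments "correspond", so this sketch assumes a simple rule: same channel and at least 90% overlap of the reference interval. It also assumes the self-similarity list holds only the off-diagonal fragments (the trivial self-match is excluded). Fragments are hypothetical `(ref_start, ref_end, kind)` tuples.

```python
def overlap_fraction(a, b):
    """Fraction of fragment a's reference interval covered by fragment b."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(0, shared) / (a[1] - a[0])


def remove_reference_repeats(sample_segments, self_segments, min_overlap=0.9):
    """Drop sample fragments that are already explained by repeats in the reference."""
    kept = []
    for seg in sample_segments:
        is_reference_repeat = any(
            seg[2] == ref[2] and overlap_fraction(seg, ref) >= min_overlap
            for ref in self_segments
        )
        if not is_reference_repeat:
            kept.append(seg)
    return kept
```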

A fragment classification module, used to label the main fragments and secondary fragments in the structural variation feature sequence similarity image after repeated fragments are removed. Main fragments come from the fragments marked M in the CIGAR string of the structural variation feature sequence; secondary fragments come from the Kmer-based local alignment results.

The structural variation feature sequence similarity image segmentation module includes:

A single structural variation segmentation module, used to combine two adjacent main fragments in the structural variation feature sequence similarity image, in the order of the main fragments on the reference genome sequence, to obtain a sub-image containing only one structural variation. It separates the multiple distinct structural variations that appear in a structural variation feature sequence similarity image, so that consecutive simple structural variations are not predicted and characterized as a complex structural variation;

A structural variation image segmentation module, used to sort the main and secondary fragments by their coordinates on the structural variation feature sequence, combine each two adjacent fragments in the sorted order into a combined fragment, filter out combined fragments formed from two secondary fragments or from two linear fragments, and take combined fragments formed from one main fragment and one secondary fragment as segments of interest (SOIs). It serves to identify structural variations of different complexity, in particular the internal structure of complex structural variations. Compared with conventional multi-object recognition in computer vision, the RGB images designed in the present invention are much sparser, so the conventional sliding-window approach is not suitable.
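The pairing rule above replaces a sliding window with a walk over adjacent fragments. The sketch below assumes a hypothetical `Segment` record and treats "linear fragment" as a synonym for main fragment, so a pair is kept only when exactly one of its two members is a main fragment.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    read_start: int   # coordinate on the structural variation feature sequence
    length: int
    primary: bool     # True for main fragments, False for secondary fragments


def segments_of_interest(segments):
    """Pair adjacent fragments and keep only main/secondary combinations."""
    ordered = sorted(segments, key=lambda seg: seg.read_start)
    pairs = zip(ordered, ordered[1:])
    return [(a, b) for a, b in pairs if a.primary != b.primary]
```

Each retained pair defines one candidate region to crop from the sub-image and pass to the classifier.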

The structural variation identification and characterization module specifically includes:

A structural variation training data set construction module. The structural variations used for training mainly comprise four simple types: deletion (DEL), insertion (INS), inversion (INV), and duplication (DUP). The real data use the structural variations of 2,500 samples from the 1000 Genomes Project as training data; however, DEL and INS make up the vast majority of the real data, which leads to an imbalanced training data set. The present invention therefore adds to the training data set noise-free structural variations simulated with the open-source software VISOR;

A convolutional neural network training module, used to train the AlexNet neural network. Training uses transfer learning, with the best-performing parameters from the Google ImageNet competition as the model initialization parameters; during training the parameters are adjusted by backpropagation and gradient descent to minimize the cross-entropy objective function. Cross-validation is used to select the optimal model as the final structural variation detection CNN model;

A structural variation graph characterization module, used to identify the segments of interest (SOIs) in the sub-images with the structural variation detection CNN model to obtain complex structural variation fragments, construct a structural variation representation graph from these fragments, and determine from the topology of the representation graphs whether different structural variations belong to the same type. Each node in the representation graph is a matching fragment contained in one of the SOIs of the structural variation feature sequence similarity image, and the nodes are ordered by increasing coordinate on the sample sequence. Each edge connects two adjacent matching fragments, that is, two fragments that are consecutive on the sample sequence. The representation graph is saved to a file in the Graphical Fragment Assembly (GFA) format; in the output file, S lines denote the matching fragments obtained from the structural variation feature sequence similarity image, and L lines denote the connections between fragments.
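Writing the representation graph in GFA can be sketched as below. The record layout follows the GFA 1 specification (tab-separated `H`, `S`, and `L` lines); the input format, a list of hypothetical `(name, length, strand)` tuples ordered by sample coordinate, and the choice to store only fragment lengths (`*` for the sequence, `LN:i:` for the length) are assumptions for the example.

```python
def to_gfa(fragments):
    """Serialize fragments ordered along the sample sequence as GFA 1 text."""
    lines = ["H\tVN:Z:1.0"]
    seen = set()
    # One S (segment) line per distinct matching fragment.
    for name, length, _ in fragments:
        if name not in seen:
            seen.add(name)
            lines.append(f"S\t{name}\t*\tLN:i:{length}")
    # One L (link) line per pair of fragments adjacent on the sample sequence.
    for (a, _, strand_a), (b, _, strand_b) in zip(fragments, fragments[1:]):
        lines.append(f"L\t{a}\t{strand_a}\t{b}\t{strand_b}\t0M")
    return "\n".join(lines) + "\n"
```

For example, `to_gfa([("s1", 100, "+"), ("s2", 50, "-"), ("s3", 80, "+")])` yields three S lines and two L lines, with the inverted middle fragment recorded by its `-` orientation.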

Simulation example

The simulation experiments evaluate and compare the present invention against four existing detection tools in four respects. The existing tools included in the comparison are CuteSV, PbSV, Sniffles, and SVIM. Both the present invention and the existing tools use default parameters, with a structural variation support threshold of 5, and only reads with a mapping quality of at least 20 are selected.

Experiment 1: The present invention does not rely on any model to detect structural variations; it detects structural variations of every degree of complexity solely by recognizing sequence differences encoded in images. Experiment 1 tests the ability of the present invention to detect simple structural variations on different sequencing platforms. For the two mainstream long-read sequencing platforms currently on the market, PacBio and Oxford Nanopore, this experiment follows the evaluation workflow for detection software of the Genome in a Bottle (GIAB) project, an authoritative international genome project, and uses sample HG002 as the benchmark set to comprehensively evaluate the detection ability of the present invention across sequencing platforms and sequencing data volumes. As shown in Figure 3, the present invention achieves results on par with the most advanced current detection software without relying on any model. This result further reflects the general applicability of the present invention, which can be applied to the detection of both simple and complex structural variations.

Experiment 2: Because the present invention addresses a frontier problem in this field, there is no industry-recognized benchmark set of complex structural variations for evaluating different methods. Therefore, based on the manually screened and validated complex structural variation results published in Nature in 2015, this experiment simulated 10 different complex structural variation types, each containing 300 complex variation events. The simulated complex structural variation data set was used to compare the performance of the present invention with existing detection tools. Unlike simple structural variations, complex structural variations usually contain multiple breakpoints of different types, referred to as sub-components. The evaluation uses two criteria to assess each detection method: 1) region match; 2) exact match. In brief, region match requires a method to correctly detect the genomic region containing the complex structural variation, while exact match additionally requires it to correctly detect the sub-components of the complex structural variation. The detection performance is shown in Figure 4. For the easier region match, the recall of the present invention is 93%, whereas CuteSV, the method with the second-highest recall, reaches only 70%. For exact match, Sniffles is the only existing tool that can detect complex structural variations; the recall of the present invention is 92%, twice that of Sniffles, and at this high recall the present invention still maintains a precision of 90%, which ensures the accuracy of the detected complex structural variations. The remaining detection methods cannot detect complex structural variations and can only locate the genomic interval in which they occur.
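The two evaluation criteria can be made concrete with a short sketch. The record layout (a dict with a `region` tuple and a `parts` list of sub-component types) and the overlap rule (any shared base on the same chromosome) are assumptions for illustration; the benchmark procedure actually used in the experiment is not specified at this level of detail.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def is_hit(truth, call, exact):
    """Region match needs overlap; exact match also needs identical sub-components."""
    if not overlaps(truth["region"], call["region"]):
        return False
    return (not exact) or sorted(truth["parts"]) == sorted(call["parts"])


def recall_precision(truth_set, call_set, exact=False):
    """Return (recall, precision) under region match or exact match."""
    found = sum(any(is_hit(t, c, exact) for c in call_set) for t in truth_set)
    correct = sum(any(is_hit(t, c, exact) for t in truth_set) for c in call_set)
    recall = found / len(truth_set) if truth_set else 0.0
    precision = correct / len(call_set) if call_set else 0.0
    return recall, precision
```

A call that lands in the right interval but reports only part of the internal structure counts toward region match and not toward exact match, which is why the two recall figures can differ for the same tool.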

Experiment 3: This experiment tests the robustness of the sequence image encoding and recognition framework of the present invention using alignment results of sequences with different read lengths. The structural variations of sample HG00733 published in Science in 2021 serve as the benchmark set. First, the PacBio high-fidelity (HiFi) sequencing reads of this sample were selected as the short DNA sequence input data set. To obtain a longer DNA sequence input data set, the experiment used contig sequences assembled from the PacBio HiFi reads. Both sets of sequences, which differ in length, were aligned to the human reference genome with existing sequence alignment tools to produce the input files. The detection performance on short DNA sequences was compared first; as shown in Figure 5, the present invention outperforms the model-based detection methods for structural variations of different lengths. Because the other detection methods cannot process the long DNA sequence data set, only the performance of the present method with short and with long DNA sequences as input was compared. As Figure 6 shows, the detection performance of the present invention is positively correlated with the length of the input DNA sequences, which also meets the structural variation detection needs of future longer and more accurate sequencing data.

Experiment 4: With the development of genome assembly and phasing technology over the past two years, more and more phased genome assemblies have become available. This experiment examines whether the present invention can meet the structural variation detection needs of future phased genomes. The phased assemblies of the three samples HG00733, HG00514, and NA19240 published in Science in 2021, together with the phased structural variation benchmark sets, were again used to evaluate the different detection tools; the phased assemblies of the three samples were each aligned to the human reference genome with existing alignment tools. Because humans are diploid, the benchmark set was split into H1 and H2 for comparison. As shown in Figure 7, the F-score of the present invention on both haplotypes is higher than that of the existing tools. This experiment shows that the present invention is well suited to genome structural variation detection based on phased assemblies.

Claims

1. The genome structure variation detection method based on deep learning is characterized by comprising the following steps of:

Step 1, extracting structural variation characteristic sequences: aligning the sample sequence with a reference genome sequence to obtain a global alignment result, and extracting a structural variation characteristic sequence according to the global alignment result; the matching fragments in the structural variation characteristic sequence are called main fragments; according to the alignment characteristics of the structural variation characteristic sequence, carrying out local Kmer realignment of the sequences of the unmatched fragments in the structural variation characteristic sequence against the reference genome sequence, the matched fragments obtained through the local Kmer realignment being called secondary fragments;

Step 2, structural variation characteristic sequence similarity image coding: adopting an RGB image three-channel coding mode, combining a main segment and a secondary segment, coding a structural variation characteristic sequence and a reference genome sequence to obtain a similarity RGB image of the reference genome sequence and a sample sequence, and simultaneously coding the reference genome sequence to obtain a similarity image of the reference genome sequence; subtracting the two images to obtain a structural variation characteristic sequence similarity image;

step 3, structural variation characteristic sequence similarity image segmentation: combining two adjacent main fragments in the structural variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only containing single structural variation; combining adjacent primary fragments and secondary fragments in the sub-image in sequence in pairs according to the sequence of the primary fragments and the secondary fragments on the reference genome sequence to obtain a fragment of interest;

step 4, identifying structural variation characteristic sequence similarity images and carrying out structural variation characterization: identifying all interesting fragments in the sub-images containing single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; systematically characterizing and classifying the complex structure variant fragments by using a graph data structure;

The step 2 specifically comprises the following steps:

1) RGB three-channel sequence similarity coding: coding the structural variation characteristic sequence and the reference genome sequence into a sequence matching channel (255, 0, 0), a sequence repeating channel (0, 0, 255) and a sequence reversing channel (0, 255, 0), and outputting a similarity RGB image of the reference genome sequence and the sample sequence; encoding the reference genome sequence into the sequence matching channel (255, 0, 0), the sequence repeating channel (0, 0, 255) and the sequence reversing channel (0, 255, 0) to obtain a self-similarity image of the reference genome sequence, and recording the position information of each matching segment;

2) Removal of reference genome sequence repeats: according to the coordinate positions, relative to the reference genome sequence, of the matching fragments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, searching the self-similarity image of the reference genome sequence for the fragment corresponding to each fragment in the similarity RGB image of the reference genome sequence and the sample sequence, and, if the corresponding fragment is found, removing it from the similarity RGB image of the reference genome sequence and the sample sequence to obtain the structural variation characteristic sequence similarity image.

2. The deep learning-based genome structure variation detection method according to claim 1, wherein in step 1, according to the comparison feature of the structure variation feature sequence, the sequence of the unmatched fragment in the structure variation feature sequence is locally Kmer aligned with the reference genome sequence, specifically: and extracting unmatched fragments from the CIGAR characters of the structural variation characteristic sequence and the reference genome sequence, and carrying out local Kmer re-comparison on the sequences of the unmatched fragments and the reference genome sequence to obtain secondary fragments.

3. The method for detecting genomic structural variation based on deep learning according to claim 1, wherein step 3 specifically comprises:

1) Single structural variation segmentation: combining two adjacent main fragments in the structural variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only containing single structural variation;

2) Structural variant image multi-objective segmentation: and sorting according to the coordinates of the main fragment and the secondary fragment on the structural variation characteristic sequence, and combining all fragments in the sub-images pairwise according to the sorting result, wherein the combination of the main fragment and the secondary fragment is used as the interested fragment.

4. The deep learning-based genomic structural variation detection method according to claim 1, wherein the structural variation detection CNN model training method in step 4 specifically comprises:

1) Constructing a structural variation training data set: the real data uses 2500 sample structural variation characteristic sequences in 1000Genome Project as training data, the virtual data uses VISOR virtual noise-free interference training samples as training data, and the training data set is formed by the real data and the virtual data;

2) Training data set encoding: encoding training data in the training data set by adopting the method in the step 2 to obtain an interested fragment of the training data set;

3) Model training: and inputting the training data set into a convolutional neural network, training the convolutional neural network, and obtaining a structural variation detection CNN model after training is completed.

5. The deep learning-based genome structure variation detection method according to claim 1, wherein in step 4, all interesting fragments in the sub-image containing the single structure variation are identified by using a pre-trained structure variation detection CNN model to obtain complex structure variation fragments; the complex structure variant fragments are systematically characterized and classified by using a graph data structure, and specifically:

Identifying interesting fragments in the sub-images through the structural variation detection CNN model to obtain complex structural variation fragments, constructing a structural variation characterization graph based on the complex structural variation fragments, and calculating whether different structural variations belong to the same type or not based on the topological structure of the structural variation characterization graph, wherein each node in the structural variation characterization graph is a fragment contained in all interesting fragments in the sub-images, and each side is connected with two continuous fragments on the sample sequence.

6. A deep learning-based genomic structural variation detection system comprising:

the structure variation characteristic sequence extraction module is used for aligning the sample sequence with the reference genome sequence to obtain a global alignment result, and extracting the structure variation characteristic sequence according to the global alignment result; the matching fragments in the structure variation characteristic sequence are called main fragments; according to the alignment characteristics of the structure variation characteristic sequence, carrying out local Kmer realignment of the sequences of the unmatched fragments in the structure variation characteristic sequence against the reference genome sequence, the matched fragments obtained through the local Kmer realignment being called secondary fragments;

the structure variation characteristic sequence coding module is used for coding the structure variation characteristic sequence and the reference genome sequence by adopting a three-channel coding mode of the RGB image to obtain a similarity RGB image of the reference genome sequence and the sample sequence; meanwhile, coding the reference genome sequence to obtain a self-similarity image of the reference genome sequence; subtracting the two images to obtain a structural variation characteristic sequence similarity image;

The structure variation characteristic sequence similarity image segmentation module is used for combining two adjacent main fragments in the structure variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only comprising one structure variation; combining adjacent primary and secondary fragments in the sub-image in the order of the primary and secondary fragments on the reference genome sequence to obtain a fragment of interest;

the structural variation recognition and characterization module is used for recognizing all interesting fragments in the sub-images containing the single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; systematically characterizing and classifying the complex structure variant fragments by using a graph data structure;

the structural variation characteristic sequence coding module comprises:

the RGB three-channel sequence similarity coding module codes the structural variation characteristic sequence and the reference genome sequence into a sequence matching channel (255, 0, 0), a sequence repeating channel (0, 0, 255) and a sequence reversing channel (0, 255, 0) and outputs a similarity RGB image of the reference genome sequence and the sample sequence; coding the reference genome sequence into the sequence matching channel (255, 0, 0), the sequence repeating channel (0, 0, 255) and the sequence reversing channel (0, 255, 0) to obtain a self-similarity image of the reference genome sequence, and recording the position information of each matching fragment;

the reference genome sequence repeated segment removal module is used for searching, according to the coordinate positions, relative to the reference genome sequence, of the matched segments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, the self-similarity image of the reference genome sequence for the segment corresponding to each segment in the similarity RGB image of the reference genome sequence and the sample sequence, and, if the corresponding segment is found, removing it from the similarity RGB image of the reference genome sequence and the sample sequence, thereby obtaining the structural variation characteristic sequence similarity image.

7. The deep learning based genomic structural variation detection system according to claim 6, wherein the structural variation feature sequence similarity image segmentation module comprises:

the single structure variation segmentation module is used for combining two adjacent main fragments in the structure variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only comprising one structure variation;

the structure variation image segmentation module is used for sorting the main segments and the secondary segments according to their coordinates on the structure variation characteristic sequence, combining each two adjacent segments according to the sorting result to obtain a combined segment, filtering out combined segments formed from two secondary segments or from two linear segments, and taking the combined segment formed by combining a main segment and a secondary segment as the interested segment.

8. The deep learning based genomic structural variation detection system of claim 6, wherein the structural variation identification and characterization module specifically comprises:

the structure variation training data set module is used for forming a training data set from real data and virtual data, wherein the real data uses 2500 samples of structure variation in 1000Genome Project as training data, and the virtual data uses VISOR virtual noise-free training data;

the training data set coding module is used for coding all training data in the training data set according to the structural variation characteristic sequence coding module to obtain an interested fragment of the training data set;

the convolutional neural network training module is used for inputting the training data set into a convolutional neural network, training the convolutional neural network and obtaining a structural variation detection CNN model after training is completed;

the structure variation characterization module is used for identifying the interested fragments in the sub-images through the structure variation detection CNN model to obtain complex structure variation fragments, constructing a structure variation characterization graph based on the complex structure variation fragments, calculating whether different structure variations belong to the same type based on the topological structure of the structure variation characterization graph, wherein each node in the structure variation characterization graph is a fragment contained in all the interested fragments in the sub-images, and each side is connected with two continuous fragments on the sample sequence.