CN118862971A

CN118862971A - A method for constructing a neural network model for predicting LOH regions and its application

Info

Publication number: CN118862971A
Application number: CN202310482961.XA
Authority: CN
Inventors: 唐飞; 王中华; 孙隽
Original assignee: Tianjin Bgi Technology Co ltd; Tianjin Medical Laboratory Bgi
Current assignee: Tianjin Bgi Technology Co ltd; Tianjin Medical Laboratory Bgi
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2024-10-29

Abstract

The present invention relates to the field of sequencing, and in particular to a method for constructing a neural network model for predicting LOH regions and its application. The invention discloses a method for screening LOH regions in which a neural network model is constructed from historical clinical sample data and known, previously identified LOH regions, and the final LOH regions are determined through data calibration and threshold selection. The resulting neural network model can quickly and accurately identify LOH regions from sequencing data. The invention can be integrated into sequencing-based detection products and used for the genetic diagnosis of imprinted-gene-related diseases.

Description

A method for constructing a neural network model for predicting LOH regions and its application

Technical Field

The present invention relates to the technical field of sequencing, and in particular to a method for constructing a neural network model for predicting LOH regions and an application thereof.

Background Art

Copy number variation, loss of heterozygosity, and uniparental disomy are large-scale genomic variations that can lead to many common genetic diseases. Loss of heterozygosity (LOH) refers to the situation in which one of the two alleles (or part of its nucleotide sequence) at the same locus on a pair of homologous chromosomes is lost, while the allele on the paired chromosome remains. In 1929, the genetic mechanism of LOH was first explained through the study of X-ray-induced mutation sites in Drosophila melanogaster. LOH is common in cancer; studies have shown that LOH can inactivate suppressor genes, thereby affecting the occurrence and progression of cancer. LOH arises mainly through three mechanisms: chromosome loss, partial chromosome deletion, and gene conversion, of which chromosome deletion is the main one. The presence of LOH on a chromosome suggests possible uniparental disomy (UPD). When UPD occurs on certain chromosomes, it causes related diseases through the genomic imprinting effect. A large number of studies have shown that the risk of Mendelian recessive genetic disease is significantly increased within LOH regions.

Existing methods for detecting LOH regions include short tandem repeat (STR) analysis, methylation detection, and chromosomal microarray analysis (CMA). STR detection requires highly polymorphic STR markers to be selected according to the purpose of the test and the genomic location, which limits the method and makes it costly. Methylation detection is time-consuming. CMA is currently the most suitable technology for detecting LOH, but because it is a high-throughput, high-resolution screening technology, the amount of LOH information it yields, even when data accuracy is ensured, is very large. Different thresholds must be set for screening according to the purpose, and the screened information must then be annotated by consulting a large body of literature or databases before a reasonable result report can be produced.

In recent years, with the continuous development of high-throughput sequencing technology, whole-exome sequencing (WES) has been widely used in disease prevention and treatment, for example for genetic diseases, rare syndromes, and complex diseases. In clinical testing, because most functional variants are concentrated in exon sequences and exome sequencing detects rare variants more readily, high-depth functional mutation data can be obtained with this technology for large numbers of historical samples or samples with incomplete information.

At present, WES data are usually analyzed with the bioinformatics software PLINK. This method scans each chromosome with a fixed-size sliding window to find runs of consecutive homozygous SNPs. PLINK first calculates, for a given SNP, the proportion of the sliding windows containing that SNP that are completely homozygous. If the proportion exceeds a preset threshold, the SNP is considered to lie within an LOH segment. A certain number of missing or heterozygous SNPs may be allowed in each sliding window to accommodate genotyping errors, genotyping failures, or rare variants. Finally, if the run of consecutive homozygous SNPs in a segment exceeds a count or distance threshold (number of SNPs or chromosomal distance), the segment is determined to be LOH. This method is not accurate enough on clinical data and produces both missed detections and false positives. The underlying reason is that a single set of thresholds cannot correctly classify all of the different WES clinical data.
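For illustration only, the window-based scan described above can be sketched in Python as follows. This is a simplified sketch of the idea and not PLINK's actual implementation or options. The function name, the parameter names, and the default values are all hypothetical.

```python
def homozygous_runs(genotypes, window=5, max_het=0, min_fraction=0.5, min_snps=4):
    """genotypes: one boolean per SNP, ordered by position; True = homozygous.

    Returns (first_index, last_index) pairs of SNP runs flagged as homozygous.
    All defaults are illustrative assumptions, not PLINK defaults.
    """
    n = len(genotypes)
    # A window is "homozygous" if it holds at most max_het heterozygous calls.
    starts = range(0, max(n - window + 1, 0))
    window_ok = [sum(not g for g in genotypes[s:s + window]) <= max_het
                 for s in starts]

    in_run = []
    for i in range(n):
        # Windows covering SNP i start at s with s <= i <= s + window - 1.
        covering = [window_ok[s]
                    for s in range(max(0, i - window + 1), min(i, n - window) + 1)]
        fraction = sum(covering) / len(covering) if covering else 0.0
        in_run.append(fraction >= min_fraction)

    # Keep maximal runs of flagged SNPs that contain at least min_snps SNPs.
    runs, start = [], None
    for i, flag in enumerate(in_run + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    return runs
```

Because every cutoff here is a fixed number, a setting tuned on one data set may miss or over-call regions on another, which is the limitation the preceding paragraph attributes to this approach.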

Therefore, there is an urgent need to develop a method that accurately analyzes LOH based on sequencing data. The present invention constructs a neural network classification model from a large amount of clinical data and, by retrospectively analyzing historical data, achieves accurate identification of LOH regions.

Summary of the Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, one aspect of the present invention provides a method for constructing a neural network model for predicting LOH regions, the method comprising:

(A) obtaining sequencing data of an LOH standard, aligning the sequencing data with a reference genome, and obtaining an alignment result carrying variation information, wherein the LOH standard is a sample with known LOH region location information;

(B) setting a sliding window on each chromosome of the alignment result, taking one end of each sliding step within the sliding window as a point, and obtaining, as the window slides, the SNP information of each point in the sliding window;

(C) based on the SNP information of each point in the sliding window, calculating the homozygosity rate of the points within 1.5 to 3 MB before each point in the sliding window, and constructing a two-dimensional matrix based on the homozygosity rate information;

(D) constructing a feature matrix according to whether each point in step (C) lies within an LOH region of the standard;

(E) constructing, from the feature matrix, a recurrent neural network model for predicting LOH regions, wherein the input layer of the recurrent neural network model is the two-dimensional matrix and the output layer is the probability value that the point lies in an LOH region.
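As a non-limiting illustration, steps (B) and (D) can be sketched in Python as follows. The sketch assumes that a point is the start coordinate of each window position, that SNP density is expressed as SNPs per megabase, and that the standard's LOH regions are given as half-open (start, end) intervals. The text does not fix these details, and the function names are hypothetical.

```python
def window_snp_stats(snps, chrom_length, window=3_500_000, step=100_000):
    """snps: list of (position, is_homozygous) for one chromosome.

    Returns one dict per point. A point is taken to be the start of a
    window position (an assumption made for this sketch).
    """
    points = []
    for start in range(0, chrom_length, step):
        end = min(start + window, chrom_length)
        in_window = [hom for pos, hom in snps if start <= pos < end]
        n_snp = len(in_window)
        points.append({
            "pos": start,
            "n_snp": n_snp,                # number of SNPs in the window
            "n_hom": sum(in_window),       # number of homozygous SNPs
            "density": n_snp / ((end - start) / 1_000_000),  # SNPs per Mb (assumed unit)
        })
    return points


def label_points(points, loh_regions):
    """Return 1 for points inside a known LOH region of the standard, else 0."""
    return [int(any(s <= p["pos"] < e for s, e in loh_regions)) for p in points]
```

The default window size and step correspond to the 3.5 MB and 100 kb values given among the embodiments. The linear scan over `snps` for every window keeps the sketch short; an indexed lookup would be preferable on real data.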

The present invention constructs a neural network model for identifying LOH regions, which can quickly and accurately identify LOH regions from sequencing data. When integrated into sequencing-based detection products and related clinical LOH detection products, the invention can be used for the genetic diagnosis of imprinted-gene-related diseases, can quickly locate LOH regions, improves interpretation efficiency, and reduces the turnaround time and cost of orthogonal validation tests.

According to some embodiments of the present invention, the LOH standard in step (A) contains at least 3 LOH regions.

According to some embodiments of the present invention, the length of each LOH region is not less than 5 MB.

According to some embodiments of the present invention, the total length of the LOH regions contained in the LOH standard is not less than 15 MB.

According to some embodiments of the present invention, in step (A), the sequencing data of the LOH standard includes any one of whole-exome sequencing data, whole-genome sequencing data, or panel sequencing data.

According to some embodiments of the present invention, the depth of the panel sequencing is greater than 30X.

According to some embodiments of the present invention, the reference genome includes any one of CHM13, hg19, hg38, GRCh37, GRCh38, b37, and hs37d5.

According to some embodiments of the present invention, in step (A), the alignment result is obtained using a variant calling tool.

According to some embodiments of the present invention, the variant calling tool includes at least one of GATK, Samtools, and DeepVariant.

According to some embodiments of the present invention, in step (B), the sliding step size is 50-150 kb.

According to some embodiments of the present invention, the sliding step size is 100 kb.

According to some embodiments of the present invention, in step (B), the sliding window size is 2-5 MB.

According to some embodiments of the present invention, the sliding window size is 3.5 MB.

According to some embodiments of the present invention, in step (B), the SNP information includes the number of SNPs, the number of homozygous SNPs, and the SNP density.

According to some embodiments of the present invention, in step (C), the homozygosity rate of the points within 2 MB before each point in the sliding window is calculated.
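As a non-limiting illustration of step (C), the two-dimensional matrix can be sketched in Python as follows. The sketch assumes that each row holds the homozygosity rates (homozygous SNPs divided by total SNPs) of the points in the 2 MB preceding a given point, excluding the point itself, and that rows near the start of a chromosome are left-padded with zeros. The text does not specify the matrix layout, and the function name is hypothetical.

```python
def homozygosity_matrix(n_hom, n_snp, lookback=2_000_000, step=100_000):
    """n_hom, n_snp: per-point homozygous and total SNP counts, in order.

    Returns a list of rows, one per point, each with lookback // step values.
    """
    k = lookback // step  # number of preceding points per row (20 by default)
    # Points whose window contains no SNP get a rate of 0.0 (an assumption).
    rates = [h / s if s else 0.0 for h, s in zip(n_hom, n_snp)]
    matrix = []
    for i in range(len(rates)):
        row = rates[max(0, i - k):i]                   # rates of preceding points
        matrix.append([0.0] * (k - len(row)) + row)    # left-pad near chromosome start
    return matrix
```

With a 100 kb step and a 2 MB look-back, each point is described by the 20 values that precede it, which gives a recurrent model a fixed-length sequence per point.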

According to some embodiments of the present invention, the neural network model includes any one of a recurrent neural network model, a convolutional neural network model, and a radial basis function neural network model.

According to some embodiments of the present invention, the neural network model is a recurrent neural network model, and the recurrent neural network model includes any one of a long short-term memory (LSTM) model, a bidirectional long short-term memory model, and a gated recurrent unit (GRU) model.

According to some embodiments of the present invention, the neural network model has no fewer than 3 hidden layers.

According to some embodiments of the present invention, the activation function of the neural network model includes any one of the tanh function, the sigmoid function, and the ReLU function.

Another aspect of the present invention provides a method for predicting LOH regions in a sample to be tested, the method comprising:

(1) obtaining sequencing data of the sample to be tested, aligning the sequencing data with a reference genome, and obtaining an alignment result carrying variation information;

(2) setting a sliding window on each chromosome of the alignment result, taking one end of each sliding step within the sliding window as a point, and obtaining, as the window slides, the SNP information of each point in the sliding window;

(3) based on the SNP information of each point in the sliding window, calculating the homozygosity rate of the points within 1.5 to 3 MB before each point in the sliding window, and constructing a two-dimensional matrix based on the homozygosity rate information;

(4) using the two-dimensional matrix of step (3) as the input layer, inputting it into the recurrent neural network model for predicting LOH regions constructed by the above construction method, and obtaining an output result, wherein the output result is the probability value that a point lies in an LOH region, and each probability value corresponds to one point in the sliding window of step (3);

(5) collecting the output results for consecutive points spanning at least 2 MB in the sliding window, wherein, when every output probability value of the consecutive points is greater than 0.6, the region formed by the consecutive points is determined to be an LOH region of the sample to be tested.
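As a non-limiting illustration of step (5), the decision rule can be sketched in Python as follows. The sketch assumes that points are evenly spaced by the sliding step and that a run of points covers the interval from its first point to one step past its last point. The text does not define how the 2 MB span is measured, and the function name is hypothetical.

```python
def call_loh_regions(positions, probs, threshold=0.6,
                     min_span=2_000_000, step=100_000):
    """positions: point coordinates (bp) on one chromosome, ascending.
    probs: model output probability for each point.

    Returns (start, end) intervals where every point has prob > threshold
    and the run spans at least min_span bases.
    """
    regions, run_start, last_pos = [], None, None
    # A trailing sentinel with probability 0.0 closes a run at the chromosome end.
    for pos, p in list(zip(positions, probs)) + [(None, 0.0)]:
        if pos is not None and p > threshold:
            if run_start is None:
                run_start = pos
            last_pos = pos
        elif run_start is not None:
            end = last_pos + step  # assumed: run extends one step past its last point
            if end - run_start >= min_span:
                regions.append((run_start, end))
            run_start = None
    return regions
```

The comparison is strict, so a point whose probability equals 0.6 exactly ends the run, matching the "greater than 0.6" wording.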

According to some embodiments of the present invention, in step (1), the sequencing data includes any one of whole-exome sequencing data, whole-genome sequencing data, and panel sequencing data.

According to some embodiments of the present invention, the alignment result is obtained using a variant calling tool.

According to some embodiments of the present invention, in step (2), the sliding step size is 50-150 kb.

According to some embodiments of the present invention, in step (2), the sliding window size is 2-5 MB.

According to some embodiments of the present invention, the SNP information in step (2) includes the number of SNPs, the number of homozygous SNPs, and the SNP density.

According to some embodiments of the present invention, in step (3), the homozygosity rate of the points within 2 MB before each point in the sliding window is calculated.

According to some embodiments of the present invention, the method further comprises: establishing a calibration model, obtaining a Z-score threshold based on the calibration model, and screening the LOH regions of the sample to be tested output in step (4) based on the Z-score threshold and an SNP density threshold, to determine the final LOH regions of the sample to be tested.

According to some embodiments of the present invention, the Z-score threshold is obtained by the following method:

1) replacing the sequencing data of the sample to be tested in step (1) with historical data of clinical samples and carrying out steps (1) to (4) to obtain the SNP information and LOH regions of the clinical samples, wherein the historical data includes whole-exome sequencing data of the clinical samples;

2) estimating the mean μ and standard deviation δ of the homozygosity rate of the clinical samples in the LOH regions, and obtaining the homozygosity-rate Z-Score of a sample according to the following formula:

Z-Score = (X - μ)/δ,

where X is the homozygosity rate of the LOH region of a single sample among the clinical samples, μ is the mean homozygosity rate of the overall data, and δ is the standard deviation of the homozygosity rate of the overall data;

3) based on the historical data of multiple clinical samples, retaining the upper 0.5% tail of the standard normal distribution, the corresponding Z-Score value being the Z-score threshold.
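As a non-limiting illustration of steps 2) and 3), the Z-Score and the tail cutoff can be sketched in Python with the standard library. The sketch assumes that δ is the population standard deviation and that the retained 0.5% region is the upper tail. The function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def z_scores(rates):
    """Z-Score = (X - mu) / delta for each homozygosity rate X in rates.

    Raises ZeroDivisionError if all rates are identical (delta == 0).
    """
    mu = mean(rates)
    delta = pstdev(rates)  # population standard deviation (an assumption)
    return [(x - mu) / delta for x in rates]


def upper_tail_cutoff(tail=0.005):
    """Z value above which the given fraction of a standard normal lies."""
    return NormalDist().inv_cdf(1 - tail)
```

For a 0.5% upper tail the exact standard normal quantile is approximately 2.576, which is close to the 2.56 screening value stated in the embodiments.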

According to some embodiments of the present invention, the SNP density threshold is obtained by the following method:

4) replacing the sequencing data of the sample to be tested in step (1) with historical data of clinical samples and carrying out steps (1) to (4) to obtain the SNP information and LOH regions of the clinical samples, wherein the historical data includes whole-exome sequencing data of the clinical samples;

5) calculating the mean μ and standard deviation σ of the SNP density of the clinical samples in the LOH regions and, according to the 3σ rule, selecting the value μ-3σ as the SNP density threshold.
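As a non-limiting illustration of step 5), the μ-3σ cutoff can be sketched in Python as follows. The sketch assumes that σ is the population standard deviation of the SNP densities of all LOH regions found in the historical samples. The function name is hypothetical.

```python
from statistics import mean, pstdev


def snp_density_threshold(densities):
    """Return mu - 3 * sigma for the SNP densities of historical LOH regions."""
    mu = mean(densities)
    sigma = pstdev(densities)  # population standard deviation (an assumption)
    return mu - 3 * sigma
```

Regions whose SNP density falls below this value lie outside the three-sigma band and would be screened out as too sparsely covered to trust.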

According to some embodiments of the present invention, the number of clinical samples is not less than 50.

According to some embodiments of the present invention, the number of clinical samples is not less than 200.

According to some embodiments of the present invention, the screening criteria for determining the final LOH regions of the sample to be tested are:

the Z-Score is not less than 2.56 and the SNP density is not less than 1.25.
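As a non-limiting illustration, the final screening criterion can be sketched in Python as follows. The dictionary keys `z_score` and `snp_density` and the function name are hypothetical, and the unit of the SNP density is whatever unit was used when the threshold was derived.

```python
def filter_loh_regions(regions, z_min=2.56, density_min=1.25):
    """Keep candidate regions that satisfy both screening criteria.

    regions: list of dicts with hypothetical keys 'z_score' and 'snp_density'.
    """
    return [r for r in regions
            if r["z_score"] >= z_min and r["snp_density"] >= density_min]
```

Both comparisons are inclusive, matching the "not less than" wording of the criterion.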

In yet another aspect, the present invention provides a system for predicting LOH regions in a sample to be tested, the system comprising:

a data comparison module, configured to align the sequencing data of the sample to be tested with reference genome data to obtain an alignment result;

a data statistics module connected to the data comparison module, configured to set a sliding window on each chromosome of the alignment result, take one end of each sliding step within the sliding window as a point, and collect, as the window slides, the SNP information of each point in the sliding window;

a matrix construction module connected to the data statistics module, configured to calculate, based on the SNP information of each point in the sliding window, the homozygosity rate of the points within 1.5 to 3 MB before each point in the sliding window, and to construct a two-dimensional matrix based on the homozygosity rate information;

a data output module connected to the matrix construction module and comprising a data model module, the data output module being configured to take the two-dimensional matrix from the matrix construction module as the input layer and input it into the data model module, wherein the output layer of the data output module is the probability value that a point lies in an LOH region, each probability value corresponds to one point in the sliding window in step (3), and the neural network model for predicting LOH regions contained in the data model module is constructed by the above construction method;

a data determination module connected to the data output module, configured to collect the output results for consecutive points spanning at least 2 MB in the sliding window, wherein, when every output probability value of the consecutive points is greater than 0.6, the region formed by the consecutive points is determined to be an LOH region of the sample to be tested.

According to some embodiments of the present invention, in the data comparison module, the sequencing data includes whole-exome sequencing data.

According to some embodiments of the present invention, in the data comparison module, the reference genome includes any one of CHM13, hg19, hg38, GRCh37, GRCh38, b37, and hs37d5.

According to some embodiments of the present invention, in the data comparison module, the alignment result is obtained using a variant calling tool.

According to some embodiments of the present invention, in the data comparison module, the variant calling tool includes at least one of GATK, Samtools, and DeepVariant.

According to some embodiments of the present invention, in the data statistics module, the sliding step size is 50-150 kb.

According to some embodiments of the present invention, in the data statistics module, the sliding step size is 100 kb.

According to some embodiments of the present invention, in the data statistics module, the sliding window size is 2-5 MB.

According to some embodiments of the present invention, in the data statistics module, the sliding window size is 3.5 MB.

According to some embodiments of the present invention, in the data statistics module, the SNP information includes the number of SNPs, the number of homozygous SNPs, and the SNP density.

According to some embodiments of the present invention, in the matrix construction module, the homozygosity rate of the points within 2 MB before each point in the sliding window is calculated.

According to some embodiments of the present invention, the system further comprises a calibration model module connected to the data output module, wherein the calibration model module obtains a Z-score threshold based on the calibration model and screens the LOH regions of the sample to be tested output by the data output module based on the Z-score threshold and an SNP density threshold, to determine the final LOH regions of the sample to be tested.

1) replacing the sequencing data of the sample to be tested in the data comparison module with historical data of clinical samples and passing the data through the data comparison module to the data output module to obtain the SNP information and LOH regions of the clinical samples, wherein the historical data includes whole-exome sequencing data of the clinical samples;

Z-Score = (X - μ)/δ,

According to some embodiments of the present invention, the screening criteria for determining the final LOH regions of the sample to be tested are: the Z-Score is not less than 2.56 and the SNP density is not less than 1.25.

Another aspect of the present invention provides an electronic device for predicting LOH regions in a sample, comprising a memory and a processor;

wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the above method for predicting LOH regions in a sample.

In yet another aspect, the present invention provides a computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the above method for predicting LOH regions in a sample.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the present invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:

FIG. 1 shows a schematic flow chart of a method for constructing a neural network model for predicting LOH regions in one embodiment of the present invention;

FIG. 2 shows the overall feature extraction and modeling based on the neural network model in one embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described in detail below. The embodiments described below are exemplary, are intended only to explain the present invention, and should not be construed as limiting it.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, unless otherwise specified, "plurality" means two or more.

The endpoints of the ranges and any values disclosed in the present invention are not limited to the precise range or value; these ranges or values should be understood to include values close to them. For numerical ranges, the endpoint values of the ranges, the endpoint values and individual point values, and the individual point values may be combined with one another to give one or more new numerical ranges, and these numerical ranges should be regarded as specifically disclosed herein.

To make the present invention easier to understand, certain technical and scientific terms are specifically defined below. Unless clearly defined otherwise elsewhere in the present invention, all other technical and scientific terms used herein have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs.

In the present invention, the terms "comprise" and "include" are open-ended expressions; that is, they include the content specified in the present invention without excluding other content.

In the present invention, the terms "optionally" and "optional" generally mean that the subsequently described event or circumstance may but need not occur, and that the description includes cases in which the event or circumstance occurs and cases in which it does not.

在本发明中，术语“全外显子测序”是指利用序列捕获技术将全基因组外显子区域DNA捕捉并富集后进行高通量测序的基因组分析方法。In the present invention, the term "whole exome sequencing" refers to a genome analysis method that uses sequence capture technology to capture and enrich DNA in the exon region of the whole genome and then performs high-throughput sequencing.

在本发明中，术语“单亲源二体(Uniparental Disomy，UPD)”指来自父母一方的染色体片段被另一方的同源部分取代，或一个个体的两条同源染色体都来自同一亲体，前者称为节段性单亲源二体。In the present invention, the term "Uniparental Disomy (UPD)" refers to a situation where a chromosome segment from one parent is replaced by a homologous segment from the other parent, or two homologous chromosomes of an individual are from the same parent. The former is called segmental uniparental disomy.

在本发明中，术语“短串联重复序列(short tandem repeats，STR)”，也称微卫星DNA(microsatellite DNA),通常是基因组中由1～6个碱基单元组成的一段DNA重复序列，由于核心单位重复数目在个体间呈高度变异性并且数量丰富，构成了STR基因座的遗传多态性。一般认为人类基因组平均每15kb就存在一个STR基因座。In the present invention, the term "short tandem repeats (STR)", also known as microsatellite DNA, is usually a DNA repeat sequence consisting of 1 to 6 base units in the genome. Since the number of core unit repeats is highly variable between individuals and the number is abundant, it constitutes the genetic polymorphism of the STR locus. It is generally believed that there is an STR locus every 15kb in the human genome on average.

在本发明中，术语“染色体微阵列分析”，指检测人类全基因组DNA重复和缺失的一种方法。它是一个高分辨率的全基因组筛选，可以识别主要的染色体非整倍性以及通过常规核型分析无法检测到的特定基因变化的位置和类型。In the present invention, the term "chromosomal microarray analysis" refers to a method for detecting DNA duplications and deletions in the human genome. It is a high-resolution whole-genome screening that can identify major chromosomal aneuploidies and the location and type of specific genetic changes that cannot be detected by conventional karyotype analysis.

在本发明中，术语“VCF文件”用于描述SNP(单个碱基上的变异)，INDEL(插入缺失标记)和SV(结构变异位点)结果的文本文件。In the present invention, the term "VCF file" is used to describe text files of SNP (single base variation), INDEL (insertion and deletion markers) and SV (structural variation site) results.

在本发明中，术语“CNV”指异常的DNA拷贝数变化，是许多人类疾病(如癌症、遗传性疾病、心血管疾病)的一种重要分子机制。作为疾病的一项生物标志，染色体水平的缺失、扩增等变化已成为许多疾病研究的热点，主要操作方法是，通过在一张芯片上用标记不同荧光素的样品(病例样品和对照样品)进行共杂交可检测样本基因组相对于对照基因组的DNA拷贝数变化(CNV)，常用于肿瘤或遗传性疾病全基因组CNV检测，直观地表现出肿瘤及遗传性疾病基因组DNA在整个染色体组的缺失或扩增。对肿瘤而言缺失片段可能包含抑癌基因，而扩增片段则可能存在致癌基因。In the present invention, the term "CNV" refers to abnormal DNA copy number changes, which is an important molecular mechanism for many human diseases (such as cancer, hereditary diseases, and cardiovascular diseases). As a biomarker of disease, changes such as deletion and amplification at the chromosome level have become a hot topic in the study of many diseases. The main operation method is to detect the DNA copy number changes (CNV) of the sample genome relative to the control genome by co-hybridizing samples (case samples and control samples) labeled with different fluorescent substances on a chip. It is often used for whole genome CNV detection of tumors or hereditary diseases, and intuitively shows the deletion or amplification of tumor and hereditary disease genomic DNA in the entire chromosome group. For tumors, the deleted fragments may contain tumor suppressor genes, while the amplified fragments may contain oncogenes.

In the present invention, the term "recurrent neural network model" (Recurrent Neural Network, RNN) refers to a class of recursive neural networks that take sequence data as input, recurse along the direction in which the sequence evolves, and connect all nodes (recurrent units) in a chain. Research on recurrent neural networks began in the 1980s and 1990s, and they developed into one of the deep learning algorithms in the early 21st century. Bidirectional recurrent neural networks (Bidirectional RNN, Bi-RNN) and long short-term memory networks (LSTM) are common recurrent neural networks.

In the present invention, the term "Z-Score" (Z score, also called the standard score) is the difference between a value and the mean, divided by the standard deviation. In statistics, the standard score is the signed number of standard deviations by which an observation or data point lies above or below the mean of the observed or measured values.

In the present invention, the term "3σ rule" (three-sigma rule) means that samples whose values fall within (μ-3σ, μ+3σ) are regarded as the majority of samples, while samples lying more than three standard deviations from the mean are regarded as the minority. Here μ denotes the mean SNP density of all LOH intervals, and σ denotes the standard deviation of the SNP density of all LOH intervals.

A method for constructing a neural network model for predicting LOH regions

According to some embodiments, the present invention provides a method for constructing a neural network model for predicting LOH regions, as shown in FIG. 1, comprising:

S110, obtaining sequencing data of an LOH standard, aligning the sequencing data with a reference genome, and obtaining an alignment result carrying variation information, wherein the LOH standard is a sample whose LOH region positions are known;

S120, setting a sliding window on the information of each chromosome in the alignment result, taking one end of the sliding step in the sliding window as a point, and obtaining the SNP information of each point in the sliding window during sliding;

S130, based on the SNP information of each point in the sliding window, computing the homozygosity rate of the points within the 1.5 to 3 MB preceding each point in the sliding window, and constructing a two-dimensional matrix from the homozygosity rate information;

S140, constructing a feature matrix according to whether the points in step S130 lie within the LOH regions of the standard;

S150, constructing, from the feature matrix, a recurrent neural network model for predicting LOH regions, wherein the input layer of the model is the two-dimensional matrix and the output layer is the probability that the point lies in an LOH region.

According to a specific embodiment of the present invention, the output layer gives a value in the interval [0, 1], wherein each value corresponds to a point described in step S140, and the closer the value is to 1, the more likely the point is judged to lie in an LOH region.

According to some specific embodiments of the present invention, the neural network model may be any model known in the art, including a recurrent neural network model, a convolutional neural network model or a radial basis neural network model, and is preferably a recurrent neural network model.

According to some embodiments of the present invention, the recurrent neural network model includes any one of a long short-term memory model, a bidirectional long short-term memory model and a Gated Recurrent Unit model.

According to a more specific embodiment, the present invention provides a method for constructing a recurrent neural network (Recurrent Neural Network, RNN) model for predicting LOH regions, comprising:

1) Obtaining VCF files

WES genome data of 5 known clinical samples are first obtained and aligned with the human reference genome; variants are identified and output as VCF files;

2) Obtaining the LOH region coordinates of the known clinical samples

The known clinical samples of step 1) are analyzed by chromosomal microarray (CMA). Specifically, the samples are tested with the CytoScan 750K SNP microarray chip (SNP750K; 550,000 CNV markers and 200,000 single nucleotide polymorphism (SNP) markers) produced by Affymetrix, USA. Genomic DNA digestion, amplification, purification, fragmentation and labeling, chip hybridization, washing and scanning, and data analysis are carried out strictly according to the standard experimental protocol provided by Affymetrix to obtain the LOH segment coordinates of the known clinical samples;

3) Obtaining model features

As shown in FIG. 2, the number of SNPs, the number of homozygous SNPs and the SNP density in the window are first obtained for each point from the VCF file, using a sliding window of size 3.5MB and step 100KB. Then, starting from the 2MB position of each chromosome, the homozygosity rate (number of homozygous SNPs / total number of SNPs) in the sliding window is obtained for every point within the 2MB preceding the current point, giving a 1*20 two-dimensional matrix (2MB / 100KB = 20 points). Next, according to whether each point falls within a known LOH region, all points on the chromosome are divided into two classes (inside or outside an LOH region), and an N*20 matrix (N being the number of points inside/outside LOH regions) is constructed as the feature matrix. During modeling, each site is encoded as 0 or 1 according to whether it lies in an LOH region: 1 if it does, 0 otherwise;
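The feature extraction of step 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the function names `window_stats` and `feature_vectors` are hypothetical, the SNP positions and homozygosity flags are assumed to have been parsed from the VCF file into sorted lists beforehand, and the window is assumed to end at each point, which the text does not specify.

```python
from bisect import bisect_left, bisect_right

WINDOW = 3_500_000    # sliding window size in bases (3.5MB in the text)
STEP = 100_000        # sliding step in bases (100KB in the text)
LOOKBACK = 2_000_000  # span before each point used for features (2MB in the text)


def window_stats(positions, is_hom, end):
    """Return (SNP count, homozygous SNP count, homozygosity rate) for the
    window [end - WINDOW, end]. `positions` must be sorted ascending."""
    lo = bisect_left(positions, end - WINDOW)
    hi = bisect_right(positions, end)
    total = hi - lo
    hom = sum(is_hom[lo:hi])
    # Assumption: a window without SNPs gets a rate of 0.0.
    rate = hom / total if total else 0.0
    return total, hom, rate


def feature_vectors(positions, is_hom, chrom_length):
    """Yield (point, rates), where rates holds the homozygosity rates of the
    20 points spanning the 2MB up to and including the point."""
    n_back = LOOKBACK // STEP  # 2MB / 100KB = 20
    points = list(range(STEP, chrom_length + 1, STEP))
    rates = [window_stats(positions, is_hom, p)[2] for p in points]
    for i in range(n_back - 1, len(points)):
        yield points[i], rates[i - n_back + 1 : i + 1]
```

Each yielded `rates` list corresponds to one 1*20 row; stacking the rows for all points gives the N*20 matrix described above.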

4) Constructing a recurrent neural network (RNN) model from the feature matrix

The input layer of the recurrent neural network (RNN) model is the 1*20 two-dimensional matrix, which passes through a hidden layer of 5 neurons to an output layer of one neuron. The output is a value in [0,1], and each output value corresponds to one of the points described in step 3); the closer the value is to 1, the more likely the point is judged to lie in an LOH region.

The tanh function (Formula 1) is used for all activation functions of the RNN model, and the classic long short-term memory (LSTM) model is chosen as the RNN model. The overall feature extraction and RNN modeling are shown in FIG. 2;

Formula 1:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
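To make the architecture of step 4) concrete, the following is a minimal pure-Python sketch of a forward pass through an LSTM with 5 hidden units and a single output neuron. Several choices are assumptions that the text does not state: the 20 homozygosity rates are fed one per time step, the weights are random and untrained, and the final output is squashed with a logistic sigmoid so that it falls in (0, 1), since tanh alone ranges over (-1, 1). The names `init_params` and `lstm_probability` are hypothetical.

```python
import math
import random

HIDDEN = 5  # hidden units, as described in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(seed=0):
    """Random, untrained weights; a real model would learn these from the
    N*20 feature matrix and its 0/1 labels."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    gates = {}
    for name in ("i", "f", "o", "g"):  # input, forget, output, candidate
        gates[name] = {
            "w_x": vec(HIDDEN),                           # input weights
            "w_h": [vec(HIDDEN) for _ in range(HIDDEN)],  # recurrent weights
            "b": vec(HIDDEN),                             # biases
        }
    return {"gates": gates, "w_out": vec(HIDDEN), "b_out": 0.0}


def lstm_probability(rates, params):
    """Run the sequence of homozygosity rates through the LSTM and return
    a value in (0, 1) for the point the sequence belongs to."""
    h = [0.0] * HIDDEN  # hidden state
    c = [0.0] * HIDDEN  # cell state
    for x in rates:
        pre = {}
        for name, gate in params["gates"].items():
            pre[name] = [
                gate["w_x"][j] * x
                + sum(gate["w_h"][j][k] * h[k] for k in range(HIDDEN))
                + gate["b"][j]
                for j in range(HIDDEN)
            ]
        i_gate = [sigmoid(v) for v in pre["i"]]
        f_gate = [sigmoid(v) for v in pre["f"]]
        o_gate = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["g"]]  # tanh as in Formula 1
        c = [f_gate[j] * c[j] + i_gate[j] * cand[j] for j in range(HIDDEN)]
        h = [o_gate[j] * math.tanh(c[j]) for j in range(HIDDEN)]
    logit = sum(w * hj for w, hj in zip(params["w_out"], h)) + params["b_out"]
    return sigmoid(logit)  # assumption: sigmoid output layer
```

With untrained weights the returned value carries no biological meaning; the sketch only shows how a 1*20 input can be reduced to a single score per point.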

5) Determining suspected LOH regions

A region formed by consecutive sites whose output values are all greater than 0.6 and whose length exceeds 2MB is selected as a suspected LOH region;
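The rule in step 5) can be illustrated with a short Python sketch. It assumes that the points are evenly spaced along one chromosome and that the length of a region is measured from its first to its last point; both are assumptions, and the function name `suspected_regions` is hypothetical.

```python
MIN_LENGTH = 2_000_000  # minimum region length in bases (2MB in the text)
CUTOFF = 0.6            # output value cutoff from the text


def suspected_regions(points, probs):
    """Return (start, end) pairs for runs of consecutive points whose
    outputs all exceed CUTOFF and whose span exceeds MIN_LENGTH."""
    regions = []
    run_start = None
    prev = None
    for point, prob in zip(points, probs):
        if prob > CUTOFF:
            if run_start is None:
                run_start = point
            prev = point
        else:
            if run_start is not None and prev - run_start > MIN_LENGTH:
                regions.append((run_start, prev))
            run_start = None
    # Close a run that reaches the end of the chromosome.
    if run_start is not None and prev - run_start > MIN_LENGTH:
        regions.append((run_start, prev))
    return regions
```

A single point at or below the cutoff ends the run, so two long stretches separated by one low-scoring point are reported as separate regions.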

6) Establishing the calibration model

Using historical clinical sample data as the reference data set, the mean μ and standard deviation δ of the homozygosity rate of the overall sample population in the suspected LOH regions are estimated, and the homozygosity-rate Z value of a sample is obtained according to Formula 2,

Formula 2:

$$Z = \frac{x - \mu}{\delta}$$

where x is the observed value for the individual, μ is the mean homozygosity rate of the overall data, and δ is the standard deviation of the homozygosity rate of the overall data;
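As a worked illustration of the calibration in step 6), the sketch below estimates μ and δ from a reference set and standardizes one observation. The reference values are made up for the example, the function names are hypothetical, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def fit_reference(reference_rates):
    """Return (mu, delta): mean and population standard deviation of the
    homozygosity rates observed in the reference data set."""
    return mean(reference_rates), pstdev(reference_rates)


def z_score(x, mu, delta):
    """Standardize an observed homozygosity rate x against the reference."""
    return (x - mu) / delta


# Made-up reference homozygosity rates, for illustration only.
mu, delta = fit_reference([0.4, 0.5, 0.6])
z = z_score(0.7, mu, delta)  # (0.7 - 0.5) / 0.0816... which is about 2.449
```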

7) Threshold screening and LOH region determination

After Z-Score standardization, regions falling in the top 0.5% of the standard normal distribution are retained. When the historical clinical sample data are 300 normal clinical WES samples, the top 0.5% of the standard normal distribution corresponds to regions with Z-Score>2.56;

The SNP density of the samples located in LOH regions is then tallied. According to the 3σ rule, the value μ-3σ is chosen as the SNP density threshold, where μ is the mean SNP density and σ is the standard deviation of the SNP density; regions with a density greater than 1.25 are taken as the final LOH regions.
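The μ-3σ threshold of step 7) can be computed as in the sketch below. The density values are invented for illustration and are not data from the invention, the unit of SNP density is not specified in the text, and `density_threshold` is a hypothetical name; the example is not intended to reproduce the 1.25 cutoff.

```python
from statistics import mean, pstdev


def density_threshold(densities):
    """Lower 3-sigma bound (mu - 3 * sigma) of the SNP densities observed
    in LOH regions of the reference samples."""
    mu = mean(densities)
    sigma = pstdev(densities)  # assumption: population standard deviation
    return mu - 3 * sigma


# Invented SNP densities: mean 2.5, sigma about 0.408, threshold about 1.275.
threshold = density_threshold([2.0, 2.5, 3.0])
```

Regions whose SNP density falls below the threshold are the sparse outliers that the 3σ rule treats as the minority.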

According to a specific embodiment of the present invention, the activation function is used to convert the output into a probability value between 0 and 1. These probability values can be interpreted as the confidence, or probability, that a given input sample belongs to a given class: the closer the value is to 1, the more likely the point is judged to be in an LOH region, and the closer it is to 0, the less likely.

According to some embodiments of the present invention, in step 1), the LOH regions of each known clinical sample are known, and the LOH regions differ from one known clinical sample to another.

According to some embodiments of the present invention, in step 1), the number of known clinical samples is no fewer than 3, preferably 3-10, and more preferably 5; with 3-7 known clinical samples, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, in step 2), the LOH region coordinates of the known clinical samples can be obtained by any technique known in the art, including chromosomal microarray (CMA) analysis, Mutagenically separated PCR and genome-wide SNP microarray chips (SNP array).

According to some embodiments of the present invention, in step 1), the human reference genome includes any one of CHM13, hg19, hg38, GRCh37, GRCh38, b37 and hs37d5.

According to some embodiments of the present invention, the size of the sliding window in step 3) is between 2 and 5MB, preferably 3.5MB; when the sliding window is within this range, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, the step of the sliding window in step 3) is between 50 and 150kb, preferably 100kb; when the sliding step is within this range, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, the homozygosity rate statistics in step 3) may start from any position between 1.5 and 3MB on the chromosome, preferably the 2MB position; when the starting position is within this range, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, the number of hidden layers in step 4) is no fewer than 3, preferably between 3 and 7, and more preferably 5; when the number of hidden layers is within the range of 3 to 7, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, the SNP density cutoff is derived from statistics on the SNP density of historical clinical samples in LOH regions; when the SNP density is greater than 1.25, the prediction of LOH regions is more accurate.

According to some embodiments of the present invention, the baseline may be the region within the top 1% of the population after standard normalization, preferably the top 0.5%; when the Z-Score baseline is within this range, the prediction of LOH regions is more accurate.
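The relationship between an upper-tail fraction and a Z-Score cutoff can be checked with Python's standard library, as in the sketch below. It assumes an exactly standard normal distribution; the helper names are hypothetical. Under that assumption the top 0.5% begins at about Z = 2.576, so the 2.56 used in the text is a slightly more permissive cutoff that retains roughly the top 0.52%.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def upper_tail_cutoff(fraction):
    """Z value above which the given fraction of a standard normal lies."""
    return std_normal.inv_cdf(1.0 - fraction)


def upper_tail_fraction(z):
    """Fraction of a standard normal distribution lying above z."""
    return 1.0 - std_normal.cdf(z)


z_top_half_percent = upper_tail_cutoff(0.005)  # about 2.576
z_top_one_percent = upper_tail_cutoff(0.01)    # about 2.326
tail_above_2_56 = upper_tail_fraction(2.56)    # about 0.0052
```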

According to some embodiments of the present invention, the activation function in step 4) includes any one of the tanh function, the Sigmoid function and the ReLU function.

Application of the neural network model in products for detecting LOH regions

According to some specific embodiments, the present invention provides a method for applying the neural network model in products for detecting LOH regions, the method comprising:

1) aligning the WES genome data of the sample to be tested with the human reference genome, identifying variants and outputting a VCF file;

2) feeding the VCF file into the RNN model for predicting LOH regions to obtain the suspected LOH regions of the sample to be tested;

3) feeding the historical clinical sample data into the RNN model for predicting LOH regions, and obtaining the mean μ and standard deviation δ of the homozygosity rate of the historical clinical sample data in the suspected LOH regions; then, following the RNN-based method for screening suspected LOH regions, calculating the homozygosity-rate Z value of the historical clinical sample data according to Formula 2, and retaining the regions in the top 0.5% of the standard normal distribution, that is, the regions with Z-Score>2.56,

Formula 2:

$$Z = \frac{x - \mu}{\delta}$$

4) finally tallying the SNP density of the samples located in LOH regions and, according to the 3σ rule, choosing the value μ-3σ as the SNP density threshold, so that regions with a density greater than 1.25 are the final LOH regions, where μ is the mean SNP density and σ is the standard deviation of the SNP density;

5) substituting the homozygosity rate of the sample to be tested in the suspected LOH regions of step 2) into Formula 2 to calculate the homozygosity-rate Z value of the sample to be tested, screening with Z-Score>2.56 and SNP density greater than 1.25 as the baseline, and determining the regions above the baseline to be the final LOH regions.
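The final screening of step 5) combines both cutoffs, which can be sketched as follows. The record layout (dictionaries with `hom_rate` and `snp_density` keys), the example values and the function name `final_regions` are assumptions made for illustration; μ and δ are taken to be the reference statistics from step 3).

```python
Z_CUTOFF = 2.56        # Z-Score cutoff from the text
DENSITY_CUTOFF = 1.25  # SNP density cutoff from the text


def final_regions(candidates, mu, delta):
    """Keep suspected regions whose homozygosity-rate Z value and SNP
    density both exceed their cutoffs; the Z value is attached as 'z'."""
    kept = []
    for region in candidates:
        z = (region["hom_rate"] - mu) / delta
        if z > Z_CUTOFF and region["snp_density"] > DENSITY_CUTOFF:
            kept.append({**region, "z": z})
    return kept


# Invented candidates and reference statistics, for illustration only.
candidates = [
    {"name": "A", "hom_rate": 0.95, "snp_density": 2.0},  # passes both
    {"name": "B", "hom_rate": 0.95, "snp_density": 1.0},  # density too low
    {"name": "C", "hom_rate": 0.60, "snp_density": 2.0},  # Z value too low
]
kept = final_regions(candidates, mu=0.5, delta=0.1)
```

Only region "A" survives in this example: its Z value is (0.95 - 0.5) / 0.1 = 4.5 and its density exceeds 1.25.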

According to some embodiments of the present invention, the SNP density cutoff is derived from statistics on the SNP density of historical clinical samples in LOH regions; when the SNP density is greater than 1.25, the prediction of LOH regions is more accurate.

The solution of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the present invention and should not be regarded as limiting its scope.

Example 1: Construction of a neural network model for predicting LOH regions

(1) Obtaining variant sites

Whole exome sequencing data of 5 clinical samples were aligned with the human reference genome (hg19), and GATK was used to identify variants and output VCF files;

(2) Obtaining known LOH regions

The 5 clinical samples of step (1) were analyzed by chromosomal microarray (CMA), using the CytoScan 750K SNP microarray chip (SNP750K; 550,000 CNV markers and 200,000 single nucleotide polymorphism (SNP) markers) produced by Affymetrix, USA. Genomic DNA digestion, amplification, purification, fragmentation and labeling, chip hybridization, washing and scanning, and data analysis were carried out strictly according to the standard experimental protocol provided by Affymetrix to obtain the LOH segment coordinates of the 5 clinical samples. As shown in Table 1, a total of 10 LOH regions were identified, with a combined length of approximately 265.67MB.

Table 1

(3) Obtaining model features

First, from the VCF files of step (1), the number of SNPs, the number of homozygous SNPs and the SNP density in the window were obtained for each point using a sliding window of size 3.5MB and step 100KB. Then, starting from the 2MB position of each chromosome, the homozygosity rate (number of homozygous SNPs / total number of SNPs) in the sliding window was obtained for every point within the 2MB preceding the current point, giving a 1*20 two-dimensional matrix.

According to whether each point fell within a known LOH region from step (2), all points on the chromosome were divided into two classes (inside or outside an LOH region). Points inside an LOH region were labeled 1; all remaining points on the chromosome were labeled 0, denoting normal sites outside LOH regions. An N*20 matrix (N being the number of points inside/outside LOH regions) was constructed as the feature matrix.
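The 0/1 labeling described above can be sketched in Python as follows. The interval representation (inclusive start and end coordinates on a single chromosome) and the function name `label_points` are assumptions for illustration.

```python
def label_points(points, loh_intervals):
    """Return 1 for each point lying inside any known LOH interval
    (inclusive bounds) and 0 for every other point."""
    return [
        int(any(start <= point <= end for start, end in loh_intervals))
        for point in points
    ]


# Invented coordinates: one known LOH region from 2.5MB to 4.0MB.
points = [2_000_000, 3_000_000, 4_000_000, 5_000_000]
labels = label_points(points, [(2_500_000, 4_000_000)])
```

Pairing each label with the 1*20 row of the same point gives the labeled training rows of the feature matrix.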

(4) Building the RNN model

A recurrent neural network (RNN) model was constructed from the feature matrix. Its input layer is the 1*20 two-dimensional matrix, which passes through a hidden layer of 5 neurons to an output layer of one neuron; the output is a decimal value in [0,1].

During modeling, each site was encoded as 0 or 1 according to whether it lies in an LOH region: 1 if it does, 0 otherwise. The RNN model therefore outputs, for each point, a decimal value in [0,1]; the closer the value is to 1, the more likely the point is judged to lie in an LOH region.

The classic long short-term memory (LSTM) model was chosen as the RNN model, and the tanh function (Formula 1) was used for all activation functions;

Formula 1:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

(5) Determining suspected LOH regions

A region formed by consecutive sites spanning 2MB whose output values are all greater than 0.6 was taken as a suspected LOH region.

Example 2: Validation of the neural network model for predicting LOH regions

The validation used 6 test samples containing a total of 36 known LOH regions; the known LOH regions were obtained as described in step (2) of Example 1.

(1) Obtaining the VCF files of the 6 test samples

The 6 test samples were subjected to whole exome sequencing, the sequencing data were aligned with the human reference genome (hg19), and GATK was used to identify variants and output VCF files;

(2) Obtaining the input layer of the neural network model

From the VCF files of step (1), the number of SNPs, the number of homozygous SNPs and the SNP density in the window were obtained for each point using a sliding window of size 3.5MB and step 100KB. Then, starting from the 2MB position of each chromosome, the homozygosity rate (number of homozygous SNPs / total number of SNPs) in the sliding window was obtained for every point within the 2MB preceding the current point, giving a 1*20 two-dimensional matrix;

(3) Obtaining suspected LOH regions

The two-dimensional matrices of step (2) were fed into the neural network model for predicting LOH regions constructed in Example 1 to obtain the suspected LOH regions of the 6 test samples;

(4) Establishing the calibration model (Z-Score)

The 6 test samples of step (1) were replaced with 300 historical clinical samples for which WES data were available, and steps (1) to (3) were carried out to obtain the suspected LOH regions of the historical clinical samples.

For the suspected LOH regions of the 300 historical clinical samples, the mean μ and standard deviation δ of the homozygosity rate were calculated, as were the mean μ and standard deviation σ of the SNP density;

The homozygosity-rate Z value of a sample was obtained according to Formula 2, where x is the observed value for the individual, μ is the mean homozygosity rate of the overall data, and δ is the standard deviation of the homozygosity rate of the overall data;

Formula 2:

$$Z = \frac{x - \mu}{\delta}$$

According to the 3σ rule, the value μ-3σ was chosen as the SNP density threshold, where μ is the mean SNP density and σ is the standard deviation of the SNP density; regions with a density greater than 1.25 were taken as the final LOH regions.

After Z-Score standardization, regions in the top 0.5% of the standard normal distribution were retained; that is, regions with Z-Score>2.56 and SNP density greater than 1.25 were the final LOH regions;

(5) Screening of LOH regions

The homozygosity rates of the suspected LOH regions of the 6 test samples obtained in step (3) were substituted into Formula 2 of step (4) to obtain the Z-Scores of the 6 test samples. Regions with Z-Score>2.56 and SNP density greater than 1.25 were determined to be the final LOH regions.

The results show that, using the neural network model for predicting LOH regions provided by the present invention, 35 of the 36 known LOH regions in the 6 test samples were identified, an identification accuracy of 97%, with an average LOH region length error of no more than 1MB. Table 2 shows only 10 regions from 5 of the test samples, including the single region that was not identified.

Table 2

The present invention constructs a neural network model for predicting LOH regions that can identify LOH regions quickly and accurately from sequencing data. Integrated into sequencing-based testing products and products related to clinical LOH detection, the present invention can be used for the genetic diagnosis of diseases related to imprinted genes, can locate LOH regions quickly, improves interpretation efficiency, and reduces the turnaround time and cost of orthogonal validation tests.

In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. A person of ordinary skill in the art may change, modify, substitute and vary the above embodiments within the scope of the present invention.

Claims

1. A method for constructing a neural network model for predicting LOH regions, characterized in that the method comprises:

(A) obtaining sequencing data of an LOH standard, aligning the sequencing data with a reference genome, and obtaining an alignment result carrying variation information after alignment, wherein the LOH standard is a sample with known LOH region location information;

(B) setting a sliding window on the information of each chromosome in the alignment result, taking one end of the sliding step in the sliding window as a point, and obtaining SNP information of each point in the sliding window during sliding;

(C) based on the SNP information of each point in the sliding window, counting the homozygosity rate of the points within 1.5 to 3 MB before each point in the sliding window, and constructing a two-dimensional matrix based on the homozygosity rate information;

(D) constructing a feature matrix according to whether the point in step (C) is within the LOH region of the standard;

(E) constructing a neural network model for predicting the LOH region based on the feature matrix, wherein the input layer of the neural network model for predicting the LOH region is the two-dimensional matrix, and the output layer is the probability value of the point in the LOH region.

2. The method according to claim 1, characterized in that in step (A), the LOH standard contains at least 3 LOH regions;

Optionally, each LOH region is no less than 5 MB in length;

Optionally, the total length of the LOH region contained in the LOH standard is not less than 15MB;

Optionally, the sequencing data of the LOH standard includes any one of whole exome sequencing data, panel sequencing data, and whole genome sequencing data;

Optionally, the reference genome includes any one of CHM13, hg19, hg38, GRCh37, GRCh38, b37, hs37d5;

Optionally, the alignment result is obtained using a variation detection tool;

Optionally, the variation detection tool comprises at least one of GATK, Samtools, and DeepVariant.

3. The method according to claim 1, characterized in that in step (B), the sliding step size is 50-150 kb;

Optionally, the sliding step size is 100 kb;

Optionally, the sliding window size is 2-5MB;

Optionally, the sliding window size is 3.5MB;

Optionally, the SNP information includes the number of SNPs, the number of homozygous SNPs and the SNP density.

4. The method according to claim 1, characterized in that in step (C), the homozygosity rate of points within 2MB before each point in the sliding window is counted.

5. The method according to claim 1, characterized in that the neural network model comprises any one of a recurrent neural network model, a convolutional neural network model, and a radial basis neural network model;

Optionally, the neural network model is a recurrent neural network model, and the recurrent neural network model includes any one of a long short-term memory model, a bidirectional long short-term memory model, and a gated recurrent unit model;

Optionally, the neural network model has no less than 3 hidden layers;

Optionally, the neural network model activation function includes any one of a tanh function, a Sigmoid function, and a ReLU function.

6. A method for predicting LOH regions in a sample to be tested, characterized in that the method comprises:

(1) obtaining sequencing data of a sample to be tested, aligning the sequencing data with a reference genome, and obtaining an alignment result carrying variation information after alignment;

(2) setting a sliding window on the information of each chromosome in the alignment result, taking one end of the sliding step in the sliding window as a point, and obtaining SNP information of each point in the sliding window during sliding;

(3) based on the SNP information of each point in the sliding window, counting the homozygosity rate of the points within 1.5 to 3 MB before each point in the sliding window, and constructing a two-dimensional matrix based on the homozygosity rate information;

(4) using the two-dimensional matrix of step (3) as an input layer, inputting it into a neural network model for predicting LOH regions constructed by the method of any one of claims 1 to 5, and obtaining an output result, wherein the output result is the probability value that a point lies in an LOH region, and each probability value corresponds to one point in the sliding window of step (3);

(5) counting the output results corresponding to continuous points spanning at least 2MB in the sliding window, wherein when every output probability value corresponding to the continuous points is greater than 0.6, the region formed by the continuous points is determined to be an LOH region of the sample to be tested.
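The region determination of step (5) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function name `call_loh_regions`, its parameters, and the reading of "continuous points within at least 2MB" as "the distance between the first and last point of an unbroken run is at least 2,000,000 bp" are assumptions made for illustration, and the points are assumed to come from a single chromosome in ascending order.

```python
from typing import List, Tuple


def call_loh_regions(
    positions: List[int],
    probs: List[float],
    prob_cutoff: float = 0.6,
    min_span_bp: int = 2_000_000,
) -> List[Tuple[int, int]]:
    """Return (start, end) spans of consecutive points whose probabilities
    all exceed prob_cutoff and whose span is at least min_span_bp.

    Assumption: positions are sorted and belong to one chromosome.
    """
    regions: List[Tuple[int, int]] = []
    run_start = None  # first position of the current run above the cutoff
    prev_pos = None   # last position seen inside the current run
    for pos, p in zip(positions, probs):
        if p > prob_cutoff:
            if run_start is None:
                run_start = pos
            prev_pos = pos
        else:
            # The run is broken: keep it only if it is long enough.
            if run_start is not None and prev_pos - run_start >= min_span_bp:
                regions.append((run_start, prev_pos))
            run_start = None
    # Close a run that reaches the last point.
    if run_start is not None and prev_pos - run_start >= min_span_bp:
        regions.append((run_start, prev_pos))
    return regions
```

Because the comparison is strict, a point whose probability equals 0.6 breaks a run, matching the "greater than 0.6" wording of step (5).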

7. The method according to claim 6, characterized in that in step (1), the sequencing data comprises any one of whole exome sequencing data, panel sequencing data, and whole genome sequencing data;

Optionally, the alignment result is obtained using a variation detection tool.

8. The method according to claim 6, characterized in that in step (2), the sliding step size is 50-150 kb;

Optionally, the sliding step size is 100 kb;

Optionally, the sliding window size is 2-5MB;

Optionally, the sliding window size is 3.5MB.
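As an illustration of the window and step sizes in claim 8, the following sketch lists the points inside one sliding window. The helper name `window_points` is hypothetical, and taking the downstream end of each step as the point is an assumption, since the claim only says "one end". How successive windows are placed along a chromosome is not specified here and is not modelled.

```python
from typing import List


def window_points(
    window_start: int,
    window_size: int = 3_500_000,
    step: int = 100_000,
) -> List[int]:
    """List the points inside one window.

    Assumption: each point is the downstream end of one sliding step.
    """
    n_steps = window_size // step
    return [window_start + k * step for k in range(1, n_steps + 1)]
```

With the optional values of a 3.5MB window and a 100 kb step, one window contains 35 points under this reading.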

9. The method according to claim 6, characterized in that in step (3), the homozygosity rate of the points within 2MB before each point in the sliding window is counted.
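The homozygosity rate of step (3) can be sketched as follows, under stated assumptions: each SNP is given as a `(position, is_homozygous)` pair, the rate at a point is the fraction of homozygous SNPs in the half-open interval `[point - 2MB, point)`, and a point with no SNPs in that interval yields `None`. The function name and this data layout are illustrative and are not taken from the application.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def homozygosity_rates(
    snps: Sequence[Tuple[int, bool]],
    points: Sequence[int],
    lookback_bp: int = 2_000_000,
) -> List[Optional[float]]:
    """Fraction of homozygous SNPs in [point - lookback_bp, point) per point."""
    ordered = sorted(snps)
    positions = [pos for pos, _ in ordered]
    # hom_prefix[i] = number of homozygous SNPs among the first i SNPs.
    hom_prefix = [0] + list(accumulate(1 if hom else 0 for _, hom in ordered))
    rates: List[Optional[float]] = []
    for point in points:
        lo = bisect_left(positions, point - lookback_bp)
        hi = bisect_left(positions, point)
        total = hi - lo
        if total == 0:
            rates.append(None)  # no SNPs in the look-back interval
        else:
            rates.append((hom_prefix[hi] - hom_prefix[lo]) / total)
    return rates
```

The prefix sum keeps each lookup logarithmic in the number of SNPs, which matters when the same SNP list is queried at every point of every window.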

10. The method according to claim 6, characterized in that the method further comprises: establishing a calibration model, obtaining a Z-score threshold based on the calibration model, and screening the LOH region of the sample to be tested output in step (4) based on the Z-score threshold and an SNP density threshold to determine the final LOH region of the sample to be tested.

11. The method according to claim 10, characterized in that the threshold of the Z-score is obtained by the following method:

1) replacing the sequencing data of the sample to be tested in step (1) with historical data of clinical samples, and obtaining SNP information and LOH regions of the clinical samples through steps (1) to (4), wherein the historical data includes whole exome sequencing data of clinical samples;

2) calculating the mean μ and standard deviation δ of the homozygosity rate of the clinical samples in the LOH regions, and obtaining the homozygosity rate Z-Score of each sample according to the following formula:

Z-Score = (X - μ)/δ,

wherein X is the homozygosity rate of the LOH region of a single clinical sample, μ is the mean of the homozygosity rate over all the data, and δ is the standard deviation of the homozygosity rate over all the data;

3) based on the historical data of multiple clinical samples, retaining the upper 0.5% tail area of the standard normal distribution, wherein the corresponding Z-Score value is the threshold of the Z-score.
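The Z-Score calculation and threshold of claim 11 can be illustrated with Python's standard library. This is a sketch under assumptions: the function names are hypothetical, the population standard deviation is used for δ (the claim does not say whether the population or sample form is meant), and the retained 0.5% is read as the one-sided upper tail of the standard normal distribution.

```python
from statistics import NormalDist, mean, pstdev
from typing import List, Sequence


def homozygosity_z_scores(rates: Sequence[float]) -> List[float]:
    """Z-Score = (X - mu) / delta for each sample's LOH homozygosity rate.

    Assumption: delta is the population standard deviation.
    """
    mu = mean(rates)
    delta = pstdev(rates)
    return [(x - mu) / delta for x in rates]


def z_score_threshold(tail_area: float = 0.005) -> float:
    """Z value that leaves tail_area in the upper tail of N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - tail_area)
```

Under this one-sided reading the threshold evaluates to roughly 2.58, which is close to, but not identical with, the value of 2.56 recited in claim 14.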

12. The method according to claim 10, characterized in that the density threshold of the SNP is obtained by the following method:

4) replacing the sequencing data of the sample to be tested in step (1) with the historical data of the clinical sample, and obtaining the SNP information and LOH region of the clinical sample through steps (1) to (4), wherein the historical data includes the whole exome sequencing data of the clinical sample;

5) calculating the mean μ and standard deviation σ of the SNP density of the clinical samples in the LOH regions, and selecting the value of μ-3σ as the threshold of the SNP density according to the 3σ rule.
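The μ-3σ rule of claim 12 reduces to a single expression. The sketch below assumes the population standard deviation and uses a hypothetical function name. The unit of SNP density is not stated in this section, so the inputs are treated as plain numbers.

```python
from statistics import mean, pstdev
from typing import Sequence


def snp_density_threshold(densities: Sequence[float]) -> float:
    """Lower bound mu - 3*sigma over the SNP densities of known LOH regions.

    Assumption: sigma is the population standard deviation.
    """
    return mean(densities) - 3.0 * pstdev(densities)
```

For example, densities of 9, 10 and 11 give a mean of 10 and a population standard deviation of about 0.816, hence a threshold of about 7.55.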

13. The method according to claim 11 or 12, characterized in that the number of clinical samples is not less than 50;

Optionally, the number of clinical samples is no less than 200.

14. The method according to any one of claims 10 to 12, characterized in that the screening criteria for determining the final LOH region of the sample to be tested are:

The Z-Score is not less than 2.56 and the SNP density is not less than 1.25.
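The screening criteria of claim 14 can be expressed as a simple filter. The `LohCandidate` record and its field names are hypothetical. Only the two cut-off values, 2.56 and 1.25, come from the claim, and the SNP density is assumed to be in the same (unstated) unit used when the threshold was derived.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LohCandidate:
    """Hypothetical record for one candidate LOH region."""
    chrom: str
    start: int
    end: int
    z_score: float
    snp_density: float


def screen_candidates(
    candidates: Sequence[LohCandidate],
    z_min: float = 2.56,
    density_min: float = 1.25,
) -> List[LohCandidate]:
    """Keep regions whose Z-Score and SNP density are both not less than the cut-offs."""
    return [
        c for c in candidates
        if c.z_score >= z_min and c.snp_density >= density_min
    ]
```

Both comparisons are inclusive, matching the "not less than" wording of the claim.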