CN112735514B - Training and visualization method and system for extracting regulatory DNA combination patterns with a neural network

Info

Publication number: CN112735514B
Application number: CN202110063192.0A
Authority: CN
Inventors: 汪小我; 魏征
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-01-18
Filing date: 2021-01-18
Publication date: 2022-09-16
Anticipated expiration: 2041-01-18
Also published as: CN112735514A

Abstract

The invention discloses a training and visualization method and system for extracting regulatory DNA combination patterns with a neural network. The method includes: obtaining DNA sequences with a specific function and DNA sequences without the specific function; labeling the two kinds of DNA sequences and representing them with one-hot encoding; building a convolutional neural network, using the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and training the network so that it recognizes the DNA sequences; and decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, which are represented and stored with a regulatory element syntax tree. The method provides NeuronMotif, a general neural network interpretation algorithm that decouples convolutional neural networks in order to discover and visualize the patterns they recognize.

Description

Training and visualization method and system for extracting regulatory DNA combination patterns with a neural network

Technical Field

The present invention relates to the technical field of gene regulation, and in particular to a training and visualization method and system for extracting regulatory DNA combination patterns with a neural network.

Background

Gene expression and its regulation determine cell growth and differentiation. Controlling the transcriptional regulation of a gene controls, to a certain extent, its expression level and hence the state of the cell. In transcriptional regulation, the logic by which regulatory elements are combined and arranged on genomic DNA is one of the most critical factors. In gene editing and engineering applications, multiple regulatory elements can be designed and adjusted for a required gene function according to their base preferences, spacing and positions, order, and number of occurrences, so as to control the transcription level of the gene. Such complex regulatory modules and logic are difficult to extract and represent with current shallow machine learning methods and models. Deep learning models excel at many genome functional annotation tasks thanks to their strong representational capacity and automatic feature extraction, but the gene regulatory element combination modules that they learn are difficult to interpret and extract.

In recent years, much work has studied how to extract gene regulatory element combination modules from neural networks. Some progress has been made, but the problem remains unsolved. Current approaches to interpreting neural networks for DNA sequence prediction share essentially the same idea: they study the relationship between the input bases and the output of a neuron. The methods are mostly adapted from computer vision, and can also be applied to visualizing neural networks in computer vision or other fields. They fall into three broad categories: (1) perturbing the input and observing the change in the output; (2) gradient backpropagation; (3) characterizing the distribution of input sequences that maximize activation. These methods explain neural networks to some degree, but all of them overlook the fact that a neural network is a mixture model, and none attempts to open the black box to address this.

DeepSEA is a typical representative of the input-perturbation approach. Its advantage is that it is the simplest and most direct, and easy to understand: if an input base is changed and the neuron's output does not change, the base is not critical; otherwise, the base is important. Its main drawback is the computational cost, because the number of combinations of changes across base positions grows exponentially with the length of the DNA sequence. The approach is mostly used to study single-nucleotide polymorphisms, where the concern is the functional effect of mutations at a few sites in a sequence rather than at all base positions, so it largely meets users' needs there. However, such analysis does not reveal the full picture of what the network has learned, and most interpretation work is not limited to this method, so it is not especially widely used.
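The input-perturbation idea can be illustrated with a minimal Python sketch. It is not DeepSEA's implementation: `toy_score` is a hypothetical stand-in for a trained model's output, and `single_base_effects` is an illustrative helper name.

```python
BASES = "ACGT"

def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts 'TATA' occurrences."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))

def single_base_effects(seq, score=toy_score):
    """Return {(position, new_base): change in score} for every single substitution."""
    reference = score(seq)
    effects = {}
    for pos, old in enumerate(seq):
        for new in BASES:
            if new != old:
                mutant = seq[:pos] + new + seq[pos + 1:]
                effects[(pos, new)] = score(mutant) - reference
    return effects

effects = single_base_effects("GGTATAGG")
```

Scanning single substitutions needs 3L model evaluations for a sequence of length L, whereas enumerating every combination of changes would need 4^L, which is the exponential growth noted above.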

The other two categories borrow methods that have become common in image analysis in recent years, and can be used to assess the importance of every base in each sample. Both rely on gradient backpropagation, but use it differently. Saliency maps and DeepLIFT are typical backpropagation-based interpretation methods: they use the partial derivative of a neuron's output with respect to its input, or a similar variant, as the importance score of each input position. Because the score can be computed conveniently by backpropagation, the approach applies readily to any neuron. The user supplies a sequence of interest, feeds it into the network, runs one forward pass, and then backpropagates a gradient once to annotate the importance of each position in the sequence. The low computational cost makes the approach relatively widely used, but it has a number of problems. One is that it cannot directly compute a motif. A motif is a statistical probability distribution over all base positions across multiple sequences, whereas this approach only scores the positions of a single sequence and therefore carries no statistical meaning.

To meet this need, the group behind DeepLIFT developed TF-MoDISco. Its basic idea is to post-process the key subsequences of sequences of interest through matching and alignment, trimming, and clustering, and finally to merge the per-position importance scores of multiple sequences. However, importance scores at corresponding positions of different sequences are not comparable, and their relative magnitudes have no absolute meaning. The procedure also depends heavily on manual settings and its results are not particularly stable, so the "motifs" computed or discovered this way have not been widely adopted.

The third category, the distribution of input sequences that maximize activation, is motivated by the nature of the neural network itself. A neuron can influence the next layer only when it is activated, which means that the sequences it recognizes are those that activate it. Collecting these sequences therefore allows a PPM (position probability matrix) and a PWM (position weight matrix) to be computed from the set. Many problems remain, however. For example, no one has given a sound justification for how the activation threshold for these sequences should be chosen. In the interpretation of the Basset model, the authors explained the motifs learned by first-layer neurons by scanning all samples with the corresponding convolution kernel, taking the sequences whose activation exceeded half of the maximum observed value as the activated set, computing a PWM from that set, and drawing the corresponding WebLogo. The resulting motifs were satisfactorily similar to motifs in standard databases, but the choice of half the maximum as the threshold was not explained, and other work has similar issues. Although Basset gave good results for first-layer neurons, little work so far has used this approach to plausibly determine what motifs neurons in the second layer and above have learned, which suggests that the approach may not apply directly at the second or deeper layers.
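The thresholding and matrix computation described here can be sketched as follows. This is an illustration, not Basset's code: the function names, the toy sequences and activations, the uniform background of 0.25, and the pseudocount are all assumptions, and the log-odds form is only one common convention for a PWM.

```python
import math

BASES = "ACGT"

def select_activated(seqs, activations, fraction=0.5):
    """Keep sequences whose activation exceeds `fraction` of the maximum activation."""
    cutoff = fraction * max(activations)
    return [s for s, a in zip(seqs, activations) if a > cutoff]

def ppm(seqs):
    """Position probability matrix: one {base: probability} dict per position."""
    matrix = []
    for pos in range(len(seqs[0])):
        column = [s[pos] for s in seqs]
        matrix.append({b: column.count(b) / len(seqs) for b in BASES})
    return matrix

def pwm(ppm_matrix, background=0.25, pseudo=1e-3):
    """Log-odds weights log2((p + pseudo) / background); `pseudo` avoids log(0)."""
    return [{b: math.log2((p + pseudo) / background) for b, p in col.items()}
            for col in ppm_matrix]

seqs = ["TATA", "TATT", "GGCC", "TACA"]   # toy equal-length subsequences
acts = [4.0, 3.0, 0.5, 2.5]               # toy activation values
activated = select_activated(seqs, acts)  # threshold at half of the maximum
motif = ppm(activated)
weights = pwm(motif)
```

Changing `fraction` changes which sequences enter the set and therefore the resulting motif, which is why an unexplained threshold is a weakness of this approach.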

Taken together, current methods for extracting gene regulatory element combination modules from neural networks have reached a bottleneck, and a better method is needed to extract the modules that a neural network has learned.

Summary of the Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, one object of the present invention is to provide a training and visualization method for extracting regulatory DNA combination patterns with a neural network. The method provides NeuronMotif, a general interpretation algorithm that decouples neural networks. NeuronMotif can decouple a convolutional neural network model that annotates whether DNA has a specific function, and discover and visualize the gene regulatory element combination modules the model recognizes. The algorithm can also be used to discover and visualize the patterns recognized by any convolutional neural network applied to other problems or fields.

Another object of the present invention is to provide a training and visualization system for extracting regulatory DNA combination patterns with a neural network.

To achieve the above objects, an embodiment of one aspect of the present invention provides a training and visualization method for extracting regulatory DNA combination patterns with a neural network, including:

S1: obtaining DNA sequences with a specific function and DNA sequences without the specific function;

S2: labeling the two kinds of DNA sequences, and representing the DNA sequences with the specific function and the DNA sequences without the specific function using one-hot encoding;

S3: building a convolutional neural network, using the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and training the network so that it recognizes the DNA sequences;

S4: decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree.

The training and visualization method of the embodiment of the present invention obtains DNA sequences with a specific function and DNA sequences without it; labels the two kinds of DNA sequences and represents them with one-hot encoding; builds a convolutional neural network, uses the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and trains the network so that it recognizes the DNA sequences; and designs and applies the NeuronMotif algorithm to decouple the trained network, thereby discovering the motif and motif combination module corresponding to each neuron, obtaining the gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree. This provides a new approach and solution for extracting gene regulatory element combination modules from neural networks.

In addition, the training and visualization method according to the above embodiment of the present invention may have the following additional technical features:

Further, in one embodiment of the present invention, S1 further includes:

S11: cutting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a genome annotated by biological experiments.

S12: artificially synthesizing DNA sequence fragment molecules and performing any type of biological function validation experiment to determine which fragment molecules have the specific function and which do not.

Further, in one embodiment of the present invention, labeling the DNA sequences includes:

labeling the DNA sequences with the specific function as positive samples and the DNA sequences without the specific function as negative samples.

Further, in one embodiment of the present invention, S4 further includes:

S41: for a neuron in the convolutional neural network, collecting a new set of DNA sequences in which different DNA sequences produce neuron activation values of various magnitudes;

S42: for each DNA sequence in the new set, computing the activation values of all neurons in each layer of the network that can influence the neuron;

S43: partitioning the new set of DNA sequences into a plurality of DNA sequence subsets;

S44: computing the mathematical representation of the gene functional element combination module corresponding to each DNA sequence subset, and representing and storing the gene functional element combination modules with a regulatory element syntax tree.
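The text does not specify how the regulatory element syntax tree is laid out. As a purely hypothetical illustration of step S44, one possible representation is a nested node structure that is serialized to JSON for storage; the class `MotifNode` and its fields `name`, `ppm`, `offset` and `children` are assumptions made for this sketch, not the patent's format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class MotifNode:
    name: str
    ppm: list                 # one [pA, pC, pG, pT] row per position
    offset: int = 0           # hypothetical: start position relative to the parent
    children: list = field(default_factory=list)

# Two shallow-layer motifs combined into one deeper-layer module (toy values).
leaf_a = MotifNode("layer1_motif_a", [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
leaf_b = MotifNode("layer1_motif_b", [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]], offset=4)
root = MotifNode("layer2_module", [], children=[leaf_a, leaf_b])

stored = json.dumps(asdict(root))   # store the tree as text
restored = json.loads(stored)       # read it back as nested dicts
```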

Further, in one embodiment of the present invention, S41 further includes:

randomly generating DNA sequences according to the size of the neuron's receptive field, and optimizing them with a genetic algorithm whose objective is the neuron's activation value for the sequence. In the genetic algorithm, mutation positions are sampled with probabilities given by the magnitude of the gradient of the neuron's activation value with respect to the one-hot encoded input. In addition to crossover between DNA sequences, cyclic shifts are applied according to the pooling-layer structure of the neural network. The intermediate DNA sequences produced during optimization are sampled without duplicates, and the sampled sequences form a set of DNA sequences with various activation values.
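The sampling procedure of S41 can be sketched in plain Python under several stated assumptions: `toy_activation` is a hypothetical neuron, uniform mutation weights stand in for the gradient magnitudes that a real network would supply, and the shift range of at most `pool_size - 1` positions is one guess at how the pooling structure constrains the cyclic shift.

```python
import random

BASES = "ACGT"

def toy_activation(seq, motif="TGACTCA"):
    """Hypothetical neuron: best number of matches to `motif` over all offsets."""
    return max(sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
               for i in range(len(seq) - len(motif) + 1))

def mutate(seq, weights, rng):
    """Substitute one base at a position drawn with probability proportional to `weights`."""
    pos = rng.choices(range(len(seq)), weights=weights, k=1)[0]
    new = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def cyclic_shift(seq, max_shift, rng):
    k = rng.randrange(-max_shift, max_shift + 1) % len(seq)
    return seq[k:] + seq[:k]

def sample_sequences(length=12, pop_size=30, generations=20, pool_size=2,
                     seed=0, activation=toy_activation):
    rng = random.Random(seed)
    population = ["".join(rng.choice(BASES) for _ in range(length))
                  for _ in range(pop_size)]
    sampled = {}  # sequence -> activation; dict keys keep the sample free of duplicates
    for _ in range(generations):
        for s in population:
            sampled.setdefault(s, activation(s))
        parents = sorted(population, key=activation, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            # Assumption: uniform weights replace |d activation / d one-hot input|.
            child = mutate(child, [1.0] * length, rng)
            child = cyclic_shift(child, pool_size - 1, rng)
            children.append(child)
        population = parents + children
    return sampled

sampled = sample_sequences()
```

Recording every intermediate population, rather than only the final one, is what yields sequences spanning low to high activation values.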

Further, in one embodiment of the present invention, S43 further includes:

S431: for the new set of DNA sequences, examining the layers from the layer containing the neuron toward shallower layers; when a max-pooling layer with pooling size K is encountered, using the K-means algorithm to cluster the set into K classes according to the activation values, for each sequence, of the neurons in the layer immediately shallower than the pooling layer, each class being one DNA sequence subset;

S432: treating each DNA sequence subset as a new set of DNA sequences and continuing the examination from the layer where the clustering took place toward shallower layers; when a max-pooling layer with pooling size K is encountered, using the K-means algorithm to cluster the set into K classes according to the activation values, for each sequence, of the neurons in the layer immediately shallower than the pooling layer, each class being one DNA sequence subset;

S433: repeating step S432 until the first layer is reached, to obtain the plurality of DNA sequence subsets.
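Steps S431 to S433 amount to a recursive split, which the following sketch illustrates with a small pure-Python K-means. The representation of the network as a list of `(K, feature_fn)` pairs ordered from deep to shallow, the farthest-point initialisation, and the toy feature function `offset_matches` are assumptions made for illustration.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with farthest-point initialisation; returns one label per point."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def split_by_pooling(seqs, pool_layers):
    """pool_layers: list of (K, feature_fn), deep to shallow.
    feature_fn(seq) returns the activations feeding that max-pooling layer."""
    if not pool_layers or len(seqs) <= 1:
        return [seqs]
    k, feature_fn = pool_layers[0]
    k = min(k, len(seqs))
    labels = kmeans([tuple(feature_fn(s)) for s in seqs], k)
    subsets = []
    for j in range(k):
        group = [s for s, l in zip(seqs, labels) if l == j]
        if group:
            subsets.extend(split_by_pooling(group, pool_layers[1:]))
    return subsets

def offset_matches(seq, motif="TATA"):
    """Toy pre-pooling activations: motif match counts at offsets 0 and 1."""
    return [sum(a == b for a, b in zip(seq[o:o + len(motif)], motif)) for o in (0, 1)]

subsets = split_by_pooling(["TATAG", "TATAC", "GTATA", "CTATA"], [(2, offset_matches)])
```

In the toy example the motif sits at offset 0 in two sequences and at offset 1 in the other two. A max-pooling window of size 2 would report the same value for both cases, and clustering on the pre-pooling activations separates them again.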

Further, in one embodiment of the present invention, the gene functional element combination module is computed as E[E(X|Y)], where X is the random variable for the one-hot encoding of a sampled sequence and Y is the random variable for the activation value of that sequence. The relationship Y = f(X) is determined by the corresponding neuron. The distribution of Y must be specified: Y is the free variable, and X depends on Y.
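One way to read E[E(X|Y)] numerically is to bin the sampled sequences by activation value, average the one-hot encodings within each bin to estimate E(X|Y), and then average the bin results under a chosen distribution of Y. The sketch below does this under assumptions: equal-width bins, a uniform distribution over the non-empty bins by default, and illustrative function names.

```python
BASES = "ACGT"

def one_hot(seq):
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def mean_matrix(matrices, weights=None):
    """Weighted element-wise mean of equally shaped matrices (uniform weights by default)."""
    n = len(matrices)
    weights = weights or [1.0 / n] * n
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, matrices)) for c in range(cols)]
            for r in range(rows)]

def expected_motif(seqs, activations, n_bins=2, y_weights=None):
    """Estimate E[E(X|Y)]; y_weights (one per non-empty bin) is the chosen distribution of Y."""
    lo, hi = min(activations), max(activations)
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for s, y in zip(seqs, activations):
        idx = min(int((y - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(one_hot(s))
    keys = sorted(bins)
    conditional = [mean_matrix(bins[k]) for k in keys]   # E(X | Y in bin)
    if y_weights is None:
        y_weights = [1.0 / len(keys)] * len(keys)        # assumed: uniform over bins
    return mean_matrix(conditional, y_weights)

m = expected_motif(["AA", "AA", "AA", "GC"], [1.0, 1.1, 1.2, 3.0])
```

In the toy data three of the four sequences have low activation. A plain average would give A a probability of 0.75 at the first position, whereas weighting the two activation bins equally gives 0.5, which shows how the chosen distribution of Y shapes the result.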

To achieve the above objects, an embodiment of another aspect of the present invention provides a training and visualization system for extracting regulatory DNA combination patterns with a neural network, including:

an acquisition module, configured to obtain DNA sequences with a specific function and DNA sequences without the specific function;

a labeling module, configured to label the two kinds of DNA sequences, and to represent the DNA sequences with the specific function and the DNA sequences without the specific function using one-hot encoding;

a training module, configured to build a convolutional neural network, use the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and train the network so that it recognizes the DNA sequences;

a decoupling module, configured to decouple the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and to represent and store them with a regulatory element syntax tree.

The training and visualization system of the embodiment of the present invention obtains DNA sequences with a specific function and DNA sequences without it; labels the two kinds of DNA sequences and represents them with one-hot encoding; builds a convolutional neural network, uses the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and trains the network so that it recognizes the DNA sequences; and designs and applies the NeuronMotif algorithm to decouple the trained network, thereby discovering the motif and motif combination module corresponding to each neuron, obtaining the gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree. This provides a new approach and solution for extracting gene regulatory element combination modules from neural networks.

In addition, the training and visualization system according to the above embodiment of the present invention may have the following additional technical features:

Further, in one embodiment of the present invention, obtaining DNA sequences with a specific function and DNA sequences without the specific function includes:

cutting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a genome annotated by biological experiments; or

artificially synthesizing DNA sequence fragment molecules and performing any type of biological function validation experiment to determine which fragment molecules have the specific function and which do not.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of a training and visualization method for extracting regulatory DNA combination patterns with a neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a PPM in mathematical form according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of transcription factor matching according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a syntax tree according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a training and visualization system for extracting regulatory DNA combination patterns with a neural network according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and should not be construed as limiting it.

The training and visualization method and system for extracting regulatory DNA combination patterns with a neural network according to embodiments of the present invention are described below with reference to the accompanying drawings.

The training and visualization method according to an embodiment of the present invention is described first.

FIG. 1 is a flowchart of a training and visualization method for extracting regulatory DNA combination patterns with a neural network according to an embodiment of the present invention.

As shown in FIG. 1, the training and visualization method includes the following steps:

Step S1: obtaining DNA sequences with a specific function and DNA sequences without the specific function.

Further, embodiments of the present invention provide two ways of collecting DNA sequences. In the first, DNA sequence fragments with and without the function are cut from a genome annotated by various biological experiments: for example, DNA sequences of open and closed chromatin regions annotated by ATAC-seq, or DNA sequences with and without nucleosome modifications or transcription factor binding sites annotated by ChIP-seq.

In the second, DNA sequence fragment molecules are artificially synthesized and subjected to any type of biological function validation experiment to determine which fragment molecules have the function and which do not. With SELEX, for example, designed DNA sequences are synthesized, and the sequences that bind a protein are separated from those that do not.

Step S2: labeling the two kinds of DNA sequences, and representing the DNA sequences with the specific function and the DNA sequences without the specific function using one-hot encoding.

Further, labeling the DNA sequences includes:

labeling fixed-length DNA sequences with the specific function as positive samples, with the numerical value 1, and DNA sequences without the specific function as negative samples, with the numerical value 0. Each DNA sequence may have several functions, so it may carry several labels, each indicating whether the sequence has the corresponding function.
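The encoding and labeling of step S2 can be sketched as follows. The column order A, C, G, T, the all-zero row for any other character, and the two-function label vectors are assumptions chosen for illustration.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 matrix with columns ordered A, C, G, T.
    Any other character (for example N) becomes an all-zero row."""
    return [[1 if b == base else 0 for base in BASES] for b in seq.upper()]

# Each sample carries one 0/1 label per function (two hypothetical functions here).
samples = [
    ("ACGT", [1, 0]),   # has the first function, lacks the second
    ("TTGA", [0, 0]),   # negative for both functions
]
encoded = [(one_hot(seq), labels) for seq, labels in samples]
```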

Step S3: building a convolutional neural network, using the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values to be fitted by the network output, and training the network so that it recognizes the DNA sequences.

Specifically, DNA sequences with the function are labeled as positive samples and DNA sequences without the function as negative samples, and the DNA sequences are represented with one-hot encoding. Any convolutional neural network may be built; its structure may include convolutional layers, pooling layers, fully connected layers, and so on. The input dimensions should match the DNA sequence length and the one-hot encoding format, and the output dimension should match the number of DNA functions. The one-hot encodings of the DNA sequences are used as input and the corresponding labels as the target values to be fitted by the network output, and the network is trained so that it identifies as accurately as possible whether a DNA sequence is a positive sample.
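The dimension matching described above can be illustrated with an untrained forward pass in plain Python. The layer sizes, the random weights and the sigmoid output are assumptions for this sketch; in practice a deep learning framework would build and train the network.

```python
import math
import random

def conv1d_relu(x, kernels, bias):
    """x: L x 4 one-hot matrix; kernels: list of (width x 4) filters. Valid convolution + ReLU."""
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for k, b in zip(kernels, bias):
            z = b + sum(k[i][c] * x[start + i][c] for i in range(width) for c in range(4))
            row.append(max(0.0, z))
        out.append(row)
    return out

def max_pool(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    return [[max(col) for col in zip(*x[i:i + size])]
            for i in range(0, len(x) - size + 1, size)]

def dense_sigmoid(x, weights, bias):
    """Flatten, then apply one fully connected layer with a sigmoid per output."""
    flat = [v for row in x for v in row]
    return [1 / (1 + math.exp(-(b + sum(w * v for w, v in zip(ws, flat)))))
            for ws, b in zip(weights, bias)]

rng = random.Random(0)
seq_len, n_filters, width, pool, n_functions = 8, 3, 3, 2, 2
kernels = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(width)]
           for _ in range(n_filters)]
flat_size = ((seq_len - width + 1) // pool) * n_filters   # size after conv and pooling
dense_w = [[rng.uniform(-1, 1) for _ in range(flat_size)] for _ in range(n_functions)]

x = [[1.0 if b == base else 0.0 for base in "ACGT"] for b in "ACGTTGCA"]
hidden = max_pool(conv1d_relu(x, kernels, [0.0] * n_filters), pool)
probs = dense_sigmoid(hidden, dense_w, [0.0] * n_functions)
```

The input is `seq_len` rows of four channels and the output has one value per function, matching the requirement that the input dimensions follow the sequence length and encoding and the output dimension follows the number of functions.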

Step S4: decoupling the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree.

NeuronMotif is a general algorithm for decoupling convolutional neural networks. It can decouple a convolutional neural network model that annotates whether DNA has a specific function, and discover and visualize the gene regulatory element combination modules the model recognizes. The algorithm can also be used to discover and visualize the patterns recognized by convolutional neural networks applied to other problems.

Further, S4 includes:

It can be understood that steps S41 to S44 must be carried out for every neuron in the convolutional neural network, proceeding from the shallow layers to the deep layers.

It can be understood that if the number of sampled DNA sequences is too large, the range up to the maximum activation value is divided into 20 or more activation intervals, DNA sequences are selected at random without repetition within each interval, and the unselected DNA sequence samples are discarded.

S431: for the new set of DNA sequences, examine the layers from deep to shallow, starting at the layer containing the neuron. When a max pooling layer is encountered, take its pooling size K and use the K-means algorithm to cluster the new set into K classes, using as features the activation values of the neurons in the layer just shallower than the pooling layer for each sequence in the set. Each class corresponds to one DNA sequence subset of the partition;

S433: repeat step S432 until the first layer is reached, obtaining a plurality of DNA sequence subsets.

Further, in one embodiment of the present invention, the gene functional element combination module is computed as E[E(X|Y)], where X is the random variable corresponding to the one-hot encoding of a sampled sequence and Y is the random variable representing the activation value of that sequence. The relationship Y = f(X) is determined by the neuron concerned. The distribution of Y is a free choice that must be specified, and X depends on Y. The recommended probability density function for Y is p(y) = 2y/(A*A), where A is the maximum activation value over all sequences in the DNA sequence set. E[E(X|Y)] is then the expected one-hot encoding of the sampled DNA sequences under the specified activation distribution. Computing E[E(X|Y)] for each DNA sequence subset of the partition yields the gene functional element combination modules represented by the neuron.
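Written out, the nested expectation with the recommended density takes the form below. The second line is a sketch of one way to approximate the integral, assuming the range [0, A] is split into N equal intervals, with V_i the mean activation of the sequences falling in interval i and PPM_i the position-wise mean of their one-hot encodings; because p(y) is proportional to y, each interval is weighted by its typical activation.

```latex
\begin{aligned}
E\big[E(X \mid Y)\big]
  &= \int_{0}^{A} E(X \mid Y = y)\, p(y)\, dy,
  \qquad p(y) = \frac{2y}{A^{2}},
  \qquad \int_{0}^{A} p(y)\, dy = 1, \\
  &\approx \frac{\sum_{i=1}^{N} V_i \,\mathrm{PPM}_i}{\sum_{i=1}^{N} V_i}.
\end{aligned}
```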

Across all of the gene functional element combination modules obtained, identify the set of functional element patterns they share, denoted {A, B, C, …}, and determine the length, order, and relative position of the functional elements in each combination module. In general the order of the elements is fixed across the combination modules (for example, always ABCABC), while the number of bases separating two adjacent elements can vary. If the distance between two adjacent elements is the same number of bases in all combination modules, that number is written between the two elements, as in A-6N-B, where 6N denotes a gap of 6 bases. If the distance varies, a range is given and set off with brackets, as in A-[6±2N]-B. Submodules whose function is relatively fixed or related can be marked with brackets, and modules can be nested within modules, for example [A-6N-[B-C]]-[A-6N-[B-C]]. The bracket structure directly gives the syntax tree of the functional module.

Specifically, the method for decoupling any single convolutional neuron in the neural network is as follows:

1) For the neuron, collect a new set of DNA sequences containing enough sequences across the full range of activation values.

DNA sequences are generated at random with a length set by the neuron's receptive field and are optimized with a genetic algorithm whose objective is the neuron's activation value for the sequence. In the genetic algorithm, mutation positions are sampled with probability given by the magnitude of the gradient of the neuron's activation with respect to the one-hot encoded input. In addition to crossover between DNA sequences, cyclic shifts are applied according to the structure of the network's pooling layers. The intermediate DNA sequences produced during optimization are sampled, with duplicates disallowed, to obtain a set of DNA sequences covering a range of activation values. If the set is too large, the range up to the maximum activation value is divided into 20 or more activation intervals, DNA sequences are selected at random without repetition within each interval, and the unselected samples are discarded.
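The interval-based thinning of an oversized sequence set could look like the following sketch. The function name `subsample_by_activation`, the `per_bin` quota, and the fixed random seed are assumptions made for illustration; the description only specifies 20 or more activation intervals and random selection without repetition within each interval.

```python
import random


def subsample_by_activation(seqs, activations, n_bins=20, per_bin=2, seed=0):
    """Keep at most `per_bin` distinct sequences from each activation interval."""
    rng = random.Random(seed)
    a_max = max(activations)
    bins = [[] for _ in range(n_bins)]
    seen = set()
    for seq, act in zip(seqs, activations):
        if seq in seen:          # duplicates are not allowed
            continue
        seen.add(seq)
        if a_max <= 0:
            idx = 0
        else:
            # Interval index in [0, n_bins - 1]; the maximum falls in the last one.
            idx = min(max(int(act / a_max * n_bins), 0), n_bins - 1)
        bins[idx].append(seq)
    kept = []
    for members in bins:
        # random.sample draws without replacement.
        kept.extend(rng.sample(members, min(per_bin, len(members))))
    return kept


# Hypothetical toy data: six sequences, two intervals, one survivor per interval.
toy_seqs = ["AAAA", "AAAC", "AACC", "ACCC", "CCCC", "CCCG"]
toy_acts = [0.1, 0.12, 0.5, 0.9, 1.0, 0.95]
kept = subsample_by_activation(toy_seqs, toy_acts, n_bins=2, per_bin=1)
```

Sampling per interval rather than from the whole pool keeps the rare high-activation sequences from being swamped by the many low-activation ones.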

2) For each DNA sequence in the new set, compute the activation values of all neurons in each layer of the network that can influence the neuron.

In the neuron's own layer and in each shallower layer, only some of the neurons have activations that affect the neuron. The activation values of all of these neurons are computed for every DNA sequence.
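To illustrate why only some neurons in each shallower layer matter, the sketch below computes how many consecutive positions at each level can influence a single neuron. It assumes a simple stack of one-dimensional convolution and pooling layers described by hypothetical (kernel size, stride) pairs, with no dilation; the function name and the layer sizes in the usage note are made up for this example.

```python
def upstream_window_sizes(layers):
    """Window widths that feed one neuron, from its own layer down to the input.

    `layers` lists (kernel_size, stride) pairs from the input side up to the
    layer that contains the neuron. Dilation is assumed to be 1.
    """
    width = 1
    sizes = [width]  # a single position in the neuron's own layer
    for kernel, stride in reversed(layers):
        # Standard receptive-field recurrence for stacked 1-D layers.
        width = (width - 1) * stride + kernel
        sizes.append(width)
    return sizes
```

For a hypothetical stack of a width-5 convolution, a size-2 max pooling with stride 2, and a width-3 convolution, `upstream_window_sizes([(5, 1), (2, 2), (3, 1)])` returns `[1, 3, 6, 10]`, so a neuron in the top layer depends on a 10-base window of the input.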

3) Partition the new set.

For the DNA sequences of the new set, examine the layers from deep to shallow, starting at the layer containing the neuron. When a max pooling layer is encountered, take its pooling size K and use the K-means algorithm to cluster the set into K classes, using as features the activation values of the neurons one layer shallower than the pooling layer for each sequence in the set. Each class is one DNA sequence subset of the partition. Each subset is then treated as a new set: starting from the layer where the clustering took place, the layers are again examined from deep to shallow, and at the next max pooling layer the subset is clustered into K classes in the same way. This process is repeated until the first layer is reached, yielding a large number of subsets.
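The recursive partition can be sketched as follows. The small `kmeans` helper with deterministic initialization is a simplification written for this example so that it runs without dependencies; a library implementation such as scikit-learn's `KMeans` could be substituted. The `pool_levels` data structure and both function names are assumptions, not part of the described method.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = []
    for p in points:  # deterministic init: the first k distinct points
        if list(p) not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def split_by_pooling(indices, pool_levels):
    """Recursively split sequence indices at each max pooling layer.

    `pool_levels` is ordered from deep to shallow; each entry is (K, features),
    where features[i] is the activation vector of sequence i in the layer just
    below that pooling layer.
    """
    if not pool_levels:
        return [indices]
    k, feats = pool_levels[0]
    assign = kmeans([feats[i] for i in indices], k)
    subsets = []
    for c in range(k):
        members = [i for i, a in zip(indices, assign) if a == c]
        if members:
            subsets.extend(split_by_pooling(members, pool_levels[1:]))
    return subsets
```

With two pooling layers of size K = 2, a set can be split into up to four subsets, one for each combination of pooled positions that dominates the activation.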

4) Finally, compute the mathematical representation of the gene functional element combination module corresponding to each subset.

The PPM is computed as E[E(X|Y)], that is, the expected one-hot encoding of the sampled sequences conditioned on the activation value. X is the random variable corresponding to the one-hot encoding of a sampled sequence and Y is the random variable representing the activation value of that sequence. The relationship Y = f(X) is determined by the neuron concerned. The distribution of Y is a free choice that must be specified, and X depends on Y. The recommended probability density function for Y is p(y) = 2y/(A*A), where A is the maximum activation value over all sequences in the DNA sequence set. E[E(X|Y)] is then the expected one-hot encoding of the sampled DNA sequences under the specified activation distribution. Computing E[E(X|Y)] for each DNA sequence subset of the partition yields the gene functional element combination modules represented by the neuron. The specific calculation is as follows:

For the DNA sequences in any one subset, compute the neuron's activation value for each sequence and take the maximum activation value A. Divide the range from 0 to A into N equal intervals, and for each interval i (i = 1, 2, …, N), that is, [A*(i-1)/N, A*i/N], do the following:

Find the sequences in the subset whose activation values lie in the interval [A*(i-1)/N, A*i/N];

Compute the mean Vi of the activation values of these sequences;

Average the one-hot encodings of these sequences position by position to obtain the mean matrix PPMi;

After the calculation has been completed for every interval, the genomic functional element module of the subset is computed as:

PPM = (PPM1*V1 + PPM2*V2 + … + PPMN*VN)/(V1 + V2 + … + VN); this PPM is the genomic functional element module corresponding to the subset. A PPM in this mathematical form can be drawn as a WebLogo diagram, as shown in Figure 2.
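The interval-wise weighting can be sketched in plain Python as below. The function name `weighted_ppm`, the default of 10 intervals, the half-open interval boundaries, and the skipping of empty intervals are assumptions made for this example; the description does not say how boundary values or empty intervals are handled.

```python
def weighted_ppm(one_hots, activations, n_bins=10):
    """Activation-weighted average of per-interval mean one-hot matrices.

    one_hots: list of L x 4 one-hot matrices, one per sequence.
    activations: neuron activation value of each sequence.
    """
    a_max = max(activations)
    if a_max <= 0:
        raise ValueError("maximum activation must be positive")
    length, width = len(one_hots[0]), len(one_hots[0][0])
    num = [[0.0] * width for _ in range(length)]
    den = 0.0
    for i in range(n_bins):
        lo = a_max * i / n_bins
        hi = a_max * (i + 1) / n_bins
        # Half-open intervals, except the last, which also takes the maximum.
        idx = [j for j, a in enumerate(activations)
               if lo <= a and (a < hi or i == n_bins - 1)]
        if not idx:
            continue  # empty intervals contribute nothing
        v_i = sum(activations[j] for j in idx) / len(idx)
        for r in range(length):
            for c in range(width):
                ppm_i = sum(one_hots[j][r][c] for j in idx) / len(idx)
                num[r][c] += v_i * ppm_i
        den += v_i
    return [[x / den for x in row] for row in num]


# Hypothetical toy subset: two one-base sequences, "A" and "C".
toy_ppm = weighted_ppm([[[1, 0, 0, 0]], [[0, 1, 0, 0]]], [1.0, 3.0], n_bins=2)
```

In the toy call the sequence with activation 3.0 receives three times the weight of the one with activation 1.0, so the single position comes out as 0.25 A and 0.75 C.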

5) From the PPMs computed for all DNA sequence subsets of the neuron, summarize and store the grammar.

In one specific example, the patterns can be matched against the motifs in a known database to obtain the basic elements corresponding to the patterns. Some of these elements are known and present in the database, while others are unknown and absent from it. As shown in Figure 3, the elements include CTCF, DDIT3::CEBPA, ZEB1, and an unknown transcription factor.

From these basic elements and their relative positions in the WebLogo diagram, the following relationship can be summarized: [CTCF-[6N]-DDIT3::CEBPA]-[59±1N]-[CTCF-[6N]-DDIT3::CEBPA]. The bracket structure yields the syntax tree representation shown in Figure 4.
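One way to turn such a bracketed string into a tree is sketched below. The nested-list output format, the `("gap", length, tolerance)` tuple, and the function name `parse_grammar` are assumptions for illustration; the description does not define a storage format. The sketch also assumes that element names contain no hyphens or brackets.

```python
import re

# Spacer tokens such as "6N" or "59±1N": gap length and optional tolerance.
SPACER = re.compile(r"^(\d+)(?:±(\d+))?N$")


def parse_grammar(text):
    """Parse a bracketed element grammar into nested lists (a syntax tree)."""
    tokens = re.findall(r"\[|\]|[^\[\]-]+", text)
    pos = 0

    def parse_group():
        nonlocal pos
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "[":
                items.append(parse_group())
            elif tok == "]":
                break
            else:
                m = SPACER.match(tok)
                if m:
                    items.append(("gap", int(m.group(1)), int(m.group(2) or 0)))
                else:
                    items.append(tok)
        # A bracket that only wraps a spacer collapses to the spacer itself.
        if len(items) == 1 and isinstance(items[0], tuple):
            return items[0]
        return items

    return parse_group()


tree = parse_grammar(
    "[CTCF-[6N]-DDIT3::CEBPA]-[59±1N]-[CTCF-[6N]-DDIT3::CEBPA]"
)
```

Each inner list is a submodule and each tuple is a spacer, so the nesting of the lists mirrors the nesting of the brackets.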

If the result is unsatisfactory, steps 3), 4), and 5) can be repeated on the existing sets until it is satisfactory. Applying the same procedure to every subset of every neuron yields a large number of genomic functional element modules expressed as PPMs.

It can be understood that the embodiments of the present invention provide NeuronMotif, a general algorithm for decoupling convolutional neural networks. It can decouple a convolutional neural network model that annotates whether DNA has a specific function, and it discovers and visualizes the gene regulatory element combination modules the model recognizes; it can also be used to discover and visualize the patterns recognized by convolutional neural networks applied to other problems. NeuronMotif first defines the mathematical and statistical form of the motif corresponding to a neuron. Each neuron is then treated as a latent variable model, and the sources and meanings of its latent variables are classified and analyzed. Based on this analysis and understanding of neural networks and neurons, NeuronMotif is designed to decouple the neuron mixture model and thereby uncover the motif and motif combination modules (expressed as PPMs) corresponding to each neuron, which are the representation of the gene regulatory element combination modules. This establishes a theoretical basis for extracting gene regulatory element combination modules from neural networks.

According to the training and visualization method for neural-network extraction of regulatory DNA combination patterns proposed in the embodiments of the present invention, DNA sequences with a specific function and DNA sequences without it are obtained; the two kinds of DNA sequences are labeled and represented with one-hot encoding; a convolutional neural network is built and trained, with the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values for the network output, so that the network can recognize DNA sequences; and the NeuronMotif algorithm is designed and used to decouple the trained network, uncovering the motif and motif combination modules corresponding to each neuron, obtaining gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree. This provides a new approach and scheme for extracting gene regulatory element combination modules from neural networks.

Next, the training and visualization system for neural-network extraction of regulatory DNA combination patterns proposed according to the embodiments of the present invention is described with reference to the accompanying drawings.

Figure 5 is a schematic structural diagram of a training and visualization system for neural-network extraction of regulatory DNA combination patterns according to one embodiment of the present invention.

As shown in Figure 5, the system includes an acquisition module 201, a labeling module 202, a training module 203, and a decoupling module 204.

The acquisition module 201 obtains DNA sequences with a specific function and DNA sequences without the specific function.

The labeling module 202 labels the two kinds of DNA sequences and represents the DNA sequences with the specific function and the DNA sequences without it using one-hot encoding.

The training module 203 builds a convolutional neural network and trains it, with the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values for the network output, so that the network can recognize DNA sequences.

The decoupling module 204 decouples the trained convolutional neural network with the NeuronMotif algorithm to obtain gene regulatory element combination modules, and represents and stores them using a regulatory element syntax tree.

Further, in one embodiment of the present invention, obtaining DNA sequences with a specific function and DNA sequences without the specific function includes:

extracting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a biological genome that has been annotated by biological experimental means; or

artificially synthesizing DNA sequence fragment molecules, performing any type of biological function verification experiment, and determining which fragment molecules have the specific function and which do not.

It should be noted that the foregoing explanation of the method embodiment also applies to the system of this embodiment and is not repeated here.

According to the training and visualization system for neural-network extraction of regulatory DNA combination patterns proposed in the embodiments of the present invention, DNA sequences with a specific function and DNA sequences without it are obtained; the two kinds of DNA sequences are labeled and represented with one-hot encoding; a convolutional neural network is built and trained, with the one-hot encodings of the labeled DNA sequences as input and the corresponding labels as the target values for the network output, so that the network can recognize DNA sequences; and the NeuronMotif algorithm is designed and used to decouple the trained network, uncovering the motif and motif combination modules corresponding to each neuron, obtaining gene regulatory element combination modules, and representing and storing them with a regulatory element syntax tree. This provides a new approach and scheme for extracting gene regulatory element combination modules from neural networks.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless expressly and specifically defined otherwise.

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.

Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims

1. A training and visualization method for neural-network extraction of regulatory DNA combination patterns, characterized by comprising the following steps:

S1, obtaining DNA sequences with a specific function and DNA sequences without the specific function;

S2, labeling the two kinds of DNA sequences, and representing the DNA sequences with the specific function and the DNA sequences without the specific function using one-hot encoding;

S3, building a convolutional neural network, taking the one-hot encoding of the labeled DNA sequences as input and the corresponding DNA sequence labels as the target values fitted by the output of the convolutional neural network, and training the convolutional neural network so that it recognizes DNA sequences;

S4, decoupling the trained convolutional neural network through the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them using a regulatory element syntax tree;

S41, for a neuron in the convolutional neural network, collecting a new set of DNA sequences, wherein different DNA sequences in the new set have neuron activation values of various magnitudes;

S42, computing, for the DNA sequences in the new set, the activation values of all neurons in each layer of the neural network that can affect the neuron;

S43, partitioning the new set of DNA sequences to obtain a plurality of DNA sequence subsets;

S44, computing the mathematical representation of the gene functional element combination module corresponding to each DNA sequence subset, and representing and storing it using a regulatory element syntax tree;

S431, for the new set of DNA sequences, examining the layers from deep to shallow starting at the layer containing the neuron, and, if a max pooling layer is encountered, using the K-means algorithm to cluster the new set into K classes according to the pooling size K and the activation value features of the neurons in the layer just shallower than the pooling layer for the sequences of the new set, wherein each class corresponds to one DNA sequence subset of the partition;

S432, treating each DNA sequence subset of the partition as a new set of DNA sequences, examining the layers from deep to shallow starting at the layer where the clustering took place, and, if a max pooling layer is encountered, using the K-means algorithm to cluster the new set into K classes according to the pooling size K and the activation value features of the neurons in the layer just shallower than the pooling layer for the sequences of the new set, wherein each class corresponds to one DNA sequence subset of the partition;

S433, repeating step S432 until the first layer is reached, to obtain a plurality of DNA sequence subsets.

2. The method of claim 1, wherein S1 further comprises:

S11, extracting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a biological genome annotated by biological experimental means.

3. The method of claim 1, wherein S1 further comprises:

S12, artificially synthesizing DNA sequence fragment molecules, carrying out any type of biological function verification experiment, and determining the fragment molecules with the specific function and the fragment molecules without the specific function.

4. The method of claim 1, wherein labeling the two kinds of DNA sequences comprises:

labeling the DNA sequence with the specific function as a positive sample, and labeling the DNA sequence without the specific function as a negative sample.

5. The method of claim 1, wherein S41 further comprises:

randomly generating DNA sequences according to the size of the neuron's receptive field and optimizing the DNA sequences using a genetic algorithm, wherein the optimization objective is the neuron activation value of the DNA sequence; mutations of the DNA sequence in the genetic algorithm are sampled with probabilities given by the magnitude of the gradient of the neuron activation value with respect to the one-hot encoded input of the DNA sequence; in addition to crossover of DNA sequences, cyclic shifts are performed according to the pooling layer structure of the neural network; and the intermediate DNA sequences produced by the genetic algorithm optimization are sampled without duplicates, the sampled DNA sequences forming a set of DNA sequences with various activation values.

6. The method according to claim 1, wherein the computational expression of the gene functional element combination module is E[E(X|Y)], where X is the random variable corresponding to the one-hot encoding of the sampled sequence, Y is the random variable represented by the activation value corresponding to the sampled sequence, and the relationship Y = f(X) between Y and X is determined by the corresponding neuron, where the distribution of the random variable Y needs to be given and is a free variable, and the random variable X depends on the random variable Y.

7. A training and visualization system for neural-network extraction of regulatory DNA combination patterns, characterized in that the system is used to implement the training and visualization method of any one of claims 1 to 5 and comprises:

an acquisition module for obtaining DNA sequences with a specific function and DNA sequences without the specific function;

a labeling module for labeling the two kinds of DNA sequences and representing the DNA sequences with the specific function and the DNA sequences without the specific function using one-hot encoding;

a training module for building a convolutional neural network, taking the one-hot encoding of the labeled DNA sequences as input and the corresponding DNA sequence labels as the target values fitted by the output of the convolutional neural network, and training the convolutional neural network so that it recognizes DNA sequences;

and a decoupling module for decoupling the trained convolutional neural network through the NeuronMotif algorithm to obtain gene regulatory element combination modules, and representing and storing them using a regulatory element syntax tree.

8. The system of claim 7, wherein obtaining DNA sequences with a specific function and DNA sequences without the specific function comprises:

extracting DNA sequence fragments with the specific function and DNA sequence fragments without the specific function from a biological genome annotated by biological experimental means; or

artificially synthesizing DNA sequence fragment molecules, performing any type of biological function verification experiment, and determining the fragment molecules with the specific function and the fragment molecules without the specific function.