CN111477271B

CN111477271B - MicroRNA prediction method based on supervised self-organizing mapping neural network

Info

Publication number: CN111477271B
Application number: CN201911284083.0A
Authority: CN
Inventors: 於东军; 阚雯雯
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2022-09-30
Anticipated expiration: 2039-12-13
Also published as: CN111477271A

Abstract

The invention discloses a microRNA prediction method based on a supervised self-organizing map neural network, comprising the following steps. Step 1: extract features based on the microRNA sequence, obtain the secondary-structure base-pairing information of the microRNA sequence under test using the RNAfold program, and combine the primary-sequence features and the secondary-structure features into a vectorized representation, giving the feature vector finally used for prediction. Step 2: receive the feature data at the input layer; the hidden self-organizing map layer learns the spatial distribution of the input data and maps between data, forming a new feature representation in the process. Step 3: the supervised output layer uses the new feature representation to compute the output classes and the associated errors, and propagates the errors backward to update the network weights. Step 4: set a threshold using cross-validation and obtain the classification result. The invention addresses the high cost and long turnaround of traditional biological prediction approaches and of existing machine learning prediction methods.

Description

MicroRNA prediction method based on supervised self-organizing map neural network

Technical field

The invention relates to the field of bioinformatics prediction of microRNA, and in particular to a microRNA prediction method based on a supervised self-organizing map neural network.

Background

Unlike coding RNA, which can be translated into protein, small RNA molecules about 20 bases long that cannot be translated into protein are collectively called noncoding RNA (ncRNA), which has new and not yet well understood functions in regulating gene expression. There are many types of noncoding RNA, including ribosomal RNA (rRNA), transfer RNA (tRNA), and microRNA (miRNA); they play important roles in a wide range of biological processes and have a major impact on many diseases, such as cancer. Three classes of small noncoding RNA are currently known and relatively well studied: siRNA, miRNA, and piRNA. These small noncoding RNAs can regulate at the transcriptional stage as well as post-transcriptionally, and play an extremely important role in the growth and development of organisms.

MicroRNA (miRNA) is a small, single-stranded noncoding RNA molecule (about 17 to 25 nt) that is widely present in eukaryotes. Converting a microRNA molecule from its initial state into a mature microRNA requires multiple processing steps. Studies have shown that microRNAs regulate protein-coding genes at the transcript level by binding to mRNA. Aberrant microRNA expression has been observed in many cancers and other disease states, which suggests that it is closely linked to these diseases. Studies have also reported that viral microRNAs are associated with human diseases, especially cancer: for example, Epstein-Barr virus is highly associated with gastric and nasopharyngeal cancer, hepatitis B and C viruses with liver cancer, and human papillomavirus with cervical cancer. It is therefore extremely important to distinguish real miRNAs from pseudo miRNAs among hairpin sequences with similar stem-loop structures, and in-depth study of microRNAs is also extremely important for microRNA-based disease treatment.

Common approaches to identifying microRNAs are molecular biology methods and bioinformatics methods. Computational prediction methods mainly include: homology search methods, which look for microRNAs homologous to known microRNAs; conservation-based methods built on comparative genomics; prediction methods that screen and score candidate fragments based on sequence and structural features; methods that incorporate target prediction, used mostly for plant miRNAs; and machine learning (ML) methods, including support vector machines (SVM), random forests (RF), hidden Markov models (HMM), naive Bayes (NB), and linear genetic programming (LGP).

Triplet-SVM (Xue C, et al: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005, 6:310.) uses a set of 32 sequence-structure features. MiPred (Ng KL, Mishra SK: De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics 2007, 23(11):1321-30.) uses a feature set that combines 32 sequence-structure features, the minimum free energy (MFE), and a stability measure (randfold). MicroPred (Batuwita R, Palade V: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009, 25(8):989-995.) is induced from a 21-feature subset of a set of 48 sequence and structural features. G²DE (Hsieh CH, Chang DTH, Hsueh CH, Wu CY, Oyang YJ: Predicting microRNA precursors with a generalized Gaussian components based density estimation algorithm. BMC Bioinformatics 2010, 11(Suppl 1):S52.) is induced from a 7-feature subset of the set of 48 sequence and structural features. Mirident (Liu X, He S, [...] G, Gong F, Chen R: Integrated sequence-structure motifs suffice to identify microRNA precursors. PLoS One 2012, 7.) uses a set of 1300 sequence-structure motif features. HuntMi (Gudyś A, Szcześniak MW, Sikora M, Makałowska I: HuntMi: an efficient and taxon-specific approach in pre-miRNA identification. BMC Bioinformatics 2013, 14:83.) uses a feature set that combines the features used in microPred, four features from the triplet-SVM feature set, and three others: the percentage of low-complexity regions detected in the sequence, the maximum length of an amino acid chain without stop codons, and the cumulative size of internal loops. ViralmiR (Kai-Yao Huang, Tzong-Yi Lee, Yu-Chuan Teng, Tzu-Hao Chang. ViralmiR: a support-vector-machine-based method for predicting viral microRNA precursors[C]. Asia-Pacific Bioinformatics Conference. Curran Associates, 2015:86-92.) is an SVM-based method for predicting viral microRNA precursors; it selects 54 features from previous studies and uses feature scores to evaluate their discriminative power. Liu et al. (Liu B, Fang L, Liu F, et al. iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach[J]. Journal of Biomolecular Structure and Dynamics, 2015(ahead-of-print):1-13.) use an SVM as the classifier and propose a new feature vector, called pseudo distance-pair composition (PseDPC), to preserve sequence-order information.

Studies based on existing evidence show that microRNAs are associated with various diseases in multiple ways. Traditional biological prediction methods often involve long experiments, high experimental cost, or other drawbacks, and cannot meet the strong demand of the post-genomic era.

Summary of the invention

To address the high cost and long turnaround of traditional biological prediction approaches and of existing machine learning prediction methods, the purpose of the present invention is to propose a microRNA prediction method based on a supervised self-organizing map network that combines microRNA sequence features and microRNA structural features.

To achieve this purpose, the technical scheme adopted by the present invention is as follows:

A microRNA prediction method based on a supervised self-organizing map neural network, comprising the following steps:

Step 1: Feature extraction. Extract features based on the microRNA sequence, use the RNAfold program to obtain the secondary-structure base-pairing information of the microRNA sequence under test, and combine the primary-sequence features and the secondary-structure features into a vectorized representation, giving the feature vector finally used for prediction;

Step 2: Receive the feature data at the input layer; the hidden self-organizing map layer learns the spatial distribution of the input data and maps between data, forming a new feature representation in the process;

Step 3: The supervised output layer uses the new feature representation generated in step 2 to compute the output classes and the associated errors, and propagates the errors backward to update the network weights;

Step 4: Using cross-validation, set the threshold and obtain the classification result.

Compared with the prior art, the present invention has the following notable advantages: (1) it is based on both sequence and structural features, extracting not only sequence features but also secondary-structure features of the microRNA, which enriches the feature sources and helps improve the prediction performance of the model; (2) it uses random down-sampling to draw a balanced subsample set from the original imbalanced sample set, which effectively addresses the class imbalance problem and helps improve prediction performance; (3) it uses a supervised self-organizing map neural network, exploiting the self-organizing map's ability to learn the distribution of patterns in the high-dimensional input space while preserving topological neighborhood information in the low-dimensional output space, to compute a new feature representation for subsequent processing, which improves prediction performance; (4) it uses cross-validation to check model performance and, rather than splitting the data with a threshold on a single label value, partitions samples with adjusted thresholds on the two output states, which improves prediction performance.

Description of drawings

Figure 1 is a schematic diagram of the prediction method based on a supervised self-organizing map neural network.

Figure 2 is a diagram of the codon position bias.

Detailed description

The microRNA prediction method of the present invention, based on a supervised self-organizing map network, comprises the following steps:

Step 1: Feature extraction. Convert the base sequence of the microRNA under test into a numerical form that a machine can process. For a microRNA base sequence, first extract the features that can be computed from the raw base sequence: sequence length, GC content, k-mer composition, and codon position bias. The k-mer composition is the frequency with which strings of length k occur in the sequence (the method uses k = 1, 2, 3), which yields 84 (4¹+4²+4³) k-mer composition features; the codon position bias describes how each base is distributed over the three positions of a triplet codon. Next, the RNAfold program gives the base-pairing information of the microRNA for its minimum-free-energy secondary structure. Each base in the microRNA is in one of two states, paired or unpaired, represented by specific symbols. From the base-pairing information, the frequencies of the 8 (2³) possible pairing patterns of any three adjacent nucleotides are counted and combined with the 4 possible identities of the middle nucleotide, which yields 32 (4×8) triplet structure features. The minimum free energy of the secondary structure obtained from RNAfold is also used as a feature, after normalization. In addition, the ratios of AU pairs, GC pairs, and GU pairs to the sequence length are calculated. Finally, all feature vectors are concatenated to give the overall feature vector used for prediction.

Step 2: Random down-sampling of the negative microRNA samples. The number of experimentally verified, labeled positive microRNA samples is far smaller than the number of negative microRNA samples, so a data set of this kind needs preprocessing. The negative microRNA samples are randomly down-sampled, and the resulting negative subset together with the positive sample set is used as the experimental data set. This addresses the imbalanced learning problem and keeps positive and negative samples balanced.
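A minimal sketch of the random down-sampling described in this step, in Python. The 1:1 ratio of negatives to positives, the fixed seed, and the function name `downsample_negatives` are assumptions made for illustration; the text only states that the resulting set should be balanced.

```python
import random

def downsample_negatives(positives, negatives, seed=0):
    """Return a shuffled, balanced list of (sample, label) pairs.

    Negatives are drawn without replacement until they match the number
    of positives (assumed 1:1 ratio; the source does not give a ratio).
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    kept = rng.sample(negatives, k)
    data = [(x, 1) for x in positives] + [(x, 0) for x in kept]
    rng.shuffle(data)
    return data
```

Fixing the seed keeps the drawn subset reproducible; repeating the draw with several seeds would show how sensitive the results are to the particular negative subset.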

Step 3: The extracted microRNA feature vectors are used as input. The self-organizing map layer, whose nodes are fully connected to the input nodes, learns the spatial distribution of the input. Each neuron in the self-organizing map is represented by a weight vector; by finding the best matching unit, the weight vectors of the best matching unit and its neighboring neurons are adjusted repeatedly, so that topological structure information is preserved in the low-dimensional output space and a new feature representation is obtained. The output nodes of the self-organizing map default to a 10×10 grid; other sizes can be set depending on the data.

Step 4: The supervised output layer uses the new feature representation output by the SOM layer to assign class labels and propagates the error backward; gradient descent updates the relevant weights from the derivatives, giving the trained model. Cross-validation is used to check model performance, and thresholding gives the binary classification result.
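The cross-validation mentioned in step 4 can be sketched as a plain k-fold split over sample indices. The fold count k = 5, the fixed seed, and the helper name `k_fold_indices` are assumptions; the text does not state how many folds are used.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each sample lands in exactly one test fold, so every sample is scored once by a model that did not see it during training.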

To make the technical content of the present invention clearer, specific embodiments are described below with reference to the accompanying drawings.

Figure 1 is a schematic diagram of the prediction method of the present invention. With reference to Figure 1, the steps of the microRNA prediction method based on a supervised self-organizing map neural network according to an embodiment of the present invention are as follows.

First, for each sequence in the microRNA sequence training set, extract the corresponding sequence features, use the RNAfold program to obtain the secondary-structure base-pairing information, and compute the corresponding secondary-structure features, all expressed as vectors. The features are then fused into the final feature vector, which is combined with the label of the sequence to form the machine learning sample for that sequence; in this way the microRNA sequence training set is converted into a microRNA sample training set. Second, in the three-layer supervised self-organizing map neural network, the SOM computes a new representation and the neurons of the supervised layer use it to compute the result: the forward pass computes the output and the associated error, and the backward pass uses gradient descent to propagate the error and update the network weights. Finally, cross-validation is used to check model performance; rather than splitting the data with a threshold on a single label value, samples are partitioned with adjusted thresholds on the two output states to obtain the classification result.

The process is described in more detail below with reference to the drawings.

Step 1: Feature extraction

For a microRNA base sequence, features that can be computed from the raw sequence are extracted: sequence length, GC content, k-mer composition, and codon position bias. The sequence length is the number of characters in the microRNA sequence, which is made up of bases (usually A, U, G, C). The GC content is the frequency of the bases G and C in the microRNA sequence. The k-mer composition is the frequency with which strings of length k occur in the sequence (the method uses k = 1, 2, 3), which yields 84 (4¹+4²+4³) k-mer composition features. The codon position bias describes how each base is distributed over the three positions of a triplet codon: for any base x∈{G,C,A,U}, x₁ is the number of times x occurs at positions 0, 3, 6, ..., x₂ is the number of times x occurs at positions 1, 4, 7, ..., and x₃ is the number of times x occurs at positions 2, 5, 8, .... The codon position bias is illustrated in Figure 2:

It can be calculated by the following formula:
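A minimal Python sketch of the sequence-derived features described in this step. The function names are illustrative, and the codon position bias is represented only by the raw counts x₁, x₂, x₃ per base, since those counts are all that the surrounding text defines.

```python
from itertools import product

BASES = "ACGU"

def gc_content(seq):
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_composition(seq, ks=(1, 2, 3)):
    """Frequencies of every k-mer for each k, in a fixed order.

    With ks = (1, 2, 3) this gives 4 + 16 + 64 = 84 values.
    """
    feats = []
    for k in ks:
        windows = max(len(seq) - k + 1, 1)
        counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:          # ignore windows with other characters
                counts[kmer] += 1
        feats.extend(c / windows for c in counts.values())
    return feats

def codon_position_counts(seq):
    """For each base x, the counts [x1, x2, x3] at positions i % 3 == 0, 1, 2."""
    counts = {b: [0, 0, 0] for b in BASES}
    for i, b in enumerate(seq):
        if b in counts:
            counts[b][i % 3] += 1
    return counts
```

For the sequence `GCAUGC`, for example, G occurs at positions 0 and 4, so its counts are x₁ = 1, x₂ = 1, x₃ = 0.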

Next, the RNAfold program gives the base-pairing information of the minimum-free-energy secondary structure of the microRNA. Each base in the microRNA is in one of two states, paired or unpaired, represented by the symbols "(", ")" and ".". A left bracket "(" means that a nucleotide near the 5′ end can pair with another nucleotide near the 3′ end, and the corresponding right bracket ")" means that a nucleotide near the 3′ end can pair with another nucleotide near the 5′ end; in the present invention both cases can be written as "(". For any three adjacent nucleotides, the frequencies of the 8 (2³) possible pairing patterns are counted, namely "(((", "((.", "(.(", ".((", "(..", ".(.", "..(", and "..."; combined with the 4 possible identities of the middle nucleotide (A, U, G, C), this yields 32 (4×8) triplet structure features. The minimum free energy of the secondary structure obtained from RNAfold is also used as a feature, normalized as dG (normalized MFE). In addition, the ratio of AU pairs to sequence length |A-U|/L, the ratio of GC pairs to sequence length |G-C|/L, and the ratio of GU pairs to sequence length |G-U|/L are calculated. Finally, all feature vectors are concatenated to give the overall feature vector used for prediction.
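The structure-derived features can be sketched in Python as well, starting from a sequence and a dot-bracket string of the kind RNAfold reports. Running RNAfold itself is left out, the function names are illustrative, and the structure is assumed to be a well-formed dot-bracket string of the same length as the sequence.

```python
from itertools import product

def triplet_features(seq, structure):
    """32 frequencies: middle base (A, C, G, U) x pairing pattern of 3 adjacent positions."""
    s = structure.replace(")", "(")            # both brackets just mean "paired"
    keys = [b + "".join(p) for b in "ACGU" for p in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    n = len(seq) - 2                           # number of 3-nucleotide windows
    for i in range(n):
        key = seq[i + 1] + s[i:i + 3]
        if key in counts:
            counts[key] += 1
    return [counts[k] / max(n, 1) for k in keys]

def pair_ratios(seq, structure):
    """Return (|A-U|/L, |G-C|/L, |G-U|/L) by matching brackets with a stack."""
    stack = []
    pairs = {"AU": 0, "CG": 0, "GU": 0}        # keys are the two bases, sorted
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            i = stack.pop()
            pair = "".join(sorted(seq[i] + seq[j]))
            if pair in pairs:
                pairs[pair] += 1
    length = len(seq)
    return pairs["AU"] / length, pairs["CG"] / length, pairs["GU"] / length
```

Concatenating these lists with the sequence-derived features and the normalized MFE gives one flat feature vector per sequence.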

Step 2: Self-organizing map network layer

The self-organizing map (SOM) is an unsupervised learning model proposed by Kohonen. The training goal is to find a suitable weight vector for each output-layer neuron, so that the map adaptively learns how patterns are distributed in the high-dimensional input space while preserving the topological neighborhood information of the input data in the low-dimensional output space. Usually the output nodes of a SOM are arranged on a regular 2-D grid, and each output node is fully connected to the input nodes and corresponds to a prototype vector w_h ∈ R^m, where m is the dimension of the input space. Training proceeds as follows: on receiving a training sample x_i ∈ R^m, each of the H output-layer neurons of the SOM computes the distance between the sample and its own weight vector. The nearest neuron is called the best matching unit (BMU). The weight vectors of the BMU and its neighboring neurons are then adjusted to reduce their distance to the current input sample. This process is iterated until convergence. The BMU is given by BMU(x_i) = arg min_{1≤h≤H} ||w_h - x_i||, where ||·|| is the L2 norm.
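The BMU search is a nearest-prototype lookup. A minimal Python sketch, assuming the prototype vectors are stored as a flat list of equal-length lists (the storage layout and function names are illustrative):

```python
import math

def best_matching_unit(weights, x):
    """Index h of the prototype vector w_h closest to x in L2 distance."""
    return min(range(len(weights)), key=lambda h: math.dist(weights[h], x))

def grid_position(h, cols):
    """(row, column) of neuron h on a grid with `cols` columns, row-major order."""
    return divmod(h, cols)
```

On the default 10×10 grid, `grid_position(h, 10)` turns the flat index of the BMU into the grid coordinates needed for neighborhood distances.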

Let the training sample at step t be x_i. The BMU and its neighbors are updated by the following formula:

where α(t) is the learning rate, which decreases monotonically with t, and the neighborhood function defines the output nodes adjacent to the BMU.

In the neighborhood function, r is the radius of the map, T is the maximum number of iterations, and d(BMU(x_i), h) is the Manhattan distance between BMU(x_i), the best matching neuron for input x_i, and neuron h.
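As a rough illustration of one update step, the Python sketch below uses the textbook Kohonen rule, which moves each prototype toward the sample in proportion to the learning rate and a neighborhood weight. The linear decay of α(t), the Gaussian neighborhood over Manhattan grid distance, its shrinking width, and the default `alpha0` are all assumptions chosen for the sketch; they are not taken from the patent.

```python
import math

def manhattan(h1, h2, cols):
    """Manhattan distance between neurons h1 and h2 on a row-major grid."""
    r1, c1 = divmod(h1, cols)
    r2, c2 = divmod(h2, cols)
    return abs(r1 - r2) + abs(c1 - c2)

def som_step(weights, x, t, T, cols, r, alpha0=0.5):
    """One in-place update for sample x at iteration t (0 <= t < T); returns the BMU index."""
    bmu = min(range(len(weights)), key=lambda h: math.dist(weights[h], x))
    alpha = alpha0 * (1.0 - t / T)                 # assumed: linear decay of the learning rate
    sigma = max(r * (1.0 - t / T), 1e-6)           # assumed: neighborhood width shrinks with t
    for h, w in enumerate(weights):
        d = manhattan(bmu, h, cols)
        nb = math.exp(-(d * d) / (2.0 * sigma * sigma))   # assumed: Gaussian neighborhood
        for k in range(len(w)):
            w[k] += alpha * nb * (x[k] - w[k])     # pull prototype toward the sample
    return bmu
```

Because the neighborhood weight is 1 at the BMU and falls off with grid distance, nearby neurons follow the BMU toward the sample while distant ones barely move, which is what keeps the map topologically ordered.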

Step 3: Class prediction

The activations of the units are propagated forward through the network to obtain the output values, and the output values are compared with the input class labels to compute the error. To preserve the organization of the map, information from the current neuron and its neighbors is used.

d(u', u) is the Manhattan distance between neuron u' and neuron u, and α is a constant.

The hidden-layer neurons and the output-layer neurons are fully connected. Each of the two output values of the output layer is the sigmoid of the weighted sum of the activations of the H neurons of the hidden SOM layer, using the weights between those neurons and the output neuron, plus the output-layer bias. The sigmoid function is common in biology and is also widely used as an activation function in neural networks; it is defined as sig(x) = 1/(1+exp(-x)) and maps a real number to the interval (0, 1), which makes it suitable for binary classification.
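The forward pass of the output layer can be sketched in a few lines of Python. The layout `weights[j][h]` (output neuron j, SOM neuron h) and the function names are assumptions made for the sketch; the sigmoid is written in a numerically stable form but computes the same sig(x) = 1/(1+exp(-x)).

```python
import math

def sig(x):
    """Sigmoid 1 / (1 + exp(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def output_layer(activations, weights, biases):
    """One sigmoid output per output neuron.

    activations: the H values produced by the SOM layer
    weights[j][h]: weight from SOM neuron h to output neuron j
    biases[j]: bias of output neuron j
    """
    return [
        sig(sum(w * a for w, a in zip(w_j, activations)) + b_j)
        for w_j, b_j in zip(weights, biases)
    ]
```

With two rows in `weights` and two entries in `biases`, the function returns the two output values that the thresholding step works on.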

A loss function is also introduced to avoid overfitting and improve the generalization ability of the model. In the backward pass, the classic gradient descent optimization algorithm adjusts the output weights and the SOM weights: the derivatives of the loss function with respect to the relevant weights are computed, and the weights are updated along the direction of gradient descent. The present invention uses two outputs. Instead of thresholding a single label value to decide the class label of an input, samples are partitioned with adjusted thresholds on the two output states. Rather than assigning 0 or 1 according to the larger of the two output values, different thresholds are applied to the different classes so that the Matthews correlation coefficient of the result is maximized.
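One way to picture threshold selection by Matthews correlation coefficient (MCC) is a small grid search in Python. The decision rule in `predict`, the grid of candidate thresholds, and all function names are assumptions made for illustration; the text does not spell out how the two thresholds are combined.

```python
import math
from itertools import product

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def predict(outputs, t_pos, t_neg):
    """Assumed rule: positive if the positive output reaches t_pos
    and the negative output stays below t_neg."""
    return [1 if o_pos >= t_pos and o_neg < t_neg else 0 for o_pos, o_neg in outputs]

def best_thresholds(outputs, y_true, grid=None):
    """Pair (t_pos, t_neg) from the grid that maximizes MCC on the given data."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return max(product(grid, grid), key=lambda th: mcc(y_true, predict(outputs, *th)))
```

To avoid an optimistic estimate, the thresholds would be chosen on the training folds of the cross-validation and then applied unchanged to the held-out fold.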

Claims

1. A microRNA prediction method based on a supervised self-organizing map neural network, characterized by comprising the following steps:

step 1: feature extraction, namely extracting features based on the microRNA sequence, obtaining secondary-structure base-pairing information of the microRNA sequence under test by using the RNAfold program, and combining the features based on the primary sequence and the features based on the secondary structure into a vectorized representation to obtain the feature vector finally used for prediction;

step 2: receiving the feature data at the input layer, learning the spatial distribution of the input data by the self-organizing map of the hidden layer, and mapping between data, a new feature representation being formed in the process;

step 3: the supervised output layer computing the output classes and the associated errors by using the new feature representation generated in step 2, and propagating the errors backward to update the network weights;

step 4: setting a threshold by using cross-validation, and obtaining the classification result.

2. The microRNA prediction method based on a supervised self-organizing map neural network of claim 1, wherein: in step 1, for a microRNA sequence of length n, sequence-based features of the microRNA sequence are extracted, namely the sequence length, GC content, k-mer composition, and codon position bias.

3. The microRNA prediction method based on a supervised self-organizing map neural network of claim 1, wherein: in step 1, structure-related features of the microRNA sequence are extracted; the secondary-structure base-pairing information of the sequence is extracted using the RNAfold program, the pairing state of each base in the microRNA sequence being obtained from the minimum-free-energy secondary structure at the default temperature; from the base-pairing information, the pairing patterns of any 3 adjacent nucleotides are counted and 32 triplet structure features are extracted; from the ratio of base pairs to sequence length in the secondary structure, the ratio of AU pairs to n, the ratio of GC pairs to n, and the ratio of GU pairs to n are extracted; and thermodynamic features of the microRNA are extracted from the minimum free energy of the secondary structure.

4. The microRNA prediction method based on a supervised self-organizing map neural network of claim 1, wherein: in step 2, the extracted microRNA sequence feature vector is used as input; the self-organizing map neural network layer, which is fully connected to the input nodes, learns the spatial distribution of the input; each neuron in the self-organizing map is represented by a weight vector; by searching for the best matching unit, the weight vectors of the best matching unit and its neighboring neurons are adjusted repeatedly, topological structure information is preserved in the low-dimensional output space, and a new feature representation is obtained; and the output nodes of the self-organizing map default to a 10×10 grid.

5. The microRNA prediction method based on a supervised self-organizing map neural network of claim 1, wherein: in step 3, the two neurons of the output layer are fully connected to the self-organizing map layer; using the new feature representation output by the self-organizing map layer and the connection weights between the layers, the sample label value is computed; the error between the predicted label and the true label of the sample is propagated backward through the network; the weights of the neural network are updated by gradient descent; and the process is iterated multiple times.

6. The microRNA prediction method based on a supervised self-organizing map neural network of claim 1, wherein: in step 4, thresholding is used to decide whether a sequence is a microRNA, such that the Matthews correlation coefficient of the prediction result is maximized, and the classification result is obtained.