CN103218543B

CN103218543B - A kind of method and system distinguishing protein coding gene and Noncoding gene

Info

Publication number: CN103218543B
Application number: CN201310102224.9A
Authority: CN
Inventors: 赵屹; 孙亮; 罗海涛
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-03-27
Filing date: 2013-03-27
Publication date: 2016-04-13
Anticipated expiration: 2033-03-27
Also published as: CN103218543A

Abstract

The present invention provides a method and system for distinguishing protein-coding genes and non-coding genes, which can distinguish the characteristics of protein-coding genes and non-coding genes at the sequence level, and the characteristics do not depend on the known data of the species and do not require conservation information, and has a good judgment effect on long non-coding RNA. In addition to having a strong advantage in accuracy, it is easy to operate, does not require too much file dependence, and the processing time is significantly better than known methods.

Description

A method and system for distinguishing protein-coding genes from non-coding genes

技术领域technical field

本发明涉及生命科学领域，尤其涉及一种区分蛋白编码基因和非编码基因的方法及系统。The invention relates to the field of life sciences, in particular to a method and system for distinguishing protein-coding genes and non-coding genes.

背景技术Background technique

目前国际上主要有两中方法进行区分蛋白编码基因（以下简称编码基因）和非编码基因：At present, there are mainly two methods in the world to distinguish protein-coding genes (hereinafter referred to as coding genes) and non-coding genes:

CPC方法由北京大学生命科学学院开发，依靠预测基因的开放阅读框及已知蛋白库等信息来判定一条核酸序列为编码基因还是非编码基因。该方法太过依赖于开放阅读框的预测方法及已知数据库，对新基因的判定及长非编码基因的判定存在明显不足，且根据我们自身的测评显示，对于长非编码基因的判断准确率非常低。The CPC method was developed by the School of Life Sciences of Peking University. It relies on information such as the open reading frame of the predicted gene and the known protein library to determine whether a nucleic acid sequence is a coding gene or a non-coding gene. This method relies too much on the prediction method of the open reading frame and the known database, and there are obvious deficiencies in the judgment of new genes and long non-coding genes. According to our own evaluation, the accuracy of the judgment of long non-coding genes very low.

PhyloCSF是国际上近几年采用的一种方法，依靠多个物种序列比对信息得到保守型区域，根据待测序列的保守型强弱来判定是编码还是非编码序列。但是，由于很多物种根本没有全基因组序列，所以无法得到多物种序列比对信息。因此，对于很多物种无法衡量序列的保守性，进而无法判定编码与非编码能力。此外，长非编码基因内部有多个保守型的模块（子序列），因此仅仅依靠保守型区域来判定编码能力过于片面，我们自身对该方法的测评显示准确率也很低。PhyloCSF is a method adopted internationally in recent years. It relies on sequence comparison information of multiple species to obtain conserved regions, and judges whether it is a coding or non-coding sequence according to the strength of conservation of the sequence to be tested. However, since many species do not have full genome sequences at all, it is not possible to obtain multi-species sequence alignment information. Therefore, for many species, the conservation of the sequence cannot be measured, and thus the coding and non-coding capabilities cannot be determined. In addition, there are multiple conserved modules (subsequences) inside the long non-coding gene, so it is too one-sided to judge the coding ability only by the conserved region, and our own evaluation of this method shows that the accuracy rate is also very low.

发明内容Contents of the invention

为解决上述问题，本发明提供一种区分编码基因和非编码基因的方法及系统，其主要利用序列串连密码子对的频率统计准确的将编码序列和非编码序列以及编码区域按照其他五种读码方式所产生的序列区分出来，不依赖于物种已知的数据，不需要保守性信息，并且对长非编码RNA有很好的判断效果。In order to solve the above problems, the present invention provides a method and system for distinguishing coding genes from non-coding genes, which mainly uses the frequency statistics of sequence tandem codon pairs to accurately classify coding sequences, non-coding sequences and coding regions according to the other five types The sequence generated by the reading method is distinguished, does not depend on the known data of the species, does not require conservative information, and has a good judgment effect on long non-coding RNA.

为实现上述目的，本发明提供了一种区分编码基因和非编码基因的方法发明，该方法包括：In order to achieve the above object, the present invention provides a method invention for distinguishing coding genes and non-coding genes, the method comprising:

步骤1，将样本集按照编码和非编码序列分为正、负两个训练集合，对正、负两个训练集合分别执行步骤2至步骤4；Step 1, divide the sample set into positive and negative training sets according to the coding and non-coding sequences, and perform steps 2 to 4 for the positive and negative training sets respectively;

步骤2，在训练集合中统计出每个相邻核苷酸三聚体ANT在编码序列、非编码序列和基因间区域序列中的出现频率并分别构建出现频率矩阵，基于三个出现频率矩阵通过log2-ratio运算构建打分矩阵；Step 2: Count the occurrence frequency of each adjacent nucleotide trimer ANT in the coding sequence, non-coding sequence and intergenic region sequence in the training set and construct the occurrence frequency matrix respectively, based on the three occurrence frequency matrices by The log2-ratio operation constructs the scoring matrix;

步骤3，所述打分矩阵利用滑动窗口进行打分的方式计算出窗口分值S-score，以此作为分类模型的第一个特征，并使用动态规划算法分别找出由所述样本集中的编码序列和非编码序列所转换成的数组内的具有最大子段和的区域作为特征子序列MLCDS，并且以所述MLCDS的长度作为分类模型的第二个特征；Step 3, the scoring matrix calculates the window score S-score by using a sliding window to score, and uses this as the first feature of the classification model, and uses a dynamic programming algorithm to find out the coding sequences in the sample set The area with the maximum subsection sum in the array converted from and non-coding sequences is used as the feature subsequence MLCDS, and the length of the MLCDS is used as the second feature of the classification model;

步骤4，利用i∈(1，2,3,4,5,6)获取分类模型的第三个特征，其中X是读码方式中MLCDS的长度，Yi代表全部六种读码方式中各自的MLCDS长度；Step 4, use i∈(1, 2, 3, 4, 5, 6) obtains the third feature of the classification model, where X is the length of the MLCDS in the reading mode, and Yi represents the length of the MLCDS in all six reading modes;

利用j∈(1,2,3,4,5)获取分类模型的第四个特征，其中S是在核酸序列一共的六种读码方式中按照正确的读码方式提取出来的MLCDS的S-score，Ej代表剩下其他五种错误读码方式中提取出来的MLCDS的S-score；use j∈(1,2,3,4,5) obtains the fourth feature of the classification model, where S is the S-score of MLCDS extracted according to the correct reading method in the six reading methods of the nucleic acid sequence , Ej represents the S-score of MLCDS extracted from the remaining five error code reading methods;

利用单个核苷酸三聚体在编码和非编码区域的出现频率进行log2-ratio运算，获取核苷酸三聚体偏好性作为分类模型的第五个特征；Use the frequency of single nucleotide trimers in coding and non-coding regions to perform log2-ratio calculations to obtain nucleotide trimer preference as the fifth feature of the classification model;

步骤5，利用所述正、负两个训练集合的五个特征组成正负两个特征向量集合来训练分类模型，待区分序列利用所述分类模型进行预测得到区分结果。Step 5, using the five features of the positive and negative training sets to form positive and negative feature vector sets to train a classification model, and the sequence to be distinguished is predicted by the classification model to obtain a distinguishing result.

其中出现频率X_iF的计算公式为：The formula for calculating the frequency of occurrence X _i F is:

${X x}_{i i} N N = = {Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}))$

$T T = = {Σ Σ}_{i i = = 11}^{m m} {X x}_{i i} N N = = {Σ Σ}_{i i = = 11}^{m m} (({Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}));; m m = = 6464 * * 6464;; n no = = ((11 . . . . . . . . . . . . . . . . N N))$

${X x}_{i i} F f = = \frac{{X x}_{i i} N N}{T T}$

其中X代表着某种类型的ANT，S_j(X_i)是X在某一类序列集合中的某一条序列上的出现次数，X_iN是该种ANT在整个某种序列集合中的出现次数，T则表示所有种类的ANT在该数据集中的总共出现次数，m表示ANT的种类数目，n表示该类型的集合中所包含的序列条数。Among them, X represents a certain type of ANT, S _j (X _i ) is the number of occurrences of X in a certain sequence in a certain type of sequence set, and X _i N is the occurrence of this kind of ANT in the entire certain sequence set T represents the total number of occurrences of all types of ANTs in the data set, m represents the number of types of ANTs, and n represents the number of sequences contained in this type of collection.

进一步的，所述步骤3包括：Further, said step 3 includes:

步骤31，使用滑动窗口分别对编码序列和非编码序列的每条转录本序列按照六框读码的方式进行扫描；Step 31, using a sliding window to scan each transcript sequence of the coding sequence and the non-coding sequence according to the six-frame reading code;

步骤32，所述打分矩阵会在上述扫描的过程中对所述滑动窗口的每一个子窗口进行打分，将一条由核苷酸序列组成的转录本转化为六个数组，所述数组中的元素就是每个子窗口的窗口分值；Step 32, the scoring matrix will score each sub-window of the sliding window during the scanning process, and convert a transcript composed of nucleotide sequences into six arrays, the elements in the array is the window score of each sub-window;

步骤33，利用动态规划算法中的求最大子段和的方式在所述六个数组中的每一个数组中找出一条加和最大的子段，得到六个候选最大字段；Step 33, using the method of seeking the maximum subsection sum in the dynamic programming algorithm to find a subsection with the largest sum in each of the six arrays to obtain six candidate maximum fields;

步骤34，所述打分矩阵在所述六个候选最大字段中找出分值最大的那一条作为该转录本的最像CDS区域的特征子序列。Step 34, the scoring matrix finds the one with the highest score among the six maximum candidate fields as the characteristic subsequence of the transcript most similar to the CDS region.

而所述步骤33中加和最大的子段X计算公式为：And the calculation formula of the subsection X with the largest sum in the step 33 is:

$X x = = \underset{11 \leq \leq i i \leq \leq j j \leq \leq n no}{max max} {{{Σ Σ}_{k k = = i i}^{j j} a a [[k k]]}}$

a[k]是拥有这个最大子段和的最大子段，i和j分别代表a[k]这个最大子段在这种读码方式中的起始和终止位置。a[k] is the largest sub-segment with the largest sub-segment sum, and i and j respectively represent the start and end positions of the largest sub-segment a[k] in this reading mode.

进一步的，所述步骤5中：将所述待区分序列分为正、负训练集，然后将所述正、负训练集转换成支持向量机SVM所要求的输入格式，并将其放入SVM进行分类模型的训练，得到区分结果。Further, in the step 5: the sequence to be distinguished is divided into positive and negative training sets, then the positive and negative training sets are converted into the required input format of the support vector machine SVM, and put into the SVM Carry out the training of the classification model and obtain the distinguishing result.

为实现上述目的，本发明还提供了一种区分编码基因和非编码基因的系统，该系统包括：To achieve the above object, the present invention also provides a system for distinguishing coding genes and non-coding genes, the system comprising:

预处理模块，用于将样本集按照编码和非编码序列分为正、负两个训练集合，对正、负两个训练集合分别执行频率统计模块至特征提取模块；The preprocessing module is used to divide the sample set into positive and negative training sets according to the coding and non-coding sequences, and executes the frequency statistics module to the feature extraction module for the positive and negative training sets respectively;

频率统计模块，用于在训练集合中统计出每个ANT在编码序列、非编码序列和基因间区域序列中的出现频率并分别构建出现频率矩阵，基于三个出现频率矩阵通过log2-ratio运算构建打分矩阵；The frequency statistics module is used to count the occurrence frequency of each ANT in the coding sequence, non-coding sequence and intergenic region sequence in the training set and construct the occurrence frequency matrix respectively, which is constructed by log2-ratio operation based on three occurrence frequency matrices scoring matrix;

序列提取模块，所述打分矩阵利用滑动窗口进行打分的方式计算出窗口分值S-score，以此作为分类模型的第一个特征，并使用动态规划算法分别找出由所述样本序列的编码序列和非编码序列所转换成德数组内的具有最大子段和的区域作为特征子序列MLCDS，以所述MLCDS的长度作为分类模型的第二个特征；In the sequence extraction module, the scoring matrix calculates the window score S-score by using a sliding window to score, and uses this as the first feature of the classification model, and uses a dynamic programming algorithm to find out the codes of the sample sequence respectively. Sequence and non-coding sequence are converted into the region with the largest subsection sum in the array as the feature subsequence MLCDS, with the length of the MLCDS as the second feature of the classification model;

特征提取模块，利用i∈(1,2,3,4,5,6)获取分类模型的第三个特征，其中X是读码方式中MLCDS的长度，Yi代表全部六种读码方式中各自的MLCDS长度；The feature extraction module uses i∈(1,2,3,4,5,6) obtains the third feature of the classification model, where X is the length of the MLCDS in the reading mode, and Yi represents the length of the MLCDS in all six reading modes;

利用j∈(1,2,3,4,5)获取分类模型的第四个特征，其中S是读码方式中MLCDS的S-score，Ej代表剩下其他几种读码方式中MLCDS的S-score；use j∈(1,2,3,4,5) obtains the fourth feature of the classification model, where S is the S-score of MLCDS in the reading mode, and Ej represents the S-score of MLCDS in the remaining reading modes. score;

区分结果获得模块，利用所述正、负两个训练集合的五个特征组成正负两个特征向量集合来训练分类模型，待区分序列利用所述分类模型进行预测得到区分结果。The discrimination result obtaining module uses the five features of the positive and negative training sets to form two sets of positive and negative feature vectors to train the classification model, and the sequence to be distinguished is predicted by the classification model to obtain the discrimination result.

${X x}_{i i} N N = = {Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}))$

${X x}_{i i} F f = = \frac{{X x}_{i i} N N}{T T}$

进一步的，所述序列提取模块包括：Further, the sequence extraction module includes:

扫描模块，使用滑动窗口分别对编码序列和非编码序列的每条转录本序列按照六框读码的方式进行扫描；The scanning module uses a sliding window to scan each transcript sequence of the coding sequence and the non-coding sequence according to the six-frame reading code;

打分模块，所述打分矩阵会在上述扫描的过程中对所述滑动窗口的每一个子窗口进行打分，将一条由核苷酸序列组成的转录本转化为六个数组，所述数组中的元素就是每个子窗口的窗口分值；Scoring module, the scoring matrix will score each sub-window of the sliding window during the above-mentioned scanning process, and convert a transcript composed of nucleotide sequences into six arrays, the elements in the array is the window score of each sub-window;

候选字段获取模块，利用动态规划算法中的求最大子段和的方式在所述六个数组中的每一个数组中找出一条加和最大的子段，得到六个候选最大字段；The candidate field acquisition module uses the method of seeking the maximum subsection sum in the dynamic programming algorithm to find a subsection with the largest sum in each of the six arrays to obtain six candidate maximum fields;

序列选择模块，所述打分矩阵在所述六个候选最大字段中找出分值最大的那一条作为该转录本的最像CDS区域的特征子序列。In the sequence selection module, the scoring matrix finds the one with the highest score among the six candidate maximum fields as the characteristic subsequence most similar to the CDS region of the transcript.

而所述候选字段获取模块中加和最大的子段X计算公式为：And the formula for calculating the subsection X with the largest sum in the candidate field acquisition module is:

进一步的，所述区分结果获得模块中：将所述待区分序列分为正、负训练集，然后将所述正、负训练集转换成支持向量机SVM所要求的输入格式，并将其放入SVM进行分类模型的训练，得到区分结果。Further, in the module for obtaining the distinguishing result: the sequence to be distinguished is divided into positive and negative training sets, and then the positive and negative training sets are converted into the input format required by the support vector machine SVM, and put into Enter the SVM to train the classification model, and obtain the distinction result.

本发明的有益功效在于：The beneficial effects of the present invention are:

发现序列串连密码子对的频率是区分编码及非编码序列的有效方法，该串联密码子对的频率统计可以准确的将编码序列和非编码序列以及编码区域按照其他五种读码方式所产生的序列区分出来。It is an effective method to distinguish coding and non-coding sequences by finding the frequency of tandem codon pairs. The frequency statistics of the tandem codon pairs can accurately generate coding sequences, non-coding sequences and coding regions according to the other five reading methods sequence is distinguished.

采用动态规划的思想，基于滑动窗口的分析可以从一条完整的序列中提取出最能够代表这一条序列的子序列；通过用整个人类的编码基因转录本作为测试集合，我们的发明找到的这些子序列有98%以上都和标准的CDS区域有全部或者部分重合。Using the idea of dynamic programming, the analysis based on sliding window can extract the subsequence that best represents this sequence from a complete sequence; by using the entire human coding gene transcript as a test set, the subsequences found by our invention More than 98% of the sequences overlap with the standard CDS region in whole or in part.

提出了转录本自身差别特征。对于蛋白编码基因转录本和非编码基因转录本来说，不但两者之间在一些属性和特征上有所差别，而且两者内在的差别更加显著，针对蛋白编码基因的转录本，按照六框读码对其进行翻译产生六条不同的序列，并且发现其中有一条，不论是在代表序列的长度还是编码能力上都很显著的区别于其他五种。但是对非编码基因转录本来说不存在这种自身的差别。Differential features of transcripts themselves were proposed. For protein-coding gene transcripts and non-coding gene transcripts, there are not only differences in some attributes and characteristics between the two, but also more significant internal differences between the two. For protein-coding gene transcripts, read according to six frames Translating it into six different sequences, one of them was found to be significantly different from the other five in terms of the length of the representative sequence and the coding ability. But for non-coding gene transcripts this inherent difference does not exist.

综合序列最大字段分值、长度、比重、值的分布及密码子频率五种特征能够得到最佳的分类效果。The best classification results can be obtained by combining the five characteristics of the maximum field score, length, proportion, value distribution and codon frequency of the sequence.

以下结合附图和具体实施例对本发明进行详细描述，但不作为对本发明的限定。The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments, but not as a limitation of the present invention.

附图说明Description of drawings

图1a是本发明的一种区分编码基因和非编码基因的方法流程图；Fig. 1a is a flow chart of a method for distinguishing coding genes and non-coding genes of the present invention;

图1b是本发明的一种区分编码基因和非编码基因的系统示意图；Figure 1b is a schematic diagram of a system for distinguishing coding genes and non-coding genes according to the present invention;

图2是本发明与现有技术两种方法的ROC曲线比较示意图；Fig. 2 is the comparative schematic diagram of the ROC curve of the present invention and two kinds of methods of prior art;

图3是本发明的出现频率矩阵；Fig. 3 is the occurrence frequency matrix of the present invention;

图4是本发明的ANT在编码与非编码序列上出现的倾向性示意图；Figure 4 is a schematic diagram of the tendency of ANT of the present invention to appear on coding and non-coding sequences;

图5是本发明的S-score分布示意图；Fig. 5 is a schematic diagram of the S-score distribution of the present invention;

图6是本发明的读码方式序列的子窗口分值示意图；Fig. 6 is a schematic diagram of the sub-window score of the code-reading sequence of the present invention;

图7是本发明的MLCDS与标准的CDS序列的重合程度示意图；Fig. 7 is a schematic diagram of the coincidence degree between the MLCDS of the present invention and the standard CDS sequence;

图8是本发明的编码、者非编码RNA与平均值统计比较示意图。Fig. 8 is a schematic diagram of statistical comparison between coding and non-coding RNAs of the present invention and average values.

具体实施方式detailed description

图1a是本发明的一种区分编码基因和非编码基因的方法流程图。如图1a所示，该方法包括：Fig. 1a is a flowchart of a method for distinguishing coding genes and non-coding genes according to the present invention. As shown in Figure 1a, the method includes:

步骤2，,在训练集合中统计出每个相邻核苷酸三聚体ANT在编码序列、非编码序列和基因间区域序列中的出现频率并分别构建出现频率矩阵，基于三个出现频率矩阵通过log2-ratio运算构建打分矩阵；Step 2, count the occurrence frequency of each adjacent nucleotide trimer ANT in the coding sequence, non-coding sequence and intergenic region sequence in the training set and construct the occurrence frequency matrix respectively, based on three occurrence frequency matrices Construct scoring matrix through log2-ratio operation;

步骤3，所述打分矩阵利用滑动窗口进行打分的方式计算出窗口分值S-score，以此作为分类模型的第一个特征，并使用动态规划算法分别找出由所述样本集中的编码序列和非编码序列内所转换成的数组内的具有最大子段和的区域作为特征子序列MLCDS，并且以所述MLCDS的长度作为分类模型的第二个特征，其中最大子段和是指在一个由正数和负数组成的数组中，加和最大的一个连续的子段，即如果对这个连续子段的首尾删除或者增加一个在数组中位置相邻的元素都会降低这个连续子段的加和值；Step 3, the scoring matrix calculates the window score S-score by using a sliding window to score, and uses this as the first feature of the classification model, and uses a dynamic programming algorithm to find out the coding sequences in the sample set And the region with the largest sub-segment sum in the array converted into the non-coding sequence is used as the feature sub-sequence MLCDS, and the length of the MLCDS is used as the second feature of the classification model, wherein the largest sub-segment sum refers to a In an array composed of positive and negative numbers, the continuous subsection with the largest sum, that is, if you delete the beginning and end of this continuous subsection or add an adjacent element in the array, the sum of this continuous subsection will be reduced value;

步骤4，利用i∈(1,2,3,4,5,6)获取分类模型的第三个特征，其中X是读码方式中MLCDS的长度，Yi代表全部六种读码方式中各自的MLCDS长度；Step 4, use i∈(1,2,3,4,5,6) obtains the third feature of the classification model, where X is the length of the MLCDS in the reading mode, and Yi represents the length of the MLCDS in all six reading modes;

${X x}_{i i} N N = = {Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}))$

${X x}_{i i} F f = = \frac{{X x}_{i i} N N}{T T}$

进一步的，所述步骤3包括：Further, said step 3 includes:

图1b是本发明的一种区分编码基因和非编码基因的系统示意图。如图1b所示，该系统包括：Fig. 1b is a schematic diagram of a system for distinguishing coding genes and non-coding genes according to the present invention. As shown in Figure 1b, the system consists of:

预处理模块100，用于将样本集按照编码和非编码序列分为正、负两个训练集合，对正、负两个训练集合分别执行频率统计模块至特征提取模块；The preprocessing module 100 is used to divide the sample set into positive and negative two training sets according to the coding and non-coding sequences, and executes the frequency statistics module to the feature extraction module respectively for the positive and negative two training sets;

频率统计模块200，用于在训练集合中统计出每个ANT在编码序列、非编码序列和基因间区域序列中的出现频率并分别构建出现频率矩阵，基于三个出现频率矩阵通过log2-ratio运算构建打分矩阵；The frequency statistics module 200 is used to count the frequency of occurrence of each ANT in the coding sequence, non-coding sequence and intergenic region sequence in the training set and construct the frequency matrix respectively, based on three frequency matrices by log2-ratio operation Build a scoring matrix;

序列提取模块300，所述打分矩阵利用滑动窗口进行打分的方式计算出窗口分值S-score，以此作为分类模型的第一个特征，并使用动态规划算法分别找出由所述样本序列的编码序列和非编码序列所转换成的数组内的具有最大子段和的区域作为特征子序列MLCDS，以所述MLCDS的长度作为分类模型的第二个特征，其中最大子段和是指在一个由正数和负数组成的数组中，加和最大的一个连续的子段，即如果对这个连续子段的首尾删除或者增加一个在数组中位置相邻的元素都会降低这个连续子段的加和值；In the sequence extraction module 300, the scoring matrix calculates the window score S-score by using a sliding window to score, and uses this as the first feature of the classification model, and uses a dynamic programming algorithm to find out the S-scores of the sample sequences respectively. The area with the largest sub-segment sum in the array converted from the coding sequence and the non-coding sequence is used as the feature sub-sequence MLCDS, and the length of the MLCDS is used as the second feature of the classification model, wherein the maximum sub-segment sum refers to a In an array composed of positive and negative numbers, the continuous subsection with the largest sum, that is, if you delete the beginning and end of this continuous subsection or add an adjacent element in the array, the sum of this continuous subsection will be reduced value;

特征提取模块400，利用i∈(1,2,3,4,5,6)获取分类模型的第三个特征，其中X是读码方式中MLCDS的长度，Yi代表全部六种读码方式中各自的MLCDS长度；The feature extraction module 400 utilizes i∈(1,2,3,4,5,6) obtains the third feature of the classification model, where X is the length of the MLCDS in the reading mode, and Yi represents the length of the MLCDS in all six reading modes;

区分结果获得模块500，利用所述正、负两个训练集合的五个特征组成正负两个特征向量集合来训练分类模型，待区分序列利用所述分类模型进行预测得到区分结果。The discrimination result obtaining module 500 uses the five features of the positive and negative training sets to form two positive and negative feature vector sets to train the classification model, and the sequence to be distinguished is predicted by the classification model to obtain the discrimination result.

${X x}_{i i} N N = = {Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}))$

${X x}_{i i} F f = = \frac{{X x}_{i i} N N}{T T}$

进一步的，所述序列提取模块300包括：Further, the sequence extraction module 300 includes:

$X x = = \underset{11 \leq \leq i i \leq \leq j j \leq \leq n no}{max max} {{{Σ Σ}_{k k = = 11}^{j j} a a [[k k]]}}$

进一步的，所述区分结果获得模块500中：将所述待区分序列分为正、负训练集，然后将所述正、负训练集转换成支持向量机SVM所要求的输入格式，并将其放入SVM进行分类模型的训练，得到区分结果。中：将所述待区分序列分为正、负训练集，然后将所述正、负训练集转换成支持向量机SVM所要求的输入格式，并将其放入SVM进行分类模型的训练，得到区分结果。Further, in the discrimination result obtaining module 500: the sequence to be distinguished is divided into positive and negative training sets, and then the positive and negative training sets are converted into the input format required by the support vector machine SVM, and its Put it into SVM to train the classification model, and get the distinction result. Middle: Divide the sequence to be distinguished into positive and negative training sets, then convert the positive and negative training sets into the input format required by the support vector machine SVM, and put it into the SVM for classification model training, and obtain Differentiate the results.

图2是本发明与现有技术两种方法的ROC曲线比较示意图。本发明使用小鼠的测试集合来对CNCI、CPC和phyloCSF进行比较，并且给出ROC曲线。在ROC曲线的基础上，本文还给出了一个更加直观的评价指标：最小错误率点，即在该点上平均的假阳性率和假阴性率的乘积达到最低，换句话说这个最小错误率点就是使分类模型达到均衡的点，是分类模型真实性能体现。整个测试及比较的过程分为三步：第一，我们在本地服务器运行CNCI来计算测试样本中的编码和非编码RNA并且纪录结果。第二，我们将测试样本中的编码和非编码的RNA数据提交到北京大学CPC的wed-server上，等待返回结果并纪录。第三为了运行phyloCSF（downloadedfromhttp://compbio.mit.edu/PhyloCSF)我们将测试样本的bed文件和小鼠的全基因组注释信息上传的Galaxy平台上，在该平台上使用“stitchMAFblocks”这个功能模块，计算出29个哺乳动物的多序列比对数据，并且在unix下转换成phyloCSF所需要的输入格式，然后根据phyloCSF的帮助文档以如下参数配置运行该软件--minCodons=30,--orf=ATGStop,--strategy=fixed,--frames=3,and--removeRefGaps。这三种软件会对所分类的序列给出一个分数值，以零为界限来区分编码和非编码转录本，分值越大则说明该序列的编码能力越强否则就越弱。本文也是通过滑动这三种软件最终给出的分值来计算ROC曲线及确定最小错误率的。图中用一条实线和两种不同型号的虚线来代表这三种软件各自的ROC曲线，并且用实心方块标注出最小错误率点由该图给出的信息我们可以确信，相对于小鼠的测试数据集，CNCI有着比CPC和phyloCSF更可靠的分类效果，和更低的最小错误率，分别为0.05、0.11和0.28(图2)。且耗时能力及便捷度均优于其他两款方法的3倍以上。Fig. 2 is a schematic diagram comparing the ROC curves of the present invention and the prior art. The present invention uses a test set of mice to compare CNCI, CPC and phyloCSF, and presents a ROC curve. On the basis of the ROC curve, this paper also gives a more intuitive evaluation index: the minimum error rate point, that is, the product of the average false positive rate and false negative rate at this point reaches the lowest, in other words, the minimum error rate The point is the point where the classification model reaches equilibrium, and it is the true performance of the classification model. The entire testing and comparison process is divided into three steps: First, we run CNCI on the local server to calculate the coding and non-coding RNA in the test sample and record the results. Second, we submitted the coding and non-coding RNA data in the test sample to the wed-server of Peking University CPC, and waited for the results to be returned and recorded. Third, in order to run phyloCSF (downloaded from http://compbio.mit.edu/PhyloCSF), we upload the bed file of the test sample and the whole genome annotation information of the mouse to the Galaxy platform, and use the "stitchMAFblocks" function module on this platform , calculate the multiple sequence alignment data of 29 mammals, and convert it into the input format required by phyloCSF under unix, and then run the software with the following parameter configurations according to the help documentation of phyloCSF --minCodons=30, --orf= ATGStop, --strategy=fixed, --frames=3, and --removeRefGaps. These three software will give a score value to the classified sequence, and use zero as the boundary to distinguish coding and non-coding transcripts. The larger the score, the stronger the coding ability of the sequence, otherwise the weaker it is. This article also calculates the ROC curve and determines the minimum error rate by sliding the final scores given by these three softwares. In the figure, a solid line and two different types of dotted lines are used to represent the respective ROC curves of the three softwares, and the minimum error rate point is marked with a solid square In the test data set, CNCI has a more reliable classification effect than CPC and phyloCSF, and lower minimum error rates, which are 0.05, 0.11 and 0.28, respectively (Figure 2). And the time-consuming ability and convenience are more than 3 times better than the other two methods.

CNCI整体的架构包括两个主要的部分：CNCI打分矩阵的构建和分类模型的建立。首先我们在作为训练集的物种中统计出每个ANT在编码序列、非编码序列和基因间区域序列中的出现频率，并且分别构建了三个出现频率矩阵。以ANT在基因间区域序列中的出现频率作为参考，本发明用ANT在编码序列的频率矩阵与ANT在非编码序列的频率矩阵做log2-ratio运算，从而构建出我们的CNCI打分矩阵，该矩阵是在本发明中首次被提出的。The overall structure of CNCI includes two main parts: the construction of CNCI scoring matrix and the establishment of classification model. First, we counted the occurrence frequency of each ANT in the coding sequence, non-coding sequence and intergenic region sequence in the species used as the training set, and constructed three occurrence frequency matrices respectively. Taking the frequency of occurrence of ANT in the intergenic region sequence as a reference, the present invention uses the frequency matrix of ANT in the coding sequence and the frequency matrix of ANT in the non-coding sequence to perform a log2-ratio operation, thereby constructing our CNCI scoring matrix, the matrix It is proposed for the first time in this invention.

其次在此打分矩阵的基础上，我们开始构建分类模型，在模型的一开始本发明使用一个滑动窗口对每条转录本序列按照六框读码的方式进行扫描，在扫描的过程中CNCI打分矩阵会对每一个子窗口进行打分，从而将一条由核苷酸序列组成的转录本转化为六个数组，然后引入动态规划算法中的求最大子段和的方法，在每一个数组中都找出一条加和最大的子段。在六框扫描结束之后CNCI会在这六个候选的最大子段中自动找出分值最大的那一条作为该转录本的特征子序列，也就是最像CDS区域的序列（MLCDS）。Secondly, on the basis of this scoring matrix, we began to build a classification model. At the beginning of the model, the present invention uses a sliding window to scan each transcript sequence according to the six-frame reading code. During the scanning process, the CNCI scoring matrix Each sub-window will be scored, so that a transcript composed of nucleotide sequences is converted into six arrays, and then the method of finding the maximum sub-segment sum in the dynamic programming algorithm is introduced to find out in each array A subsection with the largest sum. After the scan of the six frames, CNCI will automatically find the one with the highest score among the six candidate maximum subsegments as the characteristic subsequence of the transcript, that is, the sequence most similar to the CDS region (MLCDS).

最后CNCI通过滑动窗口的方法在训练样本的编码和非编码序列集合中都分别计算出各自的MLCDS集合，并基于每条MLCDS提取五个具有生物学价值和统计学意义的特征，在特征提取完毕之后，我们按照正负样本的标签将其送入SVM模型中进行训练，训练完成之后即可实现对未知的RNA序列实施分类的工作。Finally, CNCI calculates the respective MLCDS sets in the coded and non-coded sequence sets of the training samples through the sliding window method, and extracts five features with biological value and statistical significance based on each MLCDS. After that, we send them into the SVM model for training according to the labels of the positive and negative samples. After the training is completed, the work of classifying unknown RNA sequences can be realized.

本文的研究中，我们分析了ANT在CDS和非编码区域的使用频率，其中非编码区域包括非编码RNA和内含子序列。CDS区域上的密码子或者基因组序列上的核苷酸三聚体一共有64种，因此存在着64×64种ANT，在这里我们按照下面的公式统计了每一种ANT在三种不同类型序列集合中的出现频率：In this study, we analyzed the frequency of ANT usage in CDS and non-coding regions, which include non-coding RNAs and intronic sequences. There are 64 kinds of codons on the CDS region or nucleotide trimers on the genome sequence, so there are 64×64 kinds of ANTs. Here we count each ANT in three different types of sequences according to the following formula Frequency of occurrence in collection:

${X x}_{i i} N N = = {Σ Σ}_{j j = = 11}^{n no} {S S}_{j j} (({X x}_{i i}))$

${X x}_{i i} F f = = \frac{{X x}_{i i} N N}{T T}$

在上述的公式中X代表着某种类型的ANT，S_j(X_i)是X在某一类序列集合中的某一条序列上的出现次数，X_iN根据公式计算则是该种ANT在整个某种序列集合中的出现次数，T则表示所有种类的ANT在该数据集中的总共出现次数。其中m表示ANT的种类数目，n表示该类型的集合中所包含的序列条数。所以X_iF就是我们要计算的某一种ANT在某一类数据集中的使用频率。In the above formula, X represents a certain type of ANT, S _j (X _i ) is the number of occurrences of X on a certain sequence in a certain type of sequence set, and X _i N is calculated according to the formula, which means that this type of ANT is in The number of occurrences in the entire set of certain sequences, and T represents the total number of occurrences of all types of ANTs in the data set. Among them, m represents the number of types of ANT, and n represents the number of sequences contained in this type of collection. So X _i F is the frequency of use of a certain ANT in a certain type of data set that we want to calculate.

我们按照编码区域，非编码RNA和内含子序列三种数据集合，分别在人类和小鼠的RNA数据集上使用上述公式计算了ANT的出现频率，并且构建了出现频率矩阵（图3）。横坐标和纵坐标分别代表64种核苷酸三聚体的类型，颜色越深就说明该种ANT的出现频率越高。我们可以看出，不论是对人类还是小鼠来说，它们的ANTs在编码区域的频率图谱都非常接近，这说明对于ANTs来说人类和小鼠还处在同一个进化模型之内，这也是由于进化的距离没有相差太远的缘故，这两个物种之间的编码蛋白区域还受着同一类型的选择压力的制约。而相对于ANTs在非编码和内含子序列上的使用频率来说，不论是在人类还是小鼠上，都与它们在CDS区域的表达图谱有着显著的差别。这组统计结果说明ANTs在编码区域和非编码区域呈现出一种有偏的分布，而这种特性也为我们区分编码与非编码转录本提供了充分的理论基础。According to the three data sets of coding region, non-coding RNA and intron sequence, we used the above formula to calculate the occurrence frequency of ANT on the human and mouse RNA data sets, and constructed the occurrence frequency matrix (Figure 3). The abscissa and ordinate respectively represent the types of 64 nucleotide trimers, and the darker the color, the higher the occurrence frequency of this ANT. We can see that the frequency maps of their ANTs in the coding region are very similar for both humans and mice, which shows that humans and mice are still in the same evolutionary model for ANTs, which is also Because the evolutionary distance is not too far apart, the protein-coding regions between the two species are also subject to the same type of selective pressure. Relative to the frequency of use of ANTs in non-coding and intronic sequences, both in humans and mice, there are significant differences from their expression profiles in CDS regions. This set of statistical results shows that ANTs present a biased distribution in coding regions and non-coding regions, and this characteristic also provides a sufficient theoretical basis for us to distinguish coding and non-coding transcripts.

本发明介绍了ANTs在人类和小鼠上的不同基因区间的使用频率情况，并且证明了ANTs在编码区域与非编码区域的使用频率有着显著的差别，但是并没有给出这种差别的具体表现形式，既某一种ANT究竟是倾向于在编码区域或是在非编码区域出现的更多一些、是否有一些ANTs就是特异的喜好在编码或者非编码区域出现。我们分别计算了ANT在人类编码RNA集合中的出现频率与ANT在人类非编码RNA集合中的出现频率的log2-ratio，并且定义矩阵：The present invention introduces the frequency of use of ANTs in different gene intervals of humans and mice, and proves that there is a significant difference in the frequency of use of ANTs in the coding region and the non-coding region, but does not give the specific performance of this difference Form, that is, whether a certain ANT tends to appear more in the coding region or in the non-coding region, and whether some ANTs have a specific preference to appear in the coding or non-coding region. We calculated the log2-ratio of the frequency of ANT in the human coding RNA set and the frequency of ANT in the human non-coding RNA set, respectively, and defined the matrix:

$H h - - matrix matrix = = log log ((\frac{HC HC - - matrix matrix}{HN HN - - matrix matrix}))$

用同样的方法在小鼠集合中也定义了矩阵：The matrix is also defined in the mice collection in the same way:

$M m - - matrix matrix = = log log ((\frac{MC MC - - matrix matrix}{MN MN - - matrix matrix}))$

如图4所描述，其上面每一个点都代表某一种ANT在编码与非编码序列上出现的倾向性，颜色越趋近于灰色就说明该种ANT越喜欢在编码区域出现，反之越趋近于黑色就越喜欢在非编码区域出现。可以看出，显著倾向于在非编码区域出现的都是包含终止密码子的ANT，既这些ANT都是以tag、taa和tga及其互补对开头或者结尾的。而倾向于在编码区域出现的ANT基本上都是随机的。这也证明了ANT的使用频率在非编码区域大体上是均匀分布的，而在编码区域某些ANT是特异表达的。过计算这两个矩阵的表达图谱也是基本相同的，这充分说明该打分矩阵在人类到小鼠的这段进化距离范围内基本上是可以通用的，因为人类的基因组数据被研究和关注的更多，所以相对于其他物种，目前被各大数据库所收录的转录本数据中，人类的数据量最大并且质量最高，所以我们接下来的训练样本所用的都是来自人类的数据，提到的CNCI分类模型也都是指用人类RNA数据所训练出来的模型。As shown in Figure 4, each point on it represents the tendency of a certain ANT to appear in the coding and non-coding sequences. The closer the color is to gray, the more the ANT likes to appear in the coding region, and vice versa. The closer to black the more it likes to appear in non-coding areas. It can be seen that the ANTs that are significantly inclined to appear in the non-coding region are all ANTs containing stop codons, that is, these ANTs all start or end with tag, taa and tga and their complementary pairs. The ANTs that tended to appear in coding regions were basically random. This also proves that the frequency of ANT usage is roughly evenly distributed in non-coding regions, while some ANTs are specifically expressed in coding regions. The calculated expression profiles of these two matrices are also basically the same, which fully demonstrates that the scoring matrix is basically universal within the range of evolutionary distance from humans to mice, because human genome data are studied and paid more attention to. There are many, so compared to other species, among the transcript data currently collected by major databases, human data is the largest and the highest quality, so our next training samples are all data from human beings. The mentioned CNCI Classification models also refer to models trained with human RNA data.

根据上面构建的CNCI打分矩阵，我们定义了一个名叫序列分值的指标（sequence-score:S-score）来衡量一段序列的编码能力。S-score的计算方法定义如下：According to the CNCI scoring matrix constructed above, we define an index called sequence score (sequence-score: S-score) to measure the coding ability of a sequence. The calculation method of S-score is defined as follows:

$S S = = {Σ Σ}_{i i = = 11}^{n no} {{{H h}_{p p} (({x x}_{n no}))}}$

在这个计算表达式中，S就代表序列的S-score，Hp是CNCI打分矩阵，X代表各种类型的ANTs，n代表该序列以核苷酸三聚体为单位，经过换算之后所得的长度。（一个核苷酸三聚体=3个核苷酸）。经过该S-score的计算我们就可以对任意一条序列给出一个反应其编码能力的分值，这个值越大就说明这段序列的编码能力越强，反之则编码能力越弱。本发明通过一下数据验证这个S-score是否能够有效的区分出真正编码、非编码序列以及编码序列的其他5种读码方式：首先，我们计算了人类的30507条CDS序列的六种编码方式的以及18566条长非编码RNAsd的分值，得到了7组由序列分值组成的数组，并且将这7个数组中的序列分值按照分布密度进行划分。其次我们应用基于人类的CNCI打分矩阵来计算真正的CDS区域、CDS区域的移位读码序列1、In this calculation expression, S represents the S-score of the sequence, Hp is the CNCI scoring matrix, X represents various types of ANTs, and n represents the length of the sequence after conversion in units of nucleotide trimers . (One nucleotide trimer = 3 nucleotides). After the calculation of the S-score, we can give any sequence a score reflecting its coding ability. The larger the value, the stronger the coding ability of this sequence, and vice versa. The present invention uses the following data to verify whether this S-score can effectively distinguish the real coding sequence, non-coding sequence and the other 5 reading methods of the coding sequence: First, we calculated the six coding methods of 30507 human CDS sequences And the scores of 18566 long non-coding RNAsd, get 7 sets of arrays composed of sequence scores, and divide the sequence scores in these 7 arrays according to the distribution density. Secondly, we apply the human-based CNCI scoring matrix to calculate the true CDS region, the shifted read sequence of the CDS region 1,

CDS区域的移位读码序列2、CDS区域的反链读码序列1、CDS区域的反链读码序列2、CDS区域的反链读码序列3和长非编码RNA的S-score（图5）。图中黑色的实线表示真正CDS区域的S-score分布，5条黑色的虚线线代表5种错误读码序列的S-score分布，而黑色的阴影部分则代码长非编码RNA的S-score分布情况。从图上我们可以明显的看到，真正CDS区域普遍都具有较高的S-score，而且绝大部分都大于零，而错误的读码方式和长非编码RNA的S-score分值基本都分布在零以下，并且这两种序列的分值密度分布几乎没有差别。据此我们可以得到两个信息：第一，S-score的方法可以将编码序列与错误的读码方式同非编码序列显著的区分出来，第二，错误的读码方式与非编码序列在S-score的划分下几乎没有什么差别。The shifted reading frame sequence 2 in the CDS region, the anti-strand reading sequence 1 in the CDS region, the anti-strand reading sequence 2 in the CDS region, the anti-strand reading sequence 3 in the CDS region, and the S-score of the long non-coding RNA (Fig. 5). The black solid line in the figure represents the S-score distribution of the true CDS region, the five black dotted lines represent the S-score distribution of the five misread sequences, and the black shaded part represents the S-score of the code long non-coding RNA Distribution. From the figure, we can clearly see that the real CDS regions generally have higher S-scores, and most of them are greater than zero, while the S-scores of wrong reading methods and long non-coding RNAs are basically the same. distribution below zero, and there is little difference in the distribution of score densities for the two series. According to this, we can get two pieces of information: first, the S-score method can significantly distinguish the coding sequence and the wrong reading method from the non-coding sequence, and second, the wrong reading method and the non-coding sequence are in the S-score There is almost no difference under the division of -score.

本文的研究目的是开发一个能够区分编码和非编码RNA的方法，虽然CNCI打分矩阵可以很好的识别CDS区域和其他种类的非编码序列，但是一个真正的编码蛋白RNA除了有CDS区域以外还有5’非编码端（5’UTR）和一个较长的3’非编码端(3’UTR)，这两端都是在编码蛋白转录本上的非编码区域，它们的存在会在很大的程度上降低一个编码序列的S-score，因此我们还不能将这个打分矩阵直接应用到全长的RNA序列上。为了解决这个问题，我们引入了滑动窗口的方法来辅助分析全长的RNA，这个滑动窗口的跨度可以为N（10-100）个核苷酸三聚体，窗口的步长为1个核苷酸三聚体，也就是每滑动一次，窗口会移动3个核苷酸的长度，通过该方法，我们就可以将一个全长的转录本序列划分为n个（n为序列换算为核苷酸三聚体之后的长度-1，之所以做换算是因为方便按照六种读码方式对序列进行扫描）子序列，子序列的长度就是滑动窗口的大小。我们使用CNCI打分矩阵对每一个子窗口进行打分，计算其S-score，由于一条全长的转录本序列有六种读码方式，所以滑动窗口会对任意一条序列按照六种读码方式扫描六次，于是在滑动窗口的分析下，任意一条序列都会被转化为6个独立的数组，数组中的元素就是其原始读码方式中每个子窗口的S-score。我们分别用人类的编码基因转录本集合和gencode.v11数据库中的非编码转录本集合作为测试集，计算了每一条序列的子窗口分值分布情况，并且对这两个集合的分布情况作了整合。通过图谱可以看到，在同一个编码基因转录本的6个分值集合中，我们可以找到一个具有显著区别的曲线（由加粗的实线表示），该曲线向上突出的部分就是真正的CDS区域，而在非编码基因的转录本中就不具备该性质，为了验证本发明发现的这个现象具有普遍性和可重复性，我们通过抽样插值的方法对30507条序列进行了归一化处理，将这些序列都放缩到同一长度，加粗的实线代表，所有编码基因正确的读码方式序列的子窗口分值（图6）。The purpose of this study is to develop a method that can distinguish between coding and non-coding RNAs. Although the CNCI scoring matrix can identify CDS regions and other types of non-coding sequences very well, a real coding protein RNA has The 5' non-coding end (5'UTR) and a longer 3' non-coding end (3'UTR), both of which are non-coding regions on the coding protein transcript, their presence will be in a large The S-score of a coding sequence is reduced to a certain extent, so we cannot directly apply this scoring matrix to the full-length RNA sequence. In order to solve this problem, we introduced a sliding window method to assist in the analysis of full-length RNA. The span of this sliding window can be N (10-100) nucleotide trimers, and the window step size is 1 nucleotide Acid trimer, that is, every time you slide, the window will move 3 nucleotides in length. By this method, we can divide a full-length transcript sequence into n pieces (n is the sequence converted to nucleotides The length after the trimer -1, the reason for the conversion is because it is convenient to scan the sequence according to the six reading methods) subsequence, the length of the subsequence is the size of the sliding window. We use the CNCI scoring matrix to score each sub-window and calculate its S-score. Since a full-length transcript sequence has six reading modes, the sliding window will scan any sequence in six ways. times, so under the analysis of the sliding window, any sequence will be converted into 6 independent arrays, and the elements in the array are the S-scores of each sub-window in the original reading method. We used the human coding gene transcript set and the non-coding transcript set in the gencode.v11 database as the test set, calculated the sub-window score distribution of each sequence, and made a comparison of the distribution of the two sets integrate. It can be seen from the map that in the 6 score sets of the same coding gene transcript, we can find a curve with significant difference (indicated by the bold solid line), and the upward protruding part of the curve is the real CDS Regions, but the transcripts of non-coding genes do not have this property. In order to verify the universality and repeatability of this phenomenon found in the present invention, we normalized the 30507 sequences by sampling interpolation method, Scale these sequences to the same length, and the bold solid line represents the sub-window score of the correct reading sequence of all coding genes (Figure 6).

根据上述研究我们已经发现，对于编码序列来说通常都会找到一种特殊的读码方式，该读码方式的子窗口数组中存在一段连续并且较长S-score，而在非编码序列中我们并没有看到这一性质。所以，解决RNA分类问题的关键就是：如何能够从编码或者非编码RNA中都提取出一段具有最优编码能力的子序列，并且能够保证来自于编码RNA中的子序列不论是在S-score的分值还是子序列的长度都显著的高于来自非编码的子序列。在本文中我们采用了动态规划算法的思想，用求最大子段和的方式来提取上述的子序列，由于这段子序列在编码RNA中基本上就等同于CDS区域，所以我们给它命名为最像CDS的序列（MLCDS），整个求最大子段和并且找出MLCDS的流程如下：According to the above research, we have found that for coding sequences, there is usually a special reading mode, and there is a continuous and long S-score in the sub-window array of this reading mode, while in non-coding sequences, we do not Did not see this nature. Therefore, the key to solving the RNA classification problem is: how to extract a subsequence with optimal coding ability from coding or non-coding RNA, and ensure that the subsequence from the coding RNA is in the S-score Both the score and the length of the subsequence are significantly higher than those from non-coding subsequences. In this paper, we use the idea of dynamic programming algorithm to extract the above-mentioned subsequence by seeking the maximum subsegment sum. Since this subsequence is basically equivalent to the CDS region in the coding RNA, we named it the most Like a CDS sequence (MLCDS), the entire process of finding the maximum sub-segment sum and finding the MLCDS is as follows:

在这里X代表一种读码方式的最大子段和，a[k]是拥有这个最大子段和的最大子段，i和j分别代表a[k]这个最大子段在这种读码方式中的起始和终止位置，我们的目的是要计算出X、i和j，因此定义了如下经过改进的公式：Here X represents the largest sub-segment sum of a code-reading method, a[k] is the largest sub-segment with this largest sub-segment sum, and i and j represent the largest sub-segment of a[k] in this code-reading mode. The starting and ending positions in , our purpose is to calculate X, i and j, so the following improved formula is defined:

$B B [[j j]] = = \underset{11 \leq \leq j j \leq \leq n no}{max max} {{{Σ Σ}_{k k = = m m}^{j j} a a [[k k]]}}$

m是一个变量,b[j]是局部的最大子段和组，a[j]是b[j]的最右端,因此我们可以定义下面的一些准则：m is a variable, b[j] is the local maximum sub-segment and group, a[j] is the rightmost end of b[j], so we can define the following criteria:

$X x = = \underset{11 \leq \leq j j \leq \leq n no}{max max} b b [[j j]]$

根据b[j]的定义我们可以得出一个结论：当b[j-1]>0时不论a[j]为何值都会有,b[j]=b[j-1]+a[j]。而当b[j-1]<0时则不论a[j]为何值都有，b[j]=a[j]。因此：According to the definition of b[j], we can draw a conclusion: when b[j-1]>0, no matter what value a[j] has, b[j]=b[j-1]+a[j] . And when b[j-1]<0, no matter what value a[j] is, b[j]=a[j]. therefore:

$b b [[j j]] = = \underset{11 \leq \leq j j \leq \leq n no}{max max} {{b b [[j j - - 11]] + + a a [[j j]],, a a [[j j]]}}$

经过上面的几步计算我们就可以得到任意一条编码RNA的任意一种读码方式的子窗口数组的最大子段和，也就是最具有编码能力的那一段子序列MLCDS,而六种读码方式则对应了六个候选的MLCDS,在本文中我们选择值最大的那一个作为真正的MLCDS，并且,它所在的读码方式就被认定为正确的读码方式。因为对编码和非编码RNA我们都进行了同样的操作，所以从非编码RNA中无论怎样都会筛选出一条MLCDS。After the above steps of calculation, we can get the maximum sub-segment sum of the sub-window array of any reading method of any encoding RNA, that is, the subsequence MLCDS with the most coding ability, and the six reading methods It corresponds to six candidate MLCDSs. In this paper, we choose the one with the largest value as the real MLCDS, and the code-reading method it is in is identified as the correct code-reading method. Since we did the same for both coding and non-coding RNAs, one MLCDS would be selected from the non-coding RNAs anyway.

MLCDS的提取是构建整个分类模型的重点，也可以说这些个MLCDS的质量直接决定CNCI最终的分类效果。所以为了证明模型的可信度，我们不得不全面的评估一下我们所找到的这些MLCDS的质量。由于非编码RNA是没有参考CDS区域的，所以本发明参照hg19上的30507个标准的CDS，并且依照它们原始的全长RNA序列，提取出这些CDS所属的读码方式及它们在该读码方式中的起始和终止位置，然后以这些信息为参考条件，和我们从全长RNA中找到的MLCDS做了一下比较（图7）。图中的各种数字表明了MLCDS与标准CDS的重合程度，饼图中的百分数则表示有百分之多少的MLCDS符合这种重合程度，例如图中数字1代表的部分说明我们找到的MLCDS中，与标准的CDS序列的重合程度大于95%的占到了25%。而数字9表示仅仅有1.7%的MLCDS与同其相对应的CDS有小于40%的重合。The extraction of MLCDS is the focus of building the entire classification model. It can also be said that the quality of these MLCDS directly determines the final classification effect of CNCI. So in order to prove the credibility of the model, we had to comprehensively evaluate the quality of the MLCDS we found. Since the non-coding RNA does not refer to the CDS region, the present invention refers to 30,507 standard CDSs on hg19, and extracts the reading frames to which these CDS belong and their reading frames according to their original full-length RNA sequences The starting and ending positions of the RNA, and then using this information as a reference condition, compared with the MLCDS we found from the full-length RNA (Figure 7). The various numbers in the figure indicate the degree of coincidence between MLCDS and standard CDS, and the percentages in the pie chart indicate how many percent of MLCDS meet this degree of coincidence. For example, the part represented by the number 1 in the figure shows that in the MLCDS we found , 25% of which have a coincidence degree greater than 95% with the standard CDS sequence. And the number 9 means that only 1.7% of MLCDSs overlap with their corresponding CDSs by less than 40%.

为了训练分类模型，以便于区分编码和长非编码RNA，在本发明中定义了五个分类特征。首先对于编码RNA来说,它们都普遍具有一段很长而且编码质量较高的CDS区域，而非编RNA不具有这种性质，目前最新的研究发现有一些长非编码RNA中可能会含有一些较短的肽链，但是其长度和编码能力都远逊于真正的CDS区域，而我们找到的MLCDS序列正好可以反映出这一特性。所以MLCDS序列的长度和其S-score的值在本文中被定义为两个最基本的分类特征，它们分别被命名为M-Score和M-Length。其次，根据我们上面的讨论，在编码RNA的六种读码方式中，总会出现一个在子窗口的S-score分布上显著区别于其他五种的方式的子序列，但是非编码RNA却并没有这种属性。根据这种现象，我们从序列自身差别的角度定义了两个原创性的特征，这种特征提取的方法和思想从没有在其他的相关文献中报道过。所谓序列自身的差别就是指，如果分析的是一条编码RNA那么就一定会存在一种指标，该指标可以体现出在这条RNA的六种读码方式中存在着一条特殊的序列，该序列加强了这六个数组之间的离散程度，但是如果是一条非编码RNA，那么这种离散程度就相对不会那么显著。因此，这两个新特征，LENGTH-Percentage和SCORE-distance定义如下：In order to train a classification model to distinguish coding and long non-coding RNAs, five classification features are defined in the present invention. First of all, for coding RNAs, they generally have a long CDS region with high coding quality, while non-coding RNAs do not have this property. The latest research has found that some long non-coding RNAs may contain some relatively high-quality CDS regions. Short peptide chain, but its length and coding ability are far inferior to the real CDS region, and the MLCDS sequence we found can just reflect this characteristic. Therefore, the length of the MLCDS sequence and its S-score value are defined as the two most basic classification features in this paper, and they are named M-Score and M-Length respectively. Secondly, according to our discussion above, in the six reading modes of coding RNA, there will always be a subsequence that is significantly different from the other five ways in the S-score distribution of the sub-window, but non-coding RNA does not. There is no such property. According to this phenomenon, we define two original features from the perspective of the sequence's own differences. This feature extraction method and idea have never been reported in other related literature. The so-called difference in the sequence itself means that if a coding RNA is analyzed, there must be an indicator, which can reflect that there is a special sequence among the six reading methods of this RNA, and this sequence strengthens the However, if it is a non-coding RNA, then this degree of dispersion is relatively insignificant. Therefore, the two new features, LENGTH-Percentage and SCORE-distance are defined as follows:

$LENGTH LENGTH - - Percentage Percentage = = \frac{X x}{{Σ Σ}_{i i = = 00}^{n no} (({Y Y}_{i i}))},, i i &Element; &Element; ((1,2,3,4,5,6 1,2,3,4,5,6))$

X是正确的读码方式中MLCDS的长度，Yi代表全部六种读码方式中各自的MLCDS长度.X is the length of MLCDS in the correct reading mode, and Yi represents the length of MLCDS in all six reading modes.

$SCORE SCORE - - dis dis tan the tan ce ce = = \frac{{Σ Σ}_{j j = = 00}^{n no} ((S S - - {E E.}_{j j}))}{55},, j j &Element; &Element; ((1,2,3,4,5 1,2,3,4,5))$

S是正确的读码方式中MLCDS的S-score，Ej代表剩下其他几种读码方式中MLCDS的S-score。S is the S-score of MLCDS in the correct reading method, and Ej represents the S-score of MLCDS in the remaining reading methods.

为了验证这些特征确实可以显著的用来区别编码和非编码序列，用已知的具有良好基因注释的编码与非编码RNA对上述四种特征在各自类型的转录集合中所计算出来的值做了一下统计，并且求出了这四个特征在编码和非编码集合中的均值（见表1）。In order to verify that these features can indeed be significantly used to distinguish coding and non-coding sequences, the values calculated for the above four features in their respective types of transcript sets were compared with known coding and non-coding RNAs with good gene annotations. Take a look at the statistics, and find the mean of these four features in the coding and non-coding sets (see Table 1).

表1Table 1

为了验证这四种特征在不同的物种中存在普遍的适用性，我们选择了基因注释最好的人类和小鼠这两个物种。由此可以看出这，四个特征在这两个物种里都具有很好的分类效果，它们在编码和非编码基因集合中所计算出的值有着明显的差别，所以这四个特征被我们选定用来训练分类模型。最终，我们用单个核苷酸三聚体在MLCDS区域中的出现频率定义了一个61维的特征（去掉了3个终止密码子）以此来补充完整CNCI的分类模型。核苷酸三聚体在CDS区域叫做密码子，它们是直接与翻译蛋白相关并且具有生物学意义的关键元素，既然蛋白质作为承载生物功能的具体物质，那么它们的组成部分一定不可能是随机的，不会出现20种氨基酸均匀的出现在多肽链上这种情况。一定有些组成生物体所必须的蛋白的密码子在CDS区域是被多次使用的，这些三聚体或者是密码子的使用频率一定远远高于20种氨基酸出现频率的平均值，而不经常被使用的则会出现相反的情况。我们还是先假设这些三聚体在非编码序列上的出现频率近似等于平均值，把这些三聚体在编码和非编码区域的出现频率也对应的做一个log2-ratio运算，那么更加倾向于在编码区域出现的三聚体的使用频率就会高于平均值，而倾向于在非编码区域出现的则是被使用较少的。我们用来做统计的数据集依然是hg19上的30507条编RNAs和Gencode(v11)上的18566条非编RNAs（图8）。纵坐标代表同一个密码子在编码区域出现频率相对于在非编码区域出现频率log-ration值，横坐标代表除去终止密码子之外的61中核苷酸三聚体,可以看出，对于编码或者非编码RNA来说，绝大多数的核苷酸三聚体都有着较为明显的偏好性，即便是偏好性不是很明显的少数几个核苷酸三聚体也会对整体的分类效果起到正向的加权效果，所以我们将这个61维的单个核苷酸三聚体偏好性特征补充加入到CNCI模型之中来加强分类效果。到此步为止，整个分类模型的全部特征都定义完毕。In order to verify the general applicability of these four features in different species, we selected the two species with the best gene annotations, human and mouse. It can be seen from this that the four features have a good classification effect in these two species, and their calculated values in the coding and non-coding gene sets have obvious differences, so these four features are used by us. Selected to train the classification model. Finally, we defined a 61-dimensional feature (with three stop codons removed) using the frequency of individual nucleotide trimers in the MLCDS region to complement the full CNCI classification model. Nucleotide trimers are called codons in the CDS region. They are key elements that are directly related to the translation of proteins and have biological significance. Since proteins are specific substances that carry biological functions, their components must not be random , there will be no such situation that 20 kinds of amino acids appear uniformly on the polypeptide chain. There must be some codons of proteins necessary for the composition of organisms that are used multiple times in the CDS region. The usage frequency of these trimers or codons must be much higher than the average frequency of 20 amino acids, and not often The opposite happens when used. We still assume that the frequency of occurrence of these trimers in non-coding sequences is approximately equal to the average value, and perform a log2-ratio operation on the frequency of occurrence of these trimers in coding and non-coding regions, so it is more inclined to be in Trimers that occur in coding regions are used more frequently than average, while trimers that tend to occur in non-coding regions are used less frequently. The data set we used for statistics is still 30507 encoded RNAs on hg19 and 18566 non-edited RNAs on Gencode (v11) (Figure 8). The ordinate represents the frequency of the same codon in the coding region relative to the log-ration value of the frequency in the non-coding region, and the abscissa represents the 61 nucleotide trimers except the stop codon. It can be seen that for coding or For non-coding RNA, the vast majority of nucleotide trimers have obvious preferences, and even a few nucleotide trimers with less obvious preferences will play a role in the overall classification effect. Positive weighting effect, so we added this 61-dimensional single nucleotide trimer preference feature to the CNCI model to strengthen the classification effect. So far, all the features of the entire classification model have been defined.

对人类编码和非编码RNA数据合集按照分析流程进行特征提取，然后将得到的特征序列分为正、负训练集并且整理成SVM所要求的输入格式，放入其中进行分类模型的训练。在本文中我们使用的SVM为LIBSVM3.0版本，训练模型时选取了RBF核函数，并且为了防止过学习现象的发生，所有的参数包括C和g都采用默认值。Feature extraction is performed on human coding and non-coding RNA data collections according to the analysis process, and then the obtained feature sequences are divided into positive and negative training sets and organized into the input format required by SVM, and put into it for training of classification models. In this paper, the SVM we use is LIBSVM3.0 version, and the RBF kernel function is selected when training the model, and in order to prevent the occurrence of over-learning phenomenon, all parameters including C and g adopt default values.

当然，本发明还可有其它多种实施例，在不背离本发明精神及其实质的情况下，熟悉本领域的技术人员当可根据本发明作出各种相应的改变和变形，但这些相应的改变和变形都应属于本发明所附的权利要求的保护范围。Certainly, the present invention also can have other multiple embodiments, without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and deformations according to the present invention, but these corresponding Changes and deformations should belong to the scope of protection of the appended claims of the present invention.

Claims

1. distinguish a method for encoding gene and Noncoding gene, it is characterized in that, comprising:

Step 1, is divided into positive and negative two training set by sample set according to coding and non-coding sequence, performs step 2 respectively to step 4 to positive and negative two training set;

Step 2, counts the frequency of occurrences X of each adjacent nucleotide tripolymer ANT in coded sequence, non-coding sequence and intergenic region sequence in training set _if also builds respectively and occurs frequency matrix, builds scoring matrix, wherein frequency of occurrences X based on three frequency of occurrences matrixes by log2-ratio computing _ithe computing formula of F is:

X_{i} N = Σ_{j = 1}^{n} S_{j} (X_{i})

T = Σ_{i = 1}^{m} X_{i} N = Σ_{i = 1}^{m} (Σ_{j = 1}^{n} S_{j} (X_{i}); m = 64 * 64; n = (1........ N)

X_{i} F = \frac{X_{i} N}{T}

Wherein X _irepresent the ANT of i-th certain type, S _j(X _i) be occurrence number in a certain bar sequence of X in a certain class arrangement set, X _in is the occurrence number of this kind of ANT in certain arrangement set whole, and T then represents the altogether occurrence number of the ANT of all kinds in certain arrangement set whole, and m represents the kind number of ANT, and n represents the sequence number comprised in the set of the type;

Step 3, the mode that described scoring matrix utilizes moving window to carry out giving a mark calculates window score value S-score, in this, as first feature of disaggregated model, and in the array using dynamic programming algorithm to find out respectively to be converted to by the coded sequence in described sample set and non-coding sequence have maximum subsegment and region as feature subsequence MLCDS, and using the length of described MLCDS as second of disaggregated model feature;

Step 4, utilizes obtain the 3rd feature of disaggregated model, wherein X is the length of MLCDS in reading code mode, and Yi represents MLCDS length respective in whole six kinds of reading code modes;

Utilize obtain the 4th feature of disaggregated model, wherein S is the S-score of the MLCDS extracted according to correct reading code mode in the six kinds of reading code modes had altogether at nucleotide sequence, the S-score of the MLCDS extracted in remaining other the five kinds wrong reading code modes of Ej representative;

Utilize single core thuja acid tripolymer to carry out log2-ratio computing in the frequency of occurrences of coding and non-coding region, obtain five feature of nucleotide tripolymer Preference as disaggregated model;

Step 5, utilizes five positive and negative two incompatible train classification models of set of eigenvectors of feature composition of described positive and negative two training set, treats that distinguishing sequence utilizes described disaggregated model to carry out prediction and obtains distinguishing result.

2. the method distinguishing encoding gene and Noncoding gene as claimed in claim 1, it is characterized in that, described step 2 comprises:

Step 21, uses moving window to scan according to the mode of six frame reading codes every bar transcript sequence of coded sequence and non-coding sequence respectively;

Step 22, described scoring matrix can be given a mark to each subwindow of described moving window in the process of above-mentioned scanning, the transcript that one is made up of nucleotide sequence is converted into six arrays, the element in described array is exactly the window score value of each subwindow;

Step 23, utilize in dynamic programming algorithm ask maximum subsegment and each array in described six arrays of mode in find out one and add and maximum subsegment, obtain six candidate's largest fields;

Step 24, described scoring matrix find out in described six candidate's largest fields maximum that of score value as this transcript as the feature subsequence in CDS region.

3. the method distinguishing encoding gene and Noncoding gene as claimed in claim 2, is characterized in that, adds with maximum subsegment x computing formula to be in described step 23:

x = \underset{1 \leq i \leq j \leq n}{m a x} {Σ_{k = i}^{j} a [k]}

A [k] be have this maximum subsegment and maximum subsegment, i and j represents the initial sum final position of this maximum subsegment of a [k] in this reading code mode respectively.

4. the method distinguishing encoding gene and Noncoding gene as claimed in claim 1, it is characterized in that, in described step 5: treat that distinguishing sequence is divided into positive and negative training set by described, then described positive and negative training set is converted to input format required by support vector machines, and put it into the training that SVM carries out disaggregated model, obtain distinguishing result.

5. distinguish a system for encoding gene and Noncoding gene, it is characterized in that, comprising:

Pretreatment module, for sample set is divided into positive and negative two training set according to coding and non-coding sequence, performs frequency statistics module to characteristic extracting module respectively to positive and negative two training set;

Frequency statistics module, for counting the frequency of occurrences X of each ANT in coded sequence, non-coding sequence and intergenic region sequence in training set _if also builds respectively and occurs frequency matrix, builds scoring matrix, wherein frequency of occurrences X based on three frequency of occurrences matrixes by log2-ratio computing _ithe computing formula of F is:

X_{i} N = Σ_{j = 1}^{n} S_{j} (X_{i})

T = Σ_{i = 1}^{m} X_{i} N = Σ_{i = 1}^{m} (Σ_{j = 1}^{n} S_{j} (X_{i}); m = 64 * 64; n = (1........ N)

X_{i} F = \frac{X_{i} N}{T}

Sequential extraction procedures module, the mode that described scoring matrix utilizes moving window to carry out giving a mark calculates window score value S-score, in this, as first feature of disaggregated model, and in the array using dynamic programming algorithm to find out respectively to be converted to by the coded sequence of described sample sequence and non-coding sequence have maximum subsegment and region as feature subsequence MLCDS, using the length of described MLCDS as second of disaggregated model feature;

Characteristic extracting module, utilizes obtain the 3rd feature of disaggregated model, wherein X is the length of MLCDS in reading code mode, and Yi represents MLCDS length respective in whole six kinds of reading code modes;

Utilize obtain the 4th feature of disaggregated model, wherein S is the S-score of MLCDS in the S-score of MLCDS in reading code mode, Ej representative other several reading code modes remaining;

Distinguish result and obtain module, utilize five positive and negative two incompatible train classification models of set of eigenvectors of feature composition of described positive and negative two training set, treat that distinguishing sequence utilizes described disaggregated model to carry out prediction and obtains distinguishing result.

6. the system distinguishing encoding gene and Noncoding gene as claimed in claim 5, it is characterized in that, described sequential extraction procedures module comprises:

Scan module, uses moving window to scan according to the mode of six frame reading codes every bar transcript sequence of coded sequence and non-coding sequence respectively;

Scoring modules, described scoring matrix can be given a mark to each subwindow of described moving window in the process of above-mentioned scanning, the transcript that one is made up of nucleotide sequence is converted into six arrays, the element in described array is exactly the window score value of each subwindow;

Candidate's field acquisition module, utilize in dynamic programming algorithm ask maximum subsegment and each array in described six arrays of mode in find out one and add and maximum subsegment, obtain six candidate's largest fields;

Sequence selection module, described scoring matrix find out in described six candidate's largest fields maximum that of score value as this transcript as the feature subsequence in CDS region.

7. the system distinguishing encoding gene and Noncoding gene as claimed in claim 6, is characterized in that, adds with maximum subsegment x computing formula to be in described candidate's field acquisition module:

x = \underset{1 \leq i \leq j \leq n}{m a x} {Σ_{k = i}^{j} a [k]}

8. the system distinguishing encoding gene and Noncoding gene as claimed in claim 5, it is characterized in that, described differentiation result obtains in module: treat that distinguishing sequence is divided into positive and negative training set by described, then described positive and negative training set is converted to input format required by support vector machines, and put it into the training that SVM carries out disaggregated model, obtain distinguishing result.