CN104484580B

CN104484580B - Antimicrobial peptide activity prediction method based on multi-label learning

Info

Publication number: CN104484580B
Application number: CN201410712399.6A
Authority: CN
Inventors: 周丰丰; 王普; 肖绚; 葛瑞泉; 刘记奎
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2014-11-28
Filing date: 2014-11-28
Publication date: 2017-08-25
Anticipated expiration: 2034-11-28
Also published as: CN104484580A

Abstract

A method for predicting antimicrobial peptide activity based on multi-label learning extracts the amino acid composition of a peptide sequence and then derives moment features from a physicochemical-property encoding of the sequence; together these form the feature vector of the peptide sequence. The feature vector of each peptide sequence therefore has two parts: the amino acid composition, and the moment features extracted from the physicochemical-property encoding. A least-squares multi-label learning algorithm computes the minimizing transformation matrix W; the output of a sample to be tested for each label is obtained from W, and the predicted class label set is obtained from those outputs. The activity of an antimicrobial peptide sequence is predicted quickly and accurately from this label set. Because the moment features capture the shape characteristics of the peptide sequence from various angles, antimicrobial peptide activity can be annotated quickly, accurately, and automatically.

Description

Antimicrobial Peptide Activity Prediction Method Based on Multi-label Learning

Technical Field

The invention relates to biomedical engineering, and in particular to a multi-label-learning-based method for predicting antimicrobial peptide activity that can annotate antimicrobial peptide activity quickly, accurately, and automatically.

Background

Antimicrobial peptides are small polypeptides involved in innate immunity, generally composed of 20 to 60 amino acid residues, with broad-spectrum, highly efficient bactericidal activity. Further research has shown that these peptides also have a strong killing effect on some fungi, protozoa, viruses, and cancer cells. This wide range of biological activities gives antimicrobial peptides good prospects for medical application.

Determining the activity of antimicrobial peptides experimentally, whether in vivo or in vitro, is both time-consuming and expensive. Researchers have proposed more than ten antimicrobial peptide predictors, but these tools essentially judge only whether a peptide is antimicrobial, that is, whether it belongs to the antimicrobial peptide family, and do not go on to predict its specific activities. Most existing methods are binary classification models that can only be used to identify antimicrobial peptides; the methods that do predict activity are limited to 5 activities, and their prediction accuracy needs further improvement.

Summary of the Invention

In view of this, there is a need for a multi-label-learning-based method for predicting antimicrobial peptide activity that can annotate antimicrobial peptide activity quickly, accurately, and automatically.

A method for predicting antimicrobial peptide activity based on multi-label learning comprises the following steps:

Extracting the amino acid composition corresponding to a peptide sequence, and obtaining the corresponding moment feature vector x according to the amino acid composition, wherein the moment feature vector x describes the shape characteristics of the peptide sequence from various angles;

Using a multi-label learning algorithm to compute the minimizing transformation matrix W according to the formula $W = (X^T X)^{-1} X^T Y$, wherein the class label vector of x is $y = [y_1, y_2, \ldots, y_c]^T$; W is the solution of $\min \lVert XW - Y \rVert$; c is the number of class labels, X is the training sample matrix, and Y is the class label matrix of the training samples, each row vector corresponding to one sample; for a sample x to be tested, its output for each label is then $f(x, y) = xW$;

Obtaining, from the label outputs $f(x, y) = xW$, the predicted class label set $h(x) = \{\, y \mid f(x, y) \ge 0,\ y \in \{1, 2, \ldots, c\} \,\}$.

In one embodiment, the step of extracting the amino acid composition and the moment feature vector x corresponding to the peptide sequence includes:

Numerically encoding the amino acid sequence according to physicochemical property indices of the amino acids;

Converting each amino acid residue of the amino acid sequence, one to one, into a value to form a numerical sequence;

Computing the moment feature vector x from the numerical sequence for the whole peptide sequence, its N-terminus, and its C-terminus, wherein the N-terminus refers to the first 5 amino acids of the peptide sequence and the C-terminus refers to the last 5 amino acids.

In one embodiment, the moment feature vector x includes the first-order raw moment, the second-order central moment, the third-order central moment, and the fourth-order central moment.

In one embodiment, in the class label vector $y = [y_1, y_2, \ldots, y_c]^T$, $y_i = 1$ indicates that sample x has class label i, and $y_i = -1$ indicates that sample x does not have class label i.

In one embodiment, it is determined whether $X^T X$ is invertible; if not, the generalized inverse of $X^T X$ is used instead.

In one embodiment, the method further includes optimizing the moment feature vector x with a genetic algorithm.

In one embodiment, the step of optimizing the moment feature vector x with a genetic algorithm includes:

Selecting the population size;

Encoding the chromosomes;

Selecting the fitness function fitness = Hamming loss + ranking loss + 1/10000 * number of features;

Using elite selection, wherein the 2 best individuals of the previous generation are carried directly into the next generation;

Selecting a crossover fraction of 0.8;

Terminating evolution when the fitness function value is essentially unchanged, and selecting the set of moment features corresponding to that point.

In one embodiment, by the time the population has evolved for 150 generations, the fitness function value is essentially unchanged.

In one embodiment, the multi-label-learning-based antimicrobial peptide activity prediction method is evaluated using Hamming loss, subset accuracy, ranking loss, coverage, one-error, and average precision.

In one embodiment, the method is evaluated by ten-fold cross-validation, and the reported result is the mean over 20 cross-validations.

The above method extracts the amino acid composition of a peptide sequence and then derives moment features from a physicochemical-property encoding; together these form the feature vector of the peptide sequence. The feature vector of each peptide sequence has two parts: the amino acid composition, and the moment features extracted from the physicochemical-property encoding. A least-squares multi-label learning algorithm computes the minimizing transformation matrix W; the output of a sample to be tested for each label is obtained from W, and the predicted class label set is obtained from those outputs. The activity of an antimicrobial peptide sequence is predicted quickly and accurately from this label set. Because the shape characteristics of the peptide sequence are captured from various angles, antimicrobial peptide activity can be annotated quickly, accurately, and automatically.

Brief Description of the Drawings

Figure 1 is a flowchart of the multi-label-learning-based antimicrobial peptide activity prediction method;

Figure 2 is a table of physicochemical property values for the 20 amino acids;

Figure 3 shows the evolution of the genetic feature selection.

Detailed Description

Figure 1 shows the flowchart of the multi-label-learning-based antimicrobial peptide activity prediction method.

Step S110: extract the amino acid composition corresponding to the peptide sequence, and obtain the corresponding moment feature vector x according to the amino acid composition, wherein the moment feature vector x describes the shape characteristics of the peptide sequence from various angles.

Step S110 includes:

Numerically encoding the amino acid sequence according to physicochemical property indices of the amino acids.

Converting each amino acid residue of the amino acid sequence, one to one, into a value to form a numerical sequence.

The moment feature vector x includes the first-order raw moment, the second-order central moment, the third-order central moment, and the fourth-order central moment.

Specifically, a peptide sequence is composed of the 20 amino acids; a sequence of length L, written from the N-terminus to the C-terminus, is expressed as:

$$P = R_1 R_2 R_3 R_4 \ldots R_L$$

From this sequence, the amino acid composition (AAC, the occurrence frequencies of the 20 amino acids) and the moment feature vector can be extracted. To extract the moment features, the sequence is first numerically encoded according to a physicochemical property index of the amino acids. Let $H_i$ ($i = 1, 2, \ldots, 20$) be the values of a given physicochemical property for the 20 amino acids; each amino acid residue of the sequence is then converted, one to one, into a value, giving $[H(R_1), H(R_2), \ldots, H(R_L)]$. Moment features are computed from this numerical sequence separately for the whole sequence, the N-terminus (first 5 amino acids), and the C-terminus (last 5 amino acids): the first-order raw moment (expectation), the second-order central moment (variance), the third-order central moment (skewness), and the fourth-order central moment (kurtosis). These moments reflect the shape characteristics of the sequence from different angles. In this embodiment, 5 physicochemical properties are used for amino acid encoding: hydropathy (hydropathy index), molecular weight, isoelectric point (pI), pK1 (alpha-COOH), and pK2 (NH3). The specific values are listed in Figure 2. After these steps, each amino acid sequence is expressed as a point, that is, a vector, in an 80-dimensional feature space:

$$x = [x_1, x_2, \ldots, x_{80}]^T$$

The dimensionality of the feature space may differ from 80, for example 40 or 120.
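The feature extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and because the values of Figure 2 are not reproduced in this text, the sketch uses only one property table, the Kyte-Doolittle hydropathy index, as a stand-in. With five property tables the same code would give 20 + 5 x 3 x 4 = 80 features, which matches the stated dimensionality, although the text does not spell out this breakdown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used as a stand-in for one property
# column of Figure 2 (the patent's own values are not reproduced here).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(seq):
    """Return the 20 occurrence frequencies (AAC) in AMINO_ACIDS order."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def moments(values):
    """Return [mean, 2nd, 3rd, 4th central moment] of a numeric sequence."""
    n = len(values)
    mean = sum(values) / n
    return [mean] + [sum((v - mean) ** k for v in values) / n for k in (2, 3, 4)]


def moment_features(seq, prop, terminal_len=5):
    """12 moment features for one property: whole, N-terminus, C-terminus."""
    encoded = [prop[r] for r in seq]
    feats = []
    for region in (encoded, encoded[:terminal_len], encoded[-terminal_len:]):
        feats.extend(moments(region))
    return feats


def feature_vector(seq, properties):
    """AAC followed by the moment features of every property table."""
    x = amino_acid_composition(seq)
    for prop in properties:
        x.extend(moment_features(seq, prop))
    return x


peptide = "GIGKFLHSAKKFGKAFVGEIMNS"  # illustrative input sequence
x = feature_vector(peptide, [HYDROPATHY])
print(len(x))  # 20 AAC values plus 12 moment features for the one property
```

The moments here are plain (unnormalized) central moments, following the wording of the text; a normalized skewness or kurtosis would be a different design choice.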

Step S120: use a multi-label learning algorithm to compute the minimizing transformation matrix W according to the formula $W = (X^T X)^{-1} X^T Y$, wherein the class label vector of x is $y = [y_1, y_2, \ldots, y_c]^T$; W is the solution of $\min \lVert XW - Y \rVert$; c is the number of class labels, X is the training sample matrix, and Y is the class label matrix of the training samples, each row vector corresponding to one sample; for a sample x to be tested, its output for each label is then $f(x, y) = xW$.

In the class label vector $y = [y_1, y_2, \ldots, y_c]^T$, $y_i = 1$ indicates that sample x has class label i, and $y_i = -1$ indicates that sample x does not have class label i.

It is determined whether $X^T X$ is invertible; if not, the generalized inverse of $X^T X$ is used instead.

Specifically, suppose there are c class labels in total (here c = 10), and the class label vector of sample x is $y = [y_1, y_2, \ldots, y_c]^T$, where $y_i = 1$ indicates that the sample has class label i and $y_i = -1$ indicates that it does not. A transformation matrix W is sought that minimizes the empirical risk on the training set, that is, $\min \lVert XW - Y \rVert$,

where X is the training sample matrix and Y is the class label matrix of the training samples, each row vector corresponding to one sample. The least-squares solution is:

$$W = (X^T X)^{-1} X^T Y$$
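For reference, this closed form follows from the usual normal equations. The derivation below assumes the norm is the Frobenius norm (the text does not name it) and minimizes its square, which does not change the minimizer; X is n x d, W is d x c, and Y is n x c.

```latex
\begin{aligned}
J(W) &= \lVert XW - Y \rVert_F^2
      = \operatorname{tr}\!\left[(XW - Y)^T (XW - Y)\right] \\
\nabla_W J &= 2\, X^T (XW - Y) = 0 \\
X^T X\, W &= X^T Y \\
W &= (X^T X)^{-1} X^T Y
      \qquad \text{when } X^T X \text{ is invertible.}
\end{aligned}
```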

If $X^T X$ is not invertible, its generalized inverse is used instead. For a sample x to be tested, its output for each label is then:

$$f(x, y) = xW$$

where x is written as a row vector, like the rows of X, and $f(x, y)$ is read as the component of $xW$ for label y. The predicted class label set follows from these outputs.

Step S130: obtain, from the label outputs $f(x, y) = xW$, the predicted class label set $h(x) = \{\, y \mid f(x, y) \ge 0,\ y \in \{1, 2, \ldots, c\} \,\}$.

In this embodiment, the transformation matrix W takes the value that minimizes the empirical risk on the training set, and serves as the classifier parameter of $f(x, y)$.
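Steps S120 and S130 can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the helper names (`fit`, `scores`, `predict`) are hypothetical, the normal equations are solved by Gauss-Jordan elimination, and the generalized-inverse fallback is not implemented (the sketch raises an error instead; a library pseudo-inverse routine such as NumPy's `numpy.linalg.pinv` could be substituted).

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, B, eps=1e-12):
    """Solve A W = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < eps:
            # The text uses a generalized inverse here; not implemented.
            raise ValueError("X^T X is singular; a generalized inverse is needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(X, Y):
    """Return W satisfying (X^T X) W = X^T Y; rows of X and Y are samples."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))


def scores(x, W):
    """Label outputs xW for one sample x given as a row vector."""
    return matmul([x], W)[0]


def predict(x, W):
    """h(x): labels (numbered from 1) whose output is >= 0."""
    return {y for y, s in enumerate(scores(x, W), start=1) if s >= 0}


# Toy data: 3 samples, 2 features, 2 labels encoded as +1 / -1.
X = [[1, 0], [0, 1], [1, 1]]
Y = [[1, -1], [-1, 1], [1, 1]]
W = fit(X, Y)
print(predict([1, 0], W), predict([1, 1], W))
```

The toy data is made up for illustration; in the method, each row of X would be an 80-dimensional feature vector and each row of Y a 10-element label vector.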

Because the moment feature vector reflects the shape characteristics of the peptide sequence from different angles, those characteristics are obtained together with the peptide sequence itself, so that antimicrobial peptide activity can be annotated quickly, accurately, and automatically.

Based on all of the above embodiments, the multi-label-learning-based antimicrobial peptide activity prediction method is evaluated using Hamming loss, subset accuracy, ranking loss, coverage, one-error, and average precision, in order to verify that the method is effective.
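The text names these six metrics without defining them. The sketch below uses definitions that are common in the multi-label learning literature, which is an assumption about what the authors computed. In the code, labels are 0-based indices, `true_sets` and `pred_sets` are lists of label sets, and `score_rows` holds the per-label outputs for each sample; every true label set is assumed to be neither empty nor complete, which agrees with the statement that each peptide has between 1 and 7 of the 10 activities.

```python
def ranks(score_row):
    """Map label index -> rank, where rank 1 is the highest score."""
    order = sorted(range(len(score_row)), key=lambda j: -score_row[j])
    return {j: r for r, j in enumerate(order, start=1)}


def hamming_loss(true_sets, pred_sets, c):
    """Fraction of sample-label pairs that are misclassified (lower is better)."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * c)


def subset_accuracy(true_sets, pred_sets):
    """Fraction of samples whose predicted set is exactly right (higher is better)."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def one_error(true_sets, score_rows):
    """Fraction of samples whose top-ranked label is not a true label."""
    misses = 0
    for t, s in zip(true_sets, score_rows):
        top = max(range(len(s)), key=lambda j: s[j])
        misses += top not in t
    return misses / len(true_sets)


def coverage(true_sets, score_rows):
    """Average depth in the ranking needed to cover all true labels, minus 1."""
    total = 0
    for t, s in zip(true_sets, score_rows):
        r = ranks(s)
        total += max(r[j] for j in t) - 1
    return total / len(true_sets)


def ranking_loss(true_sets, score_rows):
    """Average fraction of (true, false) label pairs that are mis-ordered."""
    total = 0.0
    for t, s in zip(true_sets, score_rows):
        neg = [j for j in range(len(s)) if j not in t]
        bad = sum(1 for p in t for q in neg if s[p] <= s[q])
        total += bad / (len(t) * len(neg))
    return total / len(true_sets)


def average_precision(true_sets, score_rows):
    """Average precision of the ranking at each true label (higher is better)."""
    total = 0.0
    for t, s in zip(true_sets, score_rows):
        r = ranks(s)
        total += sum(sum(1 for k in t if r[k] <= r[j]) / r[j] for j in t) / len(t)
    return total / len(true_sets)
```

Hamming loss and subset accuracy work on the thresholded label sets h(x), while the other four work on the raw outputs f(x, y), so both need to be kept when evaluating.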

To make the verification results accurate and reliable, the method is evaluated by ten-fold cross-validation, and the reported result is the mean over 20 cross-validations.
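One reading of this protocol is that the ten-fold split is redrawn 20 times and the results are averaged; the text does not say how the folds are drawn, so the shuffling and the seed below are assumptions. The function names and the `evaluate` callback, which would train on one index list and return a metric on the other, are hypothetical.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(n, evaluate, k=10, repeats=20, seed=0):
    """Mean of evaluate(train_idx, test_idx) over `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        folds = k_fold_indices(n, k, rng)
        fold_scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            fold_scores.append(evaluate(train_idx, test_idx))
        run_means.append(sum(fold_scores) / k)
    return sum(run_means) / repeats
```

In use, `evaluate` would fit W on the training rows and return, for example, the Hamming loss on the held-out rows.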

Based on all of the above embodiments, the initial set of moment features needs to be optimized in order to eliminate redundant features and improve classifier accuracy.

The multi-label-learning-based antimicrobial peptide activity prediction method therefore further includes optimizing the moment feature vector x with a genetic algorithm.

The step of optimizing the moment feature vector x with a genetic algorithm includes:

Selecting the population size. Typically, the population size is 50.

Encoding the chromosomes. Each chromosome is a 0-1 string of length 80, where 1 means the feature at the corresponding position is selected and 0 means it is not.

Selecting the fitness function fitness = Hamming loss + ranking loss + 1/10000 * number of features.

Using elite selection, wherein the 2 best individuals of the previous generation are carried directly into the next generation, with nothing else changed.

Selecting a crossover fraction of 0.8; that is, (50-2)*0.8 ≈ 38 individuals are produced by crossover, with the parents chosen by tournament selection.

Apart from the elite and crossover individuals, the remaining offspring are generated by mutation, using uniform mutation with a mutation probability of 0.1.

The population is evolved for 150 generations, by which point the fitness function value is essentially unchanged.
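The settings above can be put together in a short sketch. Several details are assumptions that the text does not specify: single-point crossover, a tournament size of 2, tournament selection of mutation parents, and uniform mutation read as resampling each gene with probability 0.1. The function `toy_losses` is a made-up stand-in for the cross-validated Hamming and ranking losses, used only so the sketch runs on its own.

```python
import random

N_FEATURES, POP_SIZE, N_ELITE = 80, 50, 2
CROSSOVER_FRACTION, MUTATION_RATE = 0.8, 0.1


def fitness(chrom, hamming_loss, ranking_loss):
    """Fitness from the text; lower is better."""
    return hamming_loss + ranking_loss + sum(chrom) / 10000


def toy_losses(chrom):
    # Hypothetical stand-in: pretend only the first 40 features matter.
    missing = 40 - sum(chrom[:40])
    return 0.01 * missing, 0.02 * missing


def score(chrom):
    return fitness(chrom, *toy_losses(chrom))


def tournament(pop, rng, size=2):
    """Pick `size` random individuals and return the fitter one."""
    return min(rng.sample(pop, size), key=score)


def crossover(a, b, rng):
    """Single-point crossover (an assumed operator)."""
    cut = rng.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]


def mutate(chrom, rng):
    """Uniform mutation: resample each gene with probability MUTATION_RATE."""
    return [rng.randint(0, 1) if rng.random() < MUTATION_RATE else g for g in chrom]


def next_generation(pop, rng):
    elite = sorted(pop, key=score)[:N_ELITE]              # 2 best, unchanged
    n_cross = round((POP_SIZE - N_ELITE) * CROSSOVER_FRACTION)  # 38
    n_mut = POP_SIZE - N_ELITE - n_cross                  # remaining 10
    children = [crossover(tournament(pop, rng), tournament(pop, rng), rng)
                for _ in range(n_cross)]
    mutants = [mutate(tournament(pop, rng), rng) for _ in range(n_mut)]
    return elite + children + mutants


def evolve(generations=150, seed=0):
    """Return the best chromosome and the best fitness after each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
    history = []
    for _ in range(generations):
        pop = next_generation(pop, rng)
        history.append(min(score(c) for c in pop))
    return min(pop, key=score), history
```

Because the two elite individuals are copied unchanged, the best fitness in `history` can never get worse from one generation to the next, which is what makes a plateau a usable stopping signal.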

Refer to Figure 3. After the population has evolved for 150 generations, the fitness function value essentially stops changing; the feature set corresponding to the best result at that point contains 40 features. Ten-fold cross-validation is performed on the data set in both the original feature space and the optimized feature space, and the test results are shown in the following table:

As the table above shows, after optimization the number of features is halved, yet every metric improves. Here ↓ means smaller is better and ↑ means larger is better. This shows that the genetic algorithm optimization strategy proposed in this embodiment performs well. As a computer-aided tool, the method can serve as an effective complement to biological experiments, greatly improving the efficiency of drug development while reducing cost.

Based on all of the above embodiments, the multi-label-learning-based antimicrobial peptide activity prediction method is carried out as follows:

Antimicrobial peptide sequences are obtained. The sequences correspond to 10 activities; each peptide sequence has at least 1 of these 10 activities and at most 7 at the same time.

First, the moment feature vectors are extracted.

The sequence is numerically encoded according to physicochemical property indices of the amino acids, and each amino acid residue of the peptide sequence is converted, one to one, into a value. Moment features are computed from this numerical sequence separately for the whole sequence, the N-terminus, and the C-terminus. These moment features reflect the shape characteristics of the peptide sequence from different angles. In this embodiment, 5 physicochemical properties are used for amino acid encoding.

Next, the transformation matrix W is computed with the least-squares multi-label learning algorithm.

Suppose there are c class labels in total (here c = 10), and the class label vector of sample x is $y = [y_1, y_2, \ldots, y_c]^T$, where $y_i = 1$ indicates that the sample has class label i and $y_i = -1$ indicates that it does not. A transformation matrix W is sought that minimizes the empirical risk on the training set, that is, $\min \lVert XW - Y \rVert$.

The output for each label is then:

$$f(x, y) = xW$$

and the predicted class label set is:

$$h(x) = \{\, y \mid f(x, y) \ge 0,\ y \in \{1, 2, \ldots, c\} \,\}$$

Finally, the class label set h(x) corresponding to the moment feature vector x can be traced back to the corresponding amino acid composition and to the peptide sequence with that composition. Because the moment feature vector reflects the shape characteristics of the peptide sequence from different angles, those characteristics are obtained together with the peptide sequence itself, so that antimicrobial peptide activity can be annotated quickly, accurately, and automatically.

When used in antimicrobial peptide drug development, the above method can be applied to molecular screening, and it may also suggest new uses for existing drugs.

Based on all of the above embodiments, multi-label learning algorithms other than least squares may also be used.

The multi-label-learning-based antimicrobial peptide activity prediction method extracts the amino acid composition of a peptide sequence and then derives moment features from a physicochemical-property encoding; together these form the feature vector of the peptide sequence. The feature vector of each peptide sequence has two parts: the amino acid composition, and the moment features extracted from the physicochemical-property encoding. A least-squares multi-label learning algorithm computes the minimizing transformation matrix W; the output of a sample to be tested for each label is obtained from W, and the predicted class label set is obtained from those outputs. The activity of an antimicrobial peptide sequence is predicted quickly and accurately from this label set. Because the shape characteristics of the peptide sequence are captured from various angles, antimicrobial peptide activity can be annotated quickly, accurately, and automatically.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, any combination that involves no contradiction should be considered within the scope of this specification.

The above embodiments express only several implementations of the invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and these all fall within its scope of protection. The scope of protection of this patent is therefore defined by the appended claims.

Claims

1. A method for predicting antimicrobial peptide activity based on multi-label learning, comprising the following steps:

extracting the amino acid composition corresponding to a peptide sequence, and obtaining the corresponding moment feature vector x according to the amino acid composition, wherein the moment feature vector x is used to describe the shape characteristics of the peptide sequence from various angles; the amino acid sequence is numerically encoded according to physicochemical property indices of the amino acids; each amino acid residue of the amino acid sequence is converted, one to one, into a value to form a numerical sequence; and the moment feature vector x is computed from the numerical sequence for the whole peptide sequence, its N-terminus, and its C-terminus, wherein the N-terminus refers to the first 5 amino acids of the peptide sequence and the C-terminus refers to the last 5 amino acids of the peptide sequence;

using a multi-label learning algorithm to compute the minimizing transformation matrix W according to the formula $W = (X^T X)^{-1} X^T Y$, wherein the class label vector of x is $y = [y_1, y_2, \ldots, y_c]^T$; W is the solution of $\min \lVert XW - Y \rVert$; $y_1, y_2, \ldots, y_c$ are the elements of the label vector, c is the number of class labels, X is the training sample matrix, and Y is the class label matrix of the training samples, each row vector corresponding to one sample; for a sample x to be tested, its output for each label is then $f(x, y) = xW$;

obtaining, from the label outputs $f(x, y) = xW$, the predicted class label set $h(x) = \{\, y \mid f(x, y) \ge 0,\ y \in \{1, 2, \ldots, c\} \,\}$.

2. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 1, characterized in that the moment feature vector x comprises the first-order raw moment, the second-order central moment, the third-order central moment, and the fourth-order central moment.

3. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 1, characterized in that, in the class label vector $y = [y_1, y_2, \ldots, y_c]^T$, $y_i = 1$ indicates that sample x has class label i, and $y_i = -1$ indicates that sample x does not have class label i.

4. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 1, characterized in that it is determined whether $X^T X$ is invertible, and if not, the generalized inverse of $X^T X$ is used instead.

5. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 1, further comprising optimizing the moment feature vector x by using a genetic algorithm.

6. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 5, characterized in that the step of optimizing the moment feature vector x with a genetic algorithm comprises:

selecting the population size;

encoding the chromosomes;

selecting the fitness function fitness = Hamming loss + ranking loss + 1/10000 * number of features;

using elite selection, wherein the 2 best individuals of the previous generation are carried directly into the next generation;

selecting a crossover fraction of 0.8;

terminating evolution when the fitness function value is essentially unchanged, and selecting the set of moment features corresponding to that point.

7. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 6, characterized in that, by the time the population has evolved for 150 generations, the fitness function value is essentially unchanged.

8. The antimicrobial peptide activity prediction method based on multi-label learning according to any one of claims 1-7, characterized in that the method is evaluated using Hamming loss, subset accuracy, ranking loss, coverage, one-error, and average precision.

9. The antimicrobial peptide activity prediction method based on multi-label learning according to claim 8, characterized in that the method is evaluated by ten-fold cross-validation, and the reported result is the mean over 20 cross-validations.