CN118522342A

CN118522342A - Active framework-based targeted antigen peptide sequence generation and screening method

Info

Publication number: CN118522342A
Application number: CN202410667529.2A
Authority: CN
Inventors: 段宏亮; 吴志鹏; 徐玟; 宋英
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2024-05-28
Filing date: 2024-05-28
Publication date: 2024-08-20

Abstract

The invention discloses a method for generating and screening targeted antigen peptide sequences based on an active skeleton, comprising the following steps: acquiring a source domain dataset of protein structures and a target domain dataset of pHLA I complex structures; pre-training a protein sequence generation model on the source domain dataset and fine-tuning it on the target domain dataset to obtain new parameters and a fine-tuned model; obtaining an independent test dataset of pHLA I complex structures; testing the pre-fine-tuning model and the fine-tuned model on the target domain dataset; evaluating the sequences from both models with a structure prediction model; creating a scoring system for screening antigen peptides based on a pHLA I affinity prediction model and an interaction interface analysis tool; taking two HLA I targets as examples, generating antigen peptides with the fine-tuned model; and screening the generated antigen peptides with the scoring system. The invention offers improved generation of targeted antigen peptide sequences and provides a scoring system for screening antigen peptides with higher affinity.

Description

A method for generating and screening targeted antigen peptide sequences based on active skeletons

Technical Field

The present invention belongs to the technical field of peptide drug research and development, and specifically relates to a method for generating and screening targeted antigen peptide sequences based on an active skeleton.

Background Art

In human cells, peptides are short amino acid chains fewer than 50 residues long that play key roles such as signal transduction and antibacterial activity. Beyond these natural functions, peptides are also used to complement antibodies and small molecules as a therapeutic modality. Like antibodies, peptides can bind protein surfaces with high affinity and high selectivity, and they can reach disease-related targets that antibodies and small molecules currently cannot act on, the so-called "undruggable space". Compared with traditional antibody vaccines, peptide vaccines have several advantages: they are easy to synthesize and purify, inexpensive, and very effective at inducing CD8 or CD4 T cell responses in the human body. However, designing and screening peptides experimentally is time-consuming and labor-intensive, so there is an urgent need for tools and screening methods that can generate targeted antigen peptides quickly and reliably in order to accelerate related research.

Traditional peptide and protein design relies on repeated evaluation of physical energy functions. These physics-based methods depend on the accuracy of the energy functions and require guidance from structural biology expertise, which limits both the accuracy and the efficiency of the designs. The rapid development of deep learning has brought revolutionary change to computational protein design. Deep learning models can, without any energy function, model the conditional distribution of amino acid sequences given a backbone structure and autonomously generate sequences that fold into that backbone, enabling fast and accurate design. The rapid emergence of methods such as SPROF, ProteinMPNN, PiFold, and ProDesignLE further demonstrates the potential and effectiveness of deep learning in structural biology. However, applying protein design models to the generation of targeted antigen peptides for a specific target has limitations. Protein sequence generation models are usually trained on a broad range of protein types, so their generalization ability is limited when they are used to generate peptides for a specific target. Although such training covers a wide range, it makes it difficult for the model to capture the deeper, task-specific information accurately. In addition, peptide design for specific targets faces the challenge that available experimental data are scarce.

Summary of the Invention

In view of the above problems, the purpose of the present invention is to provide a method for generating and screening targeted antigen peptide sequences based on an active skeleton, demonstrated concretely with the HLA I target as an example.

The specific technical solution is as follows:

A method for generating and screening targeted antigen peptide sequences based on an active skeleton comprises the following steps:

Step 1: Obtain protein structure data from the RCSB Protein Data Bank database, restricted to entries with a resolution below the stated cutoff [value missing in the source text] and fewer than 10,000 amino acids, using August 2, 2021 as the cutoff date. Then cluster the obtained structures with the mmseqs2 clustering tool at a 30% sequence identity cutoff, and randomly divide the resulting protein structure source domain dataset into a training set, a validation set, and a test set.

Step 2: From the PDB files obtained in step 1, extract the N, Cα, C, and O atoms of the protein backbone, together with a virtual Cβ atom, to represent structural features. These backbone atoms serve as node features and, with a set of one-hot encodings added as additional input, are fed into the encoder. At the same time, the distances between the backbone atoms and the atoms of the 48 nearest amino acids in Euclidean space are fed into the encoder as edge features. These features are continuously updated by a three-layer MPNN (message passing neural network) as the model is trained by backpropagation;

Step 3: Feed the node features and edge features processed by the three-layer encoder, together with the encoded protein sequence processed by autoregressive masking, into the decoder. The decoder uses a disordered (random-order) decoding strategy: at each decoding step it randomly selects the next amino acid to predict, and at every prediction step the model uses all previously predicted amino acids so that full context is provided. For peptide generation against a fixed target, the model can treat the known part of the sequence as background and make effective inferences for structures with unknown regions. During training, model performance is optimized by minimizing the per-residue cross-entropy loss. In terms of decoding order, this disordered autoregressive decoding replaces the conventional sequential decoding from the N-terminus to the C-terminus;

Step 4: Train and optimize the protein sequence generation model using the training set obtained in step 1;

Step 5: Taking the HLA I target as an example, obtain the target domain dataset of pHLA I complex structures from the RCSB Protein Data Bank database, and further screen and process the resulting PDB files to form a reorganized dataset. Split each genotype in this dataset into a training set, a validation set, and a test set at a ratio of 8:1:1. For genotypes with fewer than 10 entries, the PDB entries are used only for training and validation and are excluded from the test set;

Step 6: Fine-tune the protein sequence generation model obtained in step 4 using the training set and validation set obtained in step 5, yielding a new model suited to generating antigen peptide sequences that target HLA I, and test both the pre-fine-tuning model and the fine-tuned model on the test set;

Step 7: Combine the sequences generated by the pre-fine-tuning model and by the fine-tuned model with the corresponding target sequences to form pHLA I complexes, input the pHLA I complex sequences into the structure prediction model AlphaFold2, and use the structural confidence scores to assess each designed peptide's structure recovery, its structural similarity to the natural peptide, and its ability to bind the target;

Step 8: Create a scoring system for screening antigen peptides using the pHLA I affinity prediction model MHCfovea and the target-receptor interaction interface analysis tool PDBePISA;

Step 9: Taking two HLA I targets as examples, generate 20 antigen peptides for each target with the targeted antigen peptide sequence generation model obtained in step 6. Then apply the scoring system obtained in step 8: with the MHCfovea model deployed on a local computer, input the generated antigen peptide sequences and the corresponding target sequences into the affinity prediction model, compute predicted binding probabilities for them alongside the natural peptide sequence, and select the designed peptides whose binding probability exceeds that of the natural peptide;

Step 10: Form complexes from the designed peptides selected in step 9 and the target, input them into the target-receptor interaction interface analysis tool, predict the Gibbs free energy and the interaction contact area for the designed peptides and the natural peptide, and select the designed peptides whose Gibbs free energy is lower than that of the natural peptide.
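The structural inputs described in step 2 can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the virtual Cβ coefficients are the idealized-geometry constants found in the public ProteinMPNN code (the text above does not state them), neighbors are ranked by Cα distance and exclude the residue itself (both assumptions), and the function names are hypothetical.

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def virtual_cb(n, ca, c):
    """Place an idealized C-beta from backbone N, C-alpha and C coordinates.

    Assumption: coefficients follow the public ProteinMPNN code.
    """
    b = _sub(ca, n)
    c_vec = _sub(c, ca)
    a = _cross(b, c_vec)
    return tuple(
        -0.58273431 * a[i] + 0.56802827 * b[i] - 0.54067466 * c_vec[i] + ca[i]
        for i in range(3)
    )

def knn_edges(ca_coords, k=48):
    """For each residue, list (neighbor_index, distance) for its k nearest residues.

    Assumption: neighbors are ranked by C-alpha distance and exclude the residue itself.
    """
    edges = []
    for i, p in enumerate(ca_coords):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(ca_coords) if j != i
        )
        edges.append([(j, d) for d, j in ranked[:k]])
    return edges
```

For a short peptide bound to an HLA I molecule, k=48 simply means each residue is connected to its 48 closest residues in the complex; chains shorter than k+1 residues get fewer edges.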

As a technical solution of the present invention, the protein structure dataset was pruned with reference to the corresponding dataset used in the relevant literature, giving 23,358 training samples, 1,464 validation samples, and 1,539 test samples. The pHLA I complex structure dataset was likewise adjusted with reference to the dataset in the relevant literature, giving 438 training samples, 67 validation samples, and 51 test samples.
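The per-genotype 8:1:1 split of the pHLA I dataset can be sketched as follows. This is an illustration under stated assumptions: the function name and the input layout (a list of `(pdb_id, genotype)` pairs) are hypothetical, integer division is used to size the validation and test portions, and the way small genotypes are divided between training and validation (one validation entry when at least two entries exist) is an assumption, since the text only says they are excluded from the test set.

```python
import random
from collections import defaultdict

def split_by_genotype(entries, min_entries_for_test=10, seed=0):
    """Split (pdb_id, genotype) pairs roughly 8:1:1 within each genotype."""
    rng = random.Random(seed)
    by_genotype = defaultdict(list)
    for pdb_id, genotype in entries:
        by_genotype[genotype].append(pdb_id)

    splits = {"train": [], "val": [], "test": []}
    for genotype in sorted(by_genotype):
        ids = sorted(by_genotype[genotype])
        rng.shuffle(ids)
        n = len(ids)
        if n >= min_entries_for_test:
            n_val = n_test = n // 10
        else:
            # Assumption: small genotypes contribute no test entries.
            n_val = 1 if n >= 2 else 0
            n_test = 0
        splits["test"] += ids[:n_test]
        splits["val"] += ids[n_test:n_test + n_val]
        splits["train"] += ids[n_test + n_val:]
    return splits
```

Splitting inside each genotype keeps every well-represented allele present in all three sets, which a single global shuffle would not guarantee.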

As a technical solution of the present invention, the model is first pre-trained on the protein structure dataset and then fine-tuned on the pHLA I complex structure dataset. Genotypes with fewer than 10 entries do not take part in training. The model is implemented in PyTorch. Both pre-training and fine-tuning use the Adam optimizer with a learning rate of 5×10^-4 and the cross-entropy loss function. During training, early stopping is applied based on the validation loss: if the validation loss does not improve within 10 epochs, the learning rate is reduced by a factor of 10.
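The learning-rate rule stated above (start at 5×10^-4, divide by 10 after 10 epochs without validation improvement) can be written as a small framework-free sketch. The class name and the exact way stale epochs are counted are assumptions; PyTorch ships a comparable scheduler, `torch.optim.lr_scheduler.ReduceLROnPlateau`, but the text does not say whether it was used.

```python
class PlateauSchedule:
    """Track validation loss and cut the learning rate when it stops improving."""

    def __init__(self, lr=5e-4, patience=10, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                # Assumption: the counter restarts after each reduction.
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

In a training loop, the returned value would be written into the optimizer's learning rate at the end of each epoch.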

The beneficial effects of the present invention are:

1) The present invention proposes a method for generating and screening targeted antigen peptide sequences based on an active skeleton. Compared with existing methods, it has better sequence design capability and a precise screening workflow, and can generate and select targeted antigen peptides with higher affinity.

2) The present invention proposes a scoring system for screening antigen peptides that combines the pHLA I affinity prediction model MHCfovea with the target-receptor interaction interface analysis tool PDBePISA, and can quickly select designed peptides whose affinity is higher than that of the natural peptide.

3) The present invention designs a transfer-learning-based training method for the targeted antigen peptide sequence generation model. The model is first trained on protein structure data and then fine-tuned on pHLA I complex data, which further improves its generation performance.

Brief Description of the Drawings

Figure 1 is a flow chart of the method for generating and screening targeted antigen peptide sequences;

Figure 2 is a training flow chart of the transfer-learning-based targeted antigen peptide sequence generation model.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings and the embodiments. The described embodiments are only some of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Example

As shown in Figures 1 and 2, a method for generating and screening targeted antigen peptide sequences based on an active skeleton comprises the following steps:

A source domain dataset of protein structures is obtained from the RCSB Protein Data Bank database and the resulting PDB files are further processed, giving 23,358 training samples, 1,464 validation samples, and 1,539 test samples. From these PDB files, the N, Cα, C, and O atoms of the protein backbone and a virtual Cβ atom are extracted to represent structural features; these backbone atoms serve as node features and, with a set of one-hot encodings added as additional input, are fed into the encoder. At the same time, the distances between the backbone atoms and the atoms of the 48 nearest amino acids in Euclidean space are fed into the encoder as edge features, and these features are continuously updated as the model is trained by backpropagation. The node features and edge features processed by the three-layer encoder, together with the encoded protein sequence processed by autoregressive masking, are input into the decoder. The decoder uses a disordered (random-order) decoding strategy, randomly selecting at each decoding step the next amino acid to predict. At every prediction step the model uses all previously predicted amino acids so that full context is provided. The protein sequence generation model ProteinMPNN is trained and optimized on the resulting training set.

Taking the HLA I target as an example, the target domain dataset of pHLA I complex structures is obtained from the RCSB Protein Data Bank database, and the resulting PDB files are further screened and processed to form a reorganized dataset. Each genotype in this dataset is split into a training set, a validation set, and a test set at a ratio of 8:1:1. For genotypes with fewer than 10 entries, the PDB entries are used only for training and validation and are excluded from the test set. The final dataset contains 438 training samples, 67 validation samples, and 51 test samples. The protein sequence generation model is fine-tuned on this training set and validation set to obtain a new targeted antigen peptide sequence generation model, and both the pre-fine-tuning model and the fine-tuned model are tested on the test set. The sequences generated by the two models are combined with the corresponding target sequences to form pHLA I complexes, the pHLA I complex sequences are input into the structure prediction model AlphaFold2, and the structural confidence scores are used to assess each designed peptide's structure recovery, its structural similarity to the natural peptide, and its ability to bind the target.

A scoring system for screening antigen peptides is created using the pHLA I affinity prediction model MHCfovea and the target-receptor interaction interface analysis tool PDBePISA. Taking two HLA I targets as examples, 20 antigen peptides are generated for each target with the targeted antigen peptide sequence generation model. Using the scoring system, the generated antigen peptide sequences and the corresponding target sequences are input into the affinity prediction model, predicted binding probabilities are computed alongside the natural peptide sequence, and the designed peptides whose binding probability exceeds that of the natural peptide are selected. The selected designed peptides are combined with the target to form complexes, which are input into the target-receptor interaction interface analysis tool to predict the Gibbs free energy and the interaction contact area for the designed peptides and the natural peptide, and the designed peptides whose Gibbs free energy is lower than that of the natural peptide are selected.
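The random-order decoding described in this example can be sketched without any model weights. In the sketch below, `uniform_model` is a hypothetical stand-in for the trained decoder (a real one would condition on the encoded structure), and shuffling the unknown positions up front is assumed to be equivalent to picking the next position at random at each step. Positions passed in `fixed` play the role of the known background sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def uniform_model(position, context):
    """Hypothetical stand-in for the decoder: one probability per amino acid."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def decode_random_order(length, predict, fixed=None, seed=0):
    """Fill a sequence position by position in a random order.

    predict(position, context) returns {amino_acid: probability}, where
    context maps every already decided position to its residue.
    """
    rng = random.Random(seed)
    sequence = dict(fixed or {})  # known background residues stay untouched
    order = [i for i in range(length) if i not in sequence]
    rng.shuffle(order)  # random order instead of N-terminus to C-terminus
    for pos in order:
        probs = predict(pos, dict(sequence))
        residues, weights = zip(*probs.items())
        sequence[pos] = rng.choices(residues, weights=weights, k=1)[0]
    return "".join(sequence[i] for i in range(length))
```

For example, `decode_random_order(9, uniform_model, fixed={1: "L", 8: "V"})` samples a 9-residue peptide while keeping positions 1 and 8 fixed.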

The sequence evaluation metrics used in the present invention are sequence recovery rate and perplexity. Sequence recovery rate measures the model's ability to recover the natural sequence, and perplexity evaluates how certain the model is when predicting the native amino acids of a sample. The structural evaluation metrics used are pLDDT, pTM, and ipTM. pLDDT is a confidence measure based on lDDT that reflects local confidence in the structure and is used to assess confidence within a single domain. pTM measures the accuracy of the overall protein complex structure. ipTM reflects the quality of the predicted protein complex interaction interface and the binding potential between the target and the receptor.
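As an illustration of the two sequence metrics, the sketch below uses their standard textbook definitions: recovery as the fraction of positions that match the native sequence, and perplexity as the exponential of the mean negative log-probability assigned to the native residues. The text does not give its exact formulas, so the values in Table 1 may follow a different convention; the function names and the example peptides are illustrative only.

```python
import math

def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    return sum(d == n for d, n in zip(designed, native)) / len(native)

def perplexity(native_probs):
    """exp(mean negative log-likelihood) of the native residues.

    native_probs: probability the model assigns to the native residue
    at each position (each value in (0, 1]).
    """
    nll = -sum(math.log(p) for p in native_probs) / len(native_probs)
    return math.exp(nll)
```

Under this definition a model that always assigns probability 0.5 to the native residue has perplexity 2, and lower values mean higher certainty.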

In one embodiment of the present invention, taking the HLA I target as an example, the fine-tuned targeted antigen peptide sequence generation model is compared with the protein sequence generation model before fine-tuning; the results are shown in Tables 1 and 2. On the sequence evaluation metrics, the fine-tuned model outperforms the pre-fine-tuning model: its sequence recovery rate is higher (0.517 versus 0.254) and its perplexity is lower (0.486 versus 1.022), indicating that the fine-tuned model generates antigen peptides more similar to natural peptides. On the structural evaluation metrics, the fine-tuned model performs on par with the protein sequence generation model, which indicates that the antigen peptides generated by the pre-trained model of the present invention can accurately recover the required structure.

Table 1. Performance comparison of the fine-tuned model and the pre-fine-tuning model on the sequence generation task

Method         Sequence recovery rate   Perplexity
ProteinMPNN    0.254                    1.022
This method    0.517                    0.486

Table 2. Performance comparison of the fine-tuned model and the pre-fine-tuning model on structure recovery for the generated sequences

Method         pLDDT     pTM      ipTM
ProteinMPNN    97.488    0.937    0.904
This method    97.533    0.939    0.902

In one embodiment of the present invention, taking the HLA-A*02:01 target and the HLA-B*27:05 target as examples, the fine-tuned targeted antigen peptide sequence generation model is applied to generating antigen peptides that target HLA I. After 20 antigen peptides are generated for each target, the pHLA I affinity prediction model MHCfovea is used to predict binding probabilities; the results are shown in Table 3. The present invention achieves good results on the task of generating high-affinity antigen peptides targeting HLA I.

Table 3. Performance of the HLA I-targeting antigen peptides generated by the fine-tuned model on the binding probability prediction task

In one embodiment of the present invention, taking the HLA-A*02:01 target and the HLA-B*27:05 target as examples, the designed peptides whose predicted binding probability exceeds that of the natural peptide are selected from the 20 antigen peptides generated by the fine-tuned model and input into the target-receptor interaction interface analysis tool; the results are shown in Tables 4 and 5. Finally, the designed peptides whose Gibbs free energy is lower than that of the natural peptide, that is, whose affinity is stronger than that of the natural peptide, are selected. This shows that the present invention can quickly and accurately select designed peptides with stronger affinity than the natural peptide.

Table 4. Performance of the HLA-A*02:01-targeting antigen peptides generated by the fine-tuned model on the interaction prediction task

Table 5. Performance of the HLA-B*27:05-targeting antigen peptides generated by the fine-tuned model on the interaction prediction task
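The two-stage screening described above (binding probability first, Gibbs free energy second) reduces to two comparisons against the natural peptide. The sketch below assumes the scores have already been produced by external tools such as MHCfovea and PDBePISA and does not call either tool; the `Peptide` record, the function names, and any numbers passed in are hypothetical placeholders, not results from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Peptide:
    sequence: str
    binding_probability: float       # predicted by an affinity model
    delta_g: Optional[float] = None  # interface free energy, filled in later

def pass_binding_filter(designs, native):
    """Stage 1: keep designs predicted to bind better than the natural peptide."""
    return [p for p in designs
            if p.binding_probability > native.binding_probability]

def pass_free_energy_filter(designs, native):
    """Stage 2: keep designs with lower free energy, most favorable first."""
    kept = [p for p in designs
            if p.delta_g is not None and p.delta_g < native.delta_g]
    return sorted(kept, key=lambda p: p.delta_g)
```

Running the cheaper sequence-based filter first means the structure-based interface analysis only has to be performed for the designs that survive stage 1.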

The above is only a preferred embodiment of the present invention and does not limit the present invention in any way. Any equivalent replacement, modification, or other change that a person skilled in the art makes to the technical solution and technical content disclosed herein without departing from the scope of the technical solution of the present invention, or any direct or indirect application of it to other related technical fields, does not depart from the technical solution of the present invention and remains within its scope of protection.

Claims

1. A method for generating and screening a targeting antigen peptide sequence based on an active skeleton, characterized in that it comprises the following steps:

Step 1: Obtain the source domain dataset of protein structure from the RCSB Protein Data Bank database and divide it into training set, validation set, and test set;

Step 2: Extract the backbone structure features of the protein from the PDB file obtained in step 1, use these backbone atoms as node features, add a set of one-hot encodings as additional input, and input them into the encoder. At the same time, select the distance between the backbone atoms and the 48 amino acid atoms closest to them in Euclidean space and input them into the encoder as edge features. These features are continuously updated during the back propagation process of the model.

Step 3: Input the node features and edge features processed by the three-layer encoder and the encoded protein sequence processed by the autoregressive masking technology into the decoder. The decoder uses a disordered decoding strategy to randomly select the next amino acid to be predicted in each decoding step. In any prediction step, the model will use all previously predicted amino acid information to ensure that a complete context is provided.

Step 4: Use the training set obtained in step 1 to train and optimize the protein sequence generation model;

Step 5: Taking the HLA I target as an example, obtain the target domain dataset of pHLA I complex structures from the RCSB Protein Data Bank database, and further screen and process the obtained PDB files to form a reorganized dataset. Split each genotype in the dataset into a training set, a validation set, and a test set at a ratio of 8:1:1. For genotypes with fewer than 10 entries, the PDB entries are used only for training and validation and are not included in the test set.

Step 6: Use the training set and validation set obtained in step 5 to fine-tune the protein sequence generation model obtained in step 4 to obtain a new target antigen peptide sequence generation model, and use the pre-fine-tuning model and the post-fine-tuning model to test on the test set;

Step 7: Use the sequences generated by the pre-fine-tuning model and the post-fine-tuning model and the corresponding target sequence to form a pHLA I complex, input the pHLA I complex sequence into the structure prediction model AlphaFold2 for testing, and evaluate the structural recovery ability of the designed peptide, the structural similarity with the natural peptide, and the binding ability with the target according to the structural confidence score;

Step 8: Use the pHLA I affinity prediction model MHCfovea and the target-receptor interaction interface analysis tool PDBePISA to create a scoring system for screening antigen peptides;

Step 9: Taking two HLA I targets as examples, generate 20 antigen peptides for each target using the targeted antigen peptide sequence generation model obtained in step 6, input the generated antigen peptide sequences and the corresponding target sequences into the affinity prediction model using the scoring system obtained in step 8 to obtain the predicted binding probabilities, and screen out the designed peptides with a binding probability greater than that of the natural peptide;

Step 10: Use the designed peptides screened in step 9 to form a complex with the target and input it into AlphaFold2 to predict the structure. Input the complex structure into the target-receptor interaction interface analysis tool for testing to screen out designed peptides with Gibbs free energy less than that of natural peptides.

2. A method for generating and screening targeted antigen peptide sequences based on an active skeleton as described in claim 1, characterized in that the pHLA I complex structure target domain dataset is a small-sample dataset, and the protein structure source domain dataset is a large-sample dataset for a task associated with the target domain; the source domain dataset and the target domain dataset are obtained from the RCSB Protein Data Bank database and public literature.

3. A method for generating and screening a targeting antigen peptide sequence based on an active skeleton as described in claim 2, characterized in that in step 3, in terms of decoding order, a disordered autoregressive decoding method is used to replace the conventional sequential decoding method from N-terminus to C-terminus.