CN118629514B - Sequence immunogenicity prediction method, device, electronic device and storage medium - Google Patents

Info

Publication number
CN118629514B
Authority
CN
China
Prior art keywords
sequence
immunogenicity
features
feature
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411082472.6A
Other languages
Chinese (zh)
Other versions
CN118629514A (en
Inventor
刘成琨
倪晓天
王葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Orange Fan Pharmaceutical Co ltd
Original Assignee
Shanghai Orange Fan Pharmaceutical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Orange Fan Pharmaceutical Co ltd
Priority to CN202411082472.6A
Publication of CN118629514A
Application granted
Publication of CN118629514B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a sequence immunogenicity prediction method, device, electronic device and storage medium, and relates to the technical field of immunogenicity prediction. The method uses information from existing immunogenicity databases to mine immunogenicity-related features, establishes a linear relationship between the sequence features of known peptides and their immunogenicity, builds a sequence immunogenicity prediction model from that linear relationship, and then predicts the immunogenicity of a sequence to be predicted from its sequence features and the model. Compared with the prior art, the method improves the interpretability and accuracy of immunogenicity predictions, which helps meet regulatory requirements, improve research and development efficiency, improve the accuracy of safety evaluation and speed up bringing drugs to market.

Description

Sequence immunogenicity prediction method, device, electronic device and storage medium

Technical Field

The present invention relates to the technical field of immunogenicity prediction, and in particular to a sequence immunogenicity prediction method, device, electronic device and storage medium.

Background Art

Immunogenicity prediction refers to using bioinformatics methods to predict the immunogenicity of proteins or peptides, that is, the likelihood that they will induce an immune response. It can be applied to polypeptides, antibodies, peptide-drug conjugates (PDCs), antibody-drug conjugates (ADCs) and more. Immunogenicity prediction is of great significance in fields such as vaccine development, allergen identification and autoimmune disease research.

At present, there are two main approaches to immunogenicity prediction: sequence-based and spatial-structure-based. Sequence-based methods predict immunogenicity mainly by mining and extracting the patterns and features of the sequence itself, whereas structure-based methods rely on peptide-MHC crystal structures and scoring functions to predict peptide-MHC interactions, and from these predict immunogenicity.

However, although sequence-based methods have some advantage in accuracy, that accuracy depends on large amounts of high-quality training data, and these methods are black-box models with poor interpretability. Structure-based methods do not rely on training data and are more interpretable, but their prediction accuracy is poor. Existing immunogenicity prediction methods therefore struggle to achieve both prediction accuracy and interpretability.

Summary of the Invention

(I) Technical problem to be solved

In view of the deficiencies of the prior art, the present invention provides a sequence immunogenicity prediction method, device, electronic device and storage medium, which at least solve the problem that existing immunogenicity prediction methods cannot achieve both prediction accuracy and interpretability.

(II) Technical solution

To achieve the above objective, the present invention is implemented through the following technical solutions:

In a first aspect, the present invention provides a sequence immunogenicity prediction method, the method comprising:

Extracting sequence features of peptide sequences with known immunogenicity, and constructing a feature matrix based on the sequence features; the sequence features include k-mer features, where k is a positive integer;

Establishing a regression relationship between the feature matrix and immunogenicity, and obtaining the weights corresponding to the different sequence features based on the regression relationship;

Constructing a sequence immunogenicity prediction model based on the weights;

Predicting the immunogenicity of a sequence to be predicted based on the sequence features of the sequence to be predicted and the sequence immunogenicity prediction model.

In some embodiments, the sequence to be predicted is selected from a polypeptide or an antibody.

Further, antibodies include ADCs.

In some embodiments, the polypeptide comprises a peptide segment 8-13 amino acids in length.

In some embodiments, the sequence features further include CPP features of the peptide sequence and thermodynamic information features of the peptide sequence.

In some embodiments, the thermodynamic information features of the peptide sequence include the following properties of the protein sequence: aliphatic index, BLOSUM62 indices, Boman index, theoretical net charge, Cruciani properties, FASGAI vectors, hydrophobic moment, hydrophobicity index, instability index, Kidera factors, isoelectric point, protFP descriptors, VHSE scales, and Z-scales.
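As an illustration of one of the listed descriptors, the aliphatic index can be computed directly from amino acid composition. The sketch below assumes the commonly used Ikai definition (mole percentages of Ala, Val, Ile and Leu with coefficients 2.9 and 3.9); the source does not state which formula it uses, and the function name `aliphatic_index` is hypothetical.

```python
def aliphatic_index(sequence: str) -> float:
    """Aliphatic index of a protein sequence (assumed Ikai definition).

    AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X(.) is the mole percentage of the residue in the sequence.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)

    def mole_percent(residue: str) -> float:
        # Share of the sequence made up of this residue, in percent.
        return 100.0 * seq.count(residue) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))
```

The other descriptors in the list follow the same pattern of mapping a sequence to one or more numbers, so each contributes one or more columns to the feature matrix.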

In some embodiments, the regression relationship is a linear regression relationship.

Further, in one embodiment, obtaining the weights corresponding to the different sequence features based on the regression relationship includes: obtaining the weights corresponding to the different sequence features based on elastic net regression.

In some embodiments, extracting the k-mer features comprises:

Choosing the value of k, where k is the length of the peptide sequence fragments to be considered and k is a positive integer;

Starting from one end of a given protein sequence, traversing all k-amino-acid combinations in the protein sequence, and counting the types and numbers of all such amino acid combinations.

Further, in some embodiments, k is any integer from 2 to 10.

Still further, in one embodiment, k = 2 or k = 3.
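The k-mer extraction described above can be sketched as a sliding window over the sequence. This is a minimal illustration, assuming the k-amino-acid combinations are contiguous substrings (consistent with the k-mer definition given later in the description); the function name `kmer_counts` is hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every contiguous fragment of length k, scanning from one end."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n has n - k + 1 windows; none if it is shorter than k.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

For a 9-residue peptide and k = 2 this yields 8 windows, so the keys of the counter give the k-mer types and the values give their numbers.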

In some embodiments, extracting the CPP features of the peptide sequence comprises:

Globally aligning each peptide sequence with the TCR sequences, and using quantiles of the alignment scores as the CPP features.

Further, in some embodiments, each peptide sequence is globally aligned with the TCR sequences based on an amino acid scoring matrix.

Still further, in one embodiment, the 10% and 90% quantiles of the alignment scores are used as the CPP features.
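The CPP step can be sketched as a global (Needleman-Wunsch) alignment score computed against each TCR sequence, followed by taking the 10% and 90% quantiles of those scores. Everything specific here is an assumption for illustration: the match/mismatch score of +1/-1 stands in for a real amino acid scoring matrix, the gap penalty of -1 and the linear-interpolation quantile are arbitrary choices, and all function names are hypothetical.

```python
def global_alignment_score(a, b, score=lambda x, y: 1 if x == y else -1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]


def quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def cpp_features(peptide, tcr_sequences):
    """Return the 10% and 90% quantiles of the peptide-vs-TCR alignment scores."""
    scores = [global_alignment_score(peptide, t) for t in tcr_sequences]
    return quantile(scores, 0.10), quantile(scores, 0.90)
```

Passing a different `score` callable, for example a lookup into a substitution matrix, changes the scoring scheme without touching the alignment loop.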

In a second aspect, the present invention provides a sequence immunogenicity prediction device, the device comprising:

An immunogenicity prediction model building unit, configured to extract sequence features of peptide sequences with known immunogenicity and construct a feature matrix based on the sequence features; to establish a regression relationship between the feature matrix and immunogenicity and obtain the weights corresponding to the different sequence features based on the regression relationship; and to construct a sequence immunogenicity prediction model based on the weights; wherein the sequence features include k-mer features, and k is a positive integer;

An immunogenicity prediction unit, configured to predict the immunogenicity of a sequence to be predicted based on the sequence features of the sequence to be predicted and the sequence immunogenicity prediction model.

In some embodiments, the sequence to be predicted is selected from a polypeptide or an antibody.

Further, antibodies include ADCs.

In some embodiments, the polypeptide comprises a peptide segment 8-13 amino acids in length.

In some embodiments, the immunogenicity prediction model building unit further includes a CPP feature extraction subunit and a thermodynamics-related feature extraction subunit; the CPP feature extraction subunit is configured to extract the CPP features of the peptide sequence, and the thermodynamics-related feature extraction subunit is configured to extract the thermodynamic information features of the peptide sequence.

In some embodiments, the thermodynamic information features of the peptide sequence include the following properties of the protein sequence: aliphatic index, BLOSUM62 indices, Boman index, theoretical net charge, Cruciani properties, FASGAI (Factor Analysis Scales of Generalized Amino Acid Information) vectors, hydrophobic moment, hydrophobicity index, instability index, Kidera factors, isoelectric point (pI), protFP descriptors, VHSE scales, and Z-scales.

Further, in one embodiment, the thermodynamic information features of the peptide sequence are calculated using the corresponding functions in the Peptides package for R.

In some embodiments, the regression relationship is a linear regression relationship.

Further, in some embodiments, obtaining the weights corresponding to the different sequence features based on the regression relationship includes: obtaining the weights corresponding to the different sequence features based on elastic net regression.

Still further, in some embodiments, when the weights corresponding to the different sequence features are obtained based on elastic net regression, 5-fold cross-validation is used for parameter estimation.

In some embodiments, the immunogenicity prediction model building unit includes a k-mer feature extraction subunit, and extracting the k-mer features by the k-mer feature extraction subunit comprises:

Choosing the value of k, where k is the length of the peptide sequence fragments to be considered and k is a positive integer;

Starting from one end of a given protein sequence, traversing all k-amino-acid combinations in the protein sequence, and counting the types and numbers of all such amino acid combinations.

Further, in some embodiments, k is any integer from 2 to 10.

Still further, in one embodiment, k = 2 or k = 3.

In some embodiments, extracting the CPP features of the peptide sequence by the CPP feature extraction subunit comprises:

Globally aligning each peptide sequence with the TCR sequences, and using quantiles of the alignment scores as the CPP features.

Further, in some embodiments, each peptide sequence is globally aligned with the TCR sequences based on an amino acid scoring matrix.

Still further, in one embodiment, the 10% and 90% quantiles of the alignment scores are used as the CPP features.

In a third aspect, the present invention provides an electronic device comprising a processor, a communication interface, a memory and a bus, wherein the processor, the communication interface and the memory communicate with one another via the bus, and the processor can call logic instructions in the memory to execute the steps of the method provided in the first aspect.

In a fourth aspect, the present invention provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method provided in the first aspect.

(III) Beneficial effects

The present invention provides a sequence immunogenicity prediction method, device, electronic device and storage medium. Compared with the prior art, it has the following beneficial effects:

The sequence immunogenicity prediction method, device, electronic device and storage medium provided by the present invention use information from existing immunogenicity databases to mine immunogenicity-related features, establish a linear relationship between the sequence features of known peptide sequences and their immunogenicity, and use this relationship as the sequence immunogenicity prediction model. The immunogenicity of a sequence to be predicted is then predicted from its sequence features and this model. Compared with the prior art, the immunogenicity prediction results of the present invention are highly interpretable and accurate.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a sequence immunogenicity prediction method of the present invention;

FIG. 2 is a flow chart of constructing sequence features in Embodiment 1 of the present invention;

FIG. 3 is a flow chart of feature extraction in Embodiment 1 of the present invention;

FIG. 4 is a curve of the AUC during cross-validation in Embodiment 1 of the present invention;

FIG. 5 is a schematic structural diagram of a sequence immunogenicity prediction device in Embodiment 2 of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device for sequence immunogenicity prediction in Embodiment 3 of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

An antibody-drug conjugate (ADC) combines the specificity and targeting of an antibody with the efficacy of a cytotoxic drug, allowing tumor cells to be attacked precisely. This is of great significance for improving the precision of cancer treatment and reducing toxicity to normal tissue. However, because of their complex structure and mechanism of action, ADCs may trigger specific immune responses (such as the production of anti-drug antibodies, ADAs), reducing efficacy or aggravating adverse reactions. Accurate and highly interpretable methods for evaluating and predicting the immunogenicity of ADC drugs are therefore urgently needed.

Pharmaceutical regulatory standards are steadily rising, and the requirements for evaluating the immunogenicity of ADC drugs are becoming increasingly strict, so prediction methods that are both accurate and highly interpretable are urgently needed.

To date, immunogenicity prediction methods are mainly sequence-based or spatial-structure-based. Among sequence-based methods, SMM (B. Peters, A. Sette, BMC Bioinformatics 6, 132 (2005)) uses a matrix-based algorithm with position-specific scoring matrices to determine whether a peptide sequence matches the binding motif of a specific MHC allele. NetMHC uses artificial neural networks trained on per-allele data (for example "HLA-A*02:01", "H2-Db") to learn the underlying patterns or features that predict binding. Sequence-based methods are simple, intuitive, easy to implement and fairly accurate, but they depend on large amounts of training data and place demands on both data volume and data quality. Processing large amounts of data makes the models slow and costly to run and difficult to deploy at scale. Sequence-based methods are also black-box models whose results are poorly interpretable and hard to trace. Structure-based methods rely on peptide-MHC crystal structures and scoring functions, predicting peptide-MHC interactions by energy minimization and from these predicting immunogenicity. Structure-based methods do not rely on training data and offer some interpretability, but their accuracy is poor.

In summary, existing immunogenicity prediction methods cannot achieve both interpretability and accuracy, are slow and costly to run, are difficult to deploy at scale, and cannot adequately support the development of ADC drugs.

To address these problems, the present application uses the information and data available in existing immunogenicity databases to mine immunogenicity-related features, establishes the relationship between these features and immunogenicity, and uses this relationship to predict the immunogenicity of a sequence to be predicted.

It should be noted that the immunogenicity prediction technique proposed in the present invention can be used for immunogenicity prediction of polypeptides, antibodies, peptide-drug conjugates (PDCs), ADCs and more. Because the prediction is made only on the polypeptide or antibody, the technique is in principle applicable to any drug that contains a polypeptide. For ease of explanation, the embodiments of the present invention use ADCs as an example, but this is not to be regarded as limiting the scope of protection of the technical solution of the present invention.

Based on this, the present invention proposes a sequence immunogenicity prediction method, device, electronic device and storage medium. To aid understanding of the technical solution of the present application, the terms and related concepts used in the application are explained below:

Epitope metadata refers to data containing information about biological molecules (such as proteins and antibodies) and their interactions.

A peptide is a short-chain molecule composed of amino acid residues, usually 2 to 50 amino acids. A peptide is formed when amino acid residues are linked together by peptide bonds. Peptides are the building blocks of proteins, and a protein is a long-chain molecule composed of one or more peptides.

MHC (Major Histocompatibility Complex) is a group of genes that encode specific proteins. Its sequences are highly complex because MHC genes are polymorphic, that is, many different alleles exist. The proteins encoded by MHC genes play an important role in the immune system, especially in antigen presentation.

MHC Ligand Assays (Class II) are experimental methods used to study the interaction between MHC class II molecules and their ligands. MHC class II molecules act mainly in antigen presentation: they bind exogenous antigens and present them to CD4+ T cells, triggering an immune response.

TCR data refers to "Tissue-Resident Memory" cell data, that is, data on tissue-resident memory cells. These cells are a special type of immune cell found mainly in tissues rather than circulating in the blood. They play an important role in the immune response, responding quickly to reinfection and providing long-term immune protection. TCR data are often used to study the response and memory mechanisms of the immune system.

The trimmed mean is the average computed after removing the maximum and minimum values and averaging the remaining values.
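The trimmed mean as defined above can be written in a few lines. This sketch follows the definition literally, dropping exactly one maximum and one minimum; the function name `trimmed_mean` is hypothetical.

```python
def trimmed_mean(values):
    """Average of the values after dropping one minimum and one maximum."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim both extremes")
    core = sorted(values)[1:-1]  # remove the smallest and the largest value
    return sum(core) / len(core)
```

For example, a single outlier such as 100 in `[1, 2, 3, 100]` no longer influences the result, which is the usual reason for preferring this statistic over the plain mean.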

Quantiles are the values that divide a data set, sorted by size, into equal parts. Common quantiles include quartiles (four equal parts), the median (two equal parts) and deciles (ten equal parts). Quantiles are a commonly used descriptive statistic in statistics and data analysis. They help in understanding the distribution and central tendency of data, and can also be used to identify outliers and assess the dispersion of the data.

R is open-source statistical analysis software widely used for data analysis, statistical modeling and visualization.

A k-mer is any possible sequence fragment (or "substring") of length k; such fragments can be part of a DNA, RNA or protein sequence. In protein sequence analysis, the k-mer method splits a protein sequence into all possible fragments of length k and uses these fragments for further analysis, such as sequence alignment, pattern recognition and function prediction.

Elastic net regression is a linear regression method that combines L1 regularization (Lasso) and L2 regularization (Ridge).
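To make the combination of L1 and L2 penalties concrete, the following is a minimal coordinate-descent sketch of elastic net regression without an intercept. It assumes the objective (1/(2n))·||y − Xw||² + alpha·l1_ratio·||w||₁ + (alpha·(1 − l1_ratio)/2)·||w||², which is one common parameterization; the source does not specify the solver, the penalty strengths or the mixing ratio, so `alpha`, `l1_ratio`, `n_iter` and the function names are all assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; the L1 part of the update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Fit weights by cyclic coordinate descent (no intercept, dense lists)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][m] * w[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            denom = z + alpha * (1.0 - l1_ratio)  # the L2 part enlarges the denominator
            w[j] = soft_threshold(rho, alpha * l1_ratio) / denom if denom > 0 else 0.0
    return w
```

The L1 term can set weights exactly to zero, which removes uninformative features, while the L2 term shrinks the remaining weights; the surviving non-zero weights are what make the fitted model readable feature by feature. In practice `alpha` and `l1_ratio` would be chosen by cross-validation, as the description does with 5 folds.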

The implementation of each embodiment of the present invention is described in detail below with reference to the accompanying drawings and to explanations of the specific steps and structures.

Embodiment 1:

In a first aspect, the present invention provides a sequence immunogenicity prediction method. As shown in FIG. 1, the method comprises:

S1. extracting sequence features of peptide sequences with known immunogenicity, and constructing a feature matrix based on the sequence features, wherein the sequence features include k-mer features and k is a positive integer;

S2. establishing a regression relationship between the feature matrix and immunogenicity, and obtaining the weights corresponding to the different sequence features based on the regression relationship;

S3. constructing a sequence immunogenicity prediction model based on the weights;

S4. predicting the immunogenicity of a sequence to be predicted based on the sequence features of that sequence and the sequence immunogenicity prediction model.

The sequence immunogenicity prediction method provided in this embodiment mines immunogenicity-related features from the information in an existing immunogenicity database, establishes a linear relationship between the sequence features of known peptides and their immunogenicity, and then uses the sequence features of the sequence to be predicted together with this linear relationship to predict its immunogenicity. Compared with the prior art, the immunogenicity predictions of this embodiment are highly accurate and highly interpretable.

The implementation of one embodiment of the present invention is described in detail below with reference to FIGS. 1-3 and a step-by-step explanation of S1-S4.

S1. Extract sequence features of peptide sequences with known immunogenicity, and construct a feature matrix based on the sequence features; the sequence features include k-mer features, where k is a positive integer.

S11. Obtain epitope metadata. In this embodiment, the epitope metadata are taken from the MHC Ligand Assays (Class II) dataset of the IEDB, from which 6338 records are randomly sampled. Each record contains a peptide and the immunogenicity information for that peptide. Table 1 shows part of the MHC Ligand Assays data as an example.

Table 1. Examples of MHC Ligand Assays data

The 6338 records are labeled in order as samples 1-6338. Of these, 3315 samples are positive for immunogenicity and 3023 are negative. The immunogenicity annotation of the samples is then converted into a one-hot (binary 0/1) encoding: positive samples (Positive) are labeled 1 and negative samples (Negative) are labeled 0. The immunogenicity of the sample peptides can then be expressed as the vector $Y$:

$$Y = (y_1, y_2, \dots, y_n)^{T}$$

where $y_i$ is the immunogenicity of the $i$-th sample peptide and satisfies $y_i \in \{0, 1\}$; $i$ is the sample index and $n$ is the total number of samples.
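The label encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `encode_labels` and the toy outcome strings are assumptions, and only the Positive to 1 and Negative to 0 mapping comes from the text.

```python
def encode_labels(outcomes):
    """Map 'Positive'/'Negative' strings to 1/0 and reject anything else."""
    mapping = {"positive": 1, "negative": 0}
    labels = []
    for outcome in outcomes:
        key = outcome.strip().lower()
        if key not in mapping:
            raise ValueError(f"unexpected immunogenicity label: {outcome!r}")
        labels.append(mapping[key])
    return labels


# Made-up outcomes for three hypothetical samples.
toy_outcomes = ["Positive", "Negative", "Positive"]
y = encode_labels(toy_outcomes)
```

Rejecting unknown strings, rather than silently mapping them to 0, keeps mislabeled records from being counted as negatives.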

S12. Extract the sequence features of peptide sequences with known immunogenicity from the epitope metadata, and construct a feature matrix based on the sequence features.

Among existing immunogenicity prediction techniques, sequence-based methods generally either mine patterns in the sequence itself, or extract features such as hydrophobicity and charge from the sequence and model them together with affinity or other parameters that reflect immunogenicity; the resulting model is then used to predict the immunogenicity of a sequence. The other class of techniques, structure-based methods, predicts immunogenicity mainly by using the spatial structure of MHC crystals to analyze how the two molecules bind in space. Although each of the two approaches has its own advantages, neither achieves both accuracy and interpretability in sequence immunogenicity prediction.

To solve this problem, in this embodiment (see FIGS. 2-3) the sequence features extracted from a peptide sequence comprise two parts: features based on the sequence itself and features based on the sequence structure. The features based on the sequence itself include the thermodynamic information features and the k-mer features of the sequence; the features based on the sequence structure include the CPP features of the sequence. Specifically, for the 6338 sample records obtained above, a CPP feature data set, a thermodynamic information feature data set, and a k-mer feature data set are constructed as training sample features.

S121. Extract the thermodynamic information features of the peptide sequences from the epitope metadata.

Thermodynamic information features of peptide sequences are physicochemical indicators commonly used in immunogenicity prediction. They are generally computed from the sequence with the corresponding functions.

In one embodiment, the following thermodynamic features of the peptide sequences are computed with the corresponding functions of the Peptides package for R. The thermodynamic information of a peptide sequence and the corresponding functions are listed in Table 2.

Table 2. Thermodynamic features and their corresponding functions

The thermodynamic information features of the peptide sequences are computed with the functions in Table 2, and a thermodynamic information feature matrix is constructed. In this embodiment, each peptide sequence has 14 thermodynamic information features, which can be expressed as:

$$X_{\text{thermo}} = (x^{\text{thermo}}_{1}, x^{\text{thermo}}_{2}, \dots, x^{\text{thermo}}_{14})$$

where $X_{\text{thermo}}$ denotes the thermodynamic information feature data set of the sample peptide sequences.

S122. Extract the k-mer features of the peptide sequences from the epitope metadata.

When building the feature set for peptide sequences, the patterns in the sequence itself are learned in addition to the thermodynamic information features, in order to improve the generalization ability of the sequence immunogenicity prediction model.

In one embodiment, a k-mer model is used to extract the patterns in the sequence itself. The k-mers of a peptide sequence are computed as follows:

Define the value of k: k is a predetermined positive integer giving the length of the sequence fragments to be considered. The larger k is, the more complex and sparse the model becomes, and an overly complex model is too expensive to compute. In this embodiment, to keep the model simple, k = 2: every pair of adjacent amino acids is traversed and the number of occurrences of each pair is counted, which mines the sequence features.

In other embodiments the value of k is not specifically limited; any positive integer k falls within the scope of the present application. For example, in one embodiment k is any integer from 2 to 10. More preferably, k is set to 2 or 3.

Generate the k-mers: for a given protein sequence, start from one end of the sequence and move one residue at a time, extracting every contiguous fragment of length k (k-mer) and counting the occurrences of each distinct fragment. This step produces a large number of k-mers; the number depends on the sequence length and on the value of k. For example, for the peptide SEQ ID NO: 5, whose sequence is AAAKLD, the k-mer features for k = 2 are shown in Table 3.

Table 3. k-mer features of the peptide "AAAKLD" when k = 2
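The sliding-window count described above can be sketched as follows. The function names and the alphabet ordering are illustrative assumptions; the example peptide AAAKLD and k = 2 come from the text. With k = 2 and the 20 standard amino acids, the full feature vector has 20 × 20 = 400 entries, most of them zero for a short peptide.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(sequence, k=2):
    """Count every contiguous fragment of length k, moving one residue at a time."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_vector(sequence, k=2, alphabet=AMINO_ACIDS):
    """Fixed-length vector with one count per possible k-mer over the alphabet."""
    counts = kmer_counts(sequence, k)
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


counts = kmer_counts("AAAKLD", k=2)
# counts == {"AA": 2, "AK": 1, "KL": 1, "LD": 1}
vector = kmer_feature_vector("AAAKLD", k=2)
# len(vector) == 400
```

A peptide of length L contains L - k + 1 overlapping k-mers, so the six-residue example yields five 2-mers in total.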

In this embodiment, with the 20 amino acids and k = 2, a total of 20 × 20 = 400 k-mer features are obtained. The constructed k-mer features can be expressed as:

$$X_{\text{kmer}} = (x^{\text{kmer}}_{1}, x^{\text{kmer}}_{2}, \dots, x^{\text{kmer}}_{400})$$

where $X_{\text{kmer}}$ denotes the k-mer feature data set of the sample peptide sequences. Table 4 shows some of the k-mer features of the sample peptide sequences as an example.

Table 4. Examples of k-mer features of the sample peptide sequences

S123. Extract the CPP features of the peptide sequences from the epitope metadata.

To further improve the accuracy and interpretability of peptide immunogenicity prediction, this embodiment adds structural information on the peptide sequence to the features of the sequence itself (the thermodynamic information features and the k-mer features): the CPP features of a peptide sequence are extracted by aligning the peptide sequence against TCR data.

Specifically, the TCR data are first taken directly from the published literature. Table 5 shows some of the TCR data as an example.

Table 5. Examples of TCR data

The CPP feature data set is then obtained by TCR-peptide contact potential profiling. Specifically, for the 6338 sample peptides obtained above, each sample peptide sequence is globally aligned against the TCR sequences in the database using specific amino acid scoring matrices, and the alignment scores are recorded. Table 6 shows some of the alignment scores as an example.
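The global alignment scoring step can be sketched as below. The text does not name the alignment algorithm or the gap penalties, so this sketch assumes the standard Needleman-Wunsch recurrence with a linear gap penalty. A toy match/mismatch function stands in for the 35 amino acid scoring matrices, and the TCR fragments are made-up strings; all of these are assumptions for illustration.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # residue of a against a gap
                curr[j - 1] + gap,                        # residue of b against a gap
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Stand-in for an amino acid scoring matrix: +2 match, -1 mismatch."""
    return 2 if x == y else -1


peptide = "AAAKLD"
tcr_fragments = ["AAKLD", "CASSL", "AAAKLD"]  # hypothetical TCR sequences
scores = [global_alignment_score(peptide, t, toy_score) for t in tcr_fragments]
```

Running the same peptide against every TCR sequence with one scoring matrix yields one list of scores, which is the input for the per-peptide summary statistics used as CPP features.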

Table 6. Examples of alignment scores between sample peptide sequences and TCR sequences

In one embodiment, to ensure that the CPP features are informative and thereby improve the accuracy of the subsequent immunogenicity prediction, the following statistics of the alignment scores are computed for each peptide sequence and used as CPP features: the arithmetic mean, the 10% trimmed mean, the standard deviation, the standard error, the median, the median absolute deviation (MAD), the skewness, the kurtosis, the interquartile range (IQR), and the 10% and 90% quantiles.
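A minimal sketch of the eleven summary statistics, using only the Python standard library. Several conventions are assumptions, since the text does not specify them: the trimmed mean drops `int(n * 0.1)` values from each tail, quantiles use linear interpolation between order statistics, the MAD is unscaled, and skewness and kurtosis are the moment-based (population, non-excess) versions. The input scores are made up.

```python
import math
import statistics


def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics."""
    pos = (len(sorted_vals) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize_scores(scores, trim=0.1):
    """Return the 11 summary statistics of one peptide's alignment scores."""
    vals = sorted(scores)
    n = len(vals)
    if n < 2:
        raise ValueError("need at least two alignment scores")
    mean = statistics.fmean(vals)
    cut = int(n * trim)  # number of values dropped from each tail
    median = statistics.median(vals)
    sd = statistics.stdev(vals)  # sample standard deviation (n - 1)
    # Central moments for moment-based skewness and non-excess kurtosis.
    m2 = sum((v - mean) ** 2 for v in vals) / n
    m3 = sum((v - mean) ** 3 for v in vals) / n
    m4 = sum((v - mean) ** 4 for v in vals) / n
    return {
        "mean": mean,
        "trimmed_mean": statistics.fmean(vals[cut:n - cut]),
        "sd": sd,
        "se": sd / math.sqrt(n),
        "median": median,
        "mad": statistics.median([abs(v - median) for v in vals]),
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
        "iqr": quantile(vals, 0.75) - quantile(vals, 0.25),
        "q10": quantile(vals, 0.10),
        "q90": quantile(vals, 0.90),
    }


toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up alignment scores
cpp_features = summarize_scores(toy_scores)
```

Applying `summarize_scores` once per scoring matrix gives 11 values per matrix, which matches the count of 11 features per matrix stated in the text.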

In this embodiment, there are 35 specific amino acid scoring matrices. Each scoring matrix yields 11 features, and all CPP features are merged into a CPP feature matrix containing 35 × 11 = 385 CPP features in total, which can be expressed as:

$$X_{\text{CPP}} = (x^{\text{CPP}}_{1}, x^{\text{CPP}}_{2}, \dots, x^{\text{CPP}}_{385})$$

where $X_{\text{CPP}}$ denotes the CPP feature data set of the sample peptide sequences.

S124. Generate the feature matrix from the thermodynamic information features, the k-mer features, and the CPP features of the peptide sequences.

The thermodynamic features, k-mer features, and CPP features of the peptide sequences obtained above are integrated to form the feature matrix $X$:

$$X = (x_1, x_2, \dots, x_m), \qquad x_j = (x_{1j}, x_{2j}, \dots, x_{nj})^{T}$$

where $m$ is the total number of features after the three feature sets are integrated, each vector $x_j$ holds the values of the corresponding feature variable for every sample, and $T$ denotes the transpose.
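The integration step amounts to a column-wise concatenation of the three feature blocks, one row per sample. The sketch below uses placeholder zeros instead of real feature values, and it assumes for illustration that each of the 14 thermodynamic items contributes a single column, which gives 14 + 400 + 385 = 799 columns; the function name is hypothetical.

```python
def build_feature_matrix(thermo_rows, kmer_rows, cpp_rows):
    """Concatenate per-sample feature blocks column-wise, one row per sample."""
    if not (len(thermo_rows) == len(kmer_rows) == len(cpp_rows)):
        raise ValueError("all feature blocks must cover the same samples")
    return [
        list(t) + list(k) + list(c)
        for t, k, c in zip(thermo_rows, kmer_rows, cpp_rows)
    ]


n_samples = 3  # toy size; the embodiment uses 6338 samples
thermo = [[0.0] * 14 for _ in range(n_samples)]   # placeholder values
kmer = [[0] * 400 for _ in range(n_samples)]      # placeholder values
cpp = [[0.0] * 385 for _ in range(n_samples)]     # placeholder values
X = build_feature_matrix(thermo, kmer, cpp)
```

Keeping a fixed column order across training and prediction matters here, because each learned weight is tied to one column position.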

In this step, extracting both sequence-based and structure-based features from the peptide sequences yields a large number of features, typically 500-8000.

For each sequence, this embodiment mines the thermodynamic information features, the k-mer features, and the CPP features together. It extracts the physicochemical indicators commonly used in sequence immunogenicity prediction, learns the patterns in the sequence itself, and additionally mines the structural information of the sequence. Integrating these features improves the generalization ability of the immunogenicity prediction model and also makes it possible to identify the features that affect the immunogenicity result, so that the prediction results are explainable and traceable.

S2. Establish a regression relationship between the feature matrix and immunogenicity, and obtain the weights corresponding to the different sequence features based on the regression relationship.

S21. Establish the regression relationship between the feature matrix and immunogenicity.

In sequence immunogenicity prediction, deep learning is currently the mainstream approach to modeling and analyzing sequence features. Although the parameters of a deep model are relatively easy to tune, the overall computational cost is very high and the model is poorly interpretable, which makes the results difficult to explain and trace. Because the number of sequence features obtained in step S1 is very large, the computational cost and difficulty of the subsequent feature modeling and analysis would increase even further.

To address this problem, in some embodiments a linear model is used to model the relationship between the feature matrix of the peptide sequences and their immunogenicity. Specifically, a regression relationship is established between the feature matrix of the samples and their immunogenicity results, which can be expressed as:

$$P = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$$

where $P$ denotes the probability that the immunogenicity result is positive; $x_1, x_2, \dots, x_m$ are the sequence features, each represented as a vector; and $\beta_1, \dots, \beta_m$ are the regression coefficients, which indicate the strength of the relationship between immunogenicity and each feature.

S22. Obtain the weights corresponding to the different sequence features based on the regression relationship.

In this embodiment the immunogenicity of the sample peptides is known, that is, the value of $Y$ is known (for example, the immunogenicity of the sample peptides above is $Y = (y_1, y_2, \dots, y_n)^{T}$), so the scoring weight vector $\hat{\beta}$ can be obtained by elastic net regression:

$$\hat{\beta} = \underset{\beta}{\arg\min}\left( \lVert Y - X\beta \rVert^{2} + \lambda_1 \lVert \beta \rVert_{1} + \lambda_2 \lVert \beta \rVert^{2} \right)$$

The value of $\beta$ that minimizes the expression inside the parentheses of the argmin is the estimate required. Here $\hat{\beta}$ denotes the regression coefficients estimated by the model; $X$ denotes the sequence feature matrix, that is, $X = (x_1, x_2, \dots, x_m)$; $\lambda_1$ is the L1 regularization coefficient, used for feature selection; and $\lambda_2$ is the L2 regularization coefficient, used to prevent the model from overfitting.

$\beta$ and $\hat{\beta}$ are mathematically the same parameter; the hat indicates that the parameter is an estimate rather than the true value. $\beta$ is normally used when describing the theory, and $\hat{\beta}$ when describing the estimate of the parameter.
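To make the roles of the two penalties concrete, the sketch below evaluates an elastic net objective for a candidate weight vector. It assumes the standard form, squared residuals plus an L1 penalty weighted by `lambda1` and a squared L2 penalty weighted by `lambda2`; the function name and the toy data are illustrative, and the code only evaluates the objective rather than minimizing it.

```python
def elastic_net_objective(X, y, beta, lambda1, lambda2):
    """Squared error plus L1 and squared-L2 penalties for a candidate beta."""
    residual_ss = sum(
        (yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
        for row, yi in zip(X, y)
    )
    l1_penalty = sum(abs(b) for b in beta)   # encourages exact zeros (feature selection)
    l2_penalty = sum(b * b for b in beta)    # shrinks large weights (limits overfitting)
    return residual_ss + lambda1 * l1_penalty + lambda2 * l2_penalty


# Toy data: three samples, two features, binary labels.
X_toy = [[1, 0], [0, 1], [1, 1]]
y_toy = [1, 0, 1]
fit_value = elastic_net_objective(X_toy, y_toy, [1, 0], lambda1=0.5, lambda2=0.25)
null_value = elastic_net_objective(X_toy, y_toy, [0, 0], lambda1=0.5, lambda2=0.25)
```

On this toy data the weight vector `[1, 0]` reproduces the labels exactly, so its objective consists of the penalty terms only and is lower than that of the all-zero vector.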

Next, let $\lambda = \lambda_1 + \lambda_2$ and $\alpha = \lambda_1 / (\lambda_1 + \lambda_2)$, where $\lambda_1$ and $\lambda_2$ are the L1 and L2 regularization coefficients, and substitute these into the elastic net regression formula above, which gives:

$$\hat{\beta} = \underset{\beta}{\arg\min}\left( \lVert Y - X\beta \rVert^{2} + \lambda \left[ \alpha \lVert \beta \rVert_{1} + (1 - \alpha) \lVert \beta \rVert^{2} \right] \right)$$

The parameters are estimated by 5-fold cross-validation. As shown in FIG. 4, the value of $\lambda$ at which the AUC is maximal is obtained. In FIG. 4, $y$ and $x$ denote the values corresponding to the immunogenicity and to the feature matrix of the sample peptides, respectively. Finally, $\hat{\beta}$ is computed from the elastic net regression formula.
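The selection of the regularization strength by cross-validated AUC can be sketched as follows. The model fitting inside each fold is deliberately left out; the sketch only shows how samples might be split into five folds, how AUC can be computed from scores, and how the candidate with the highest mean AUC is picked. All function names, the candidate values, and the per-fold AUC numbers are made-up assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_lambda(cv_auc):
    """Pick the candidate whose mean AUC across folds is highest."""
    return max(cv_auc, key=lambda lam: sum(cv_auc[lam]) / len(cv_auc[lam]))


toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
folds = k_fold_indices(10, k=5)
# Hypothetical per-fold AUCs for two candidate values of lambda.
cv_auc = {0.01: [0.70, 0.72, 0.69, 0.71, 0.70], 0.1: [0.74, 0.75, 0.73, 0.76, 0.74]}
best_lambda = select_lambda(cv_auc)
```

Each fold serves once as the held-out set while the other four are used for fitting, so every sample contributes to exactly one AUC estimate per candidate.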

This embodiment uses a sparse linear regression model to solve the MHC prediction problem. Compared with the deep learning models that currently dominate, it requires less computation and runs faster on extremely large feature sets, which saves a great deal of time in processes that require high-volume screening, such as synthetic libraries. In addition, parameters such as thresholds in the model can be adjusted more flexibly.

S3. Construct the sequence immunogenicity prediction model based on the weights.

The sequence immunogenicity prediction model is constructed from the linear relationship between sequence immunogenicity and sequence features obtained in step S2 and from the weights corresponding to the different sequence features. It can be expressed as:

$$\hat{y} = X_{\text{new}}\,\hat{\beta}$$

where $\hat{y}$ is the probability that the immunogenicity of the peptide sequence to be predicted is Positive; $\hat{\beta}$ is the scoring weight vector corresponding to the features of the sequence to be predicted; and $X_{\text{new}}$ is the feature matrix of the peptide sequence to be predicted.

S4. Predict the immunogenicity of the sequence to be predicted based on its sequence features and the sequence immunogenicity prediction model.

Following step S12 above, the feature matrix $X_{\text{new}}$ of the sequence to be predicted is constructed. It comprises the CPP features, the thermodynamic features, and the k-mer features of the sequence to be predicted.

The immunogenicity of the sequence to be predicted is then predicted with the sequence immunogenicity prediction model obtained in step S3.
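Prediction with a linear model reduces to a dot product between the feature vector and the learned weights. The sketch below combines that score with the decision rule used in the validation experiments of this embodiment, where a score greater than 0 is called immunogenic. The weights and feature values are made up, and the function name is an assumption.

```python
def predict_immunogenicity(features, weights, threshold=0.0):
    """Return (score, label): label is 1 if the linear score exceeds the threshold."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    score = sum(x * w for x, w in zip(features, weights))
    return score, int(score > threshold)


toy_weights = [0.5, -1.0, 0.25]  # hypothetical learned weights
score_a, label_a = predict_immunogenicity([2, 1, 4], toy_weights)
score_b, label_b = predict_immunogenicity([0, 1, 0], toy_weights)
```

Because the score is a weighted sum, the contribution of each feature to a given prediction can be read off directly as `x * w`, which is the basis of the interpretability argument made for the linear model.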

The immunogenicity prediction method provided in this embodiment is applicable to peptides 8-13 amino acids in length. In one embodiment, the sequence to be predicted is selected from a polypeptide or an antibody peptide sequence. More preferably, in another embodiment, the antibody includes an ADC.

This completes the entire workflow of the sequence immunogenicity prediction method of this embodiment.

To verify the advantages of the sequence immunogenicity prediction method of the above embodiment (hereinafter "this method") in prediction accuracy and prediction efficiency, the following experiments were carried out:

1) Verification of the accuracy of this method.

The feature matrix $X$ of the sample peptides obtained in step S1 is substituted into the formula used to predict the immunogenicity of a peptide sequence, $\hat{y} = X\hat{\beta}$, which gives the immunogenicity prediction $\hat{y}_i$ for each sample peptide. When $\hat{y}_i > 0$, the sequence is judged to be immunogenic; when $\hat{y}_i \le 0$, the sequence is judged to be non-immunogenic. Comparing these predictions with the known true immunogenicity of the sample peptides gives the results shown in Table 7:

Table 7. Evaluation of the accuracy of the immunogenicity prediction results

The evaluation metrics in Table 7 are computed from the following counts:

TP (True Positive): the prediction is 1 and the true value is also 1;

FP (False Positive): the prediction is 1 and the true value is 0;

TN (True Negative): the prediction is 0 and the true value is also 0;

FN (False Negative): the prediction is 0 and the true value is 1.
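The four counts defined above are enough to compute the usual classification metrics. Which metrics Table 7 reports is not stated in the surviving text, so the sketch below assumes four common ones (accuracy, precision, recall, and F1) purely for illustration; the labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


toy_true = [1, 1, 0, 0, 1]  # made-up true labels
toy_pred = [1, 0, 0, 1, 1]  # made-up predictions
counts = confusion_counts(toy_true, toy_pred)
metrics = classification_metrics(*counts)
```

With roughly balanced classes, as in the 3315 positive and 3023 negative training samples, accuracy is informative on its own; precision and recall matter more when one class dominates.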

The results in Table 7 show that the accuracy of this method meets the current requirements of ADC drug development. In addition, the whole workflow is traceable, which makes the results easier to interpret, and the method offers high throughput and high efficiency.

2) Comparison of the accuracy of this method with that of a prior-art immunogenicity prediction method.

A further 6540 test sequences were obtained and named, in order, test sequence 1 to test sequence 6540. Their features were constructed by the process described in step S12, giving the feature matrix $X_{\text{test}}$. The $\hat{\beta}$ computed in step S3 was then substituted into the formula $\hat{y} = X_{\text{test}}\,\hat{\beta}$. When $\hat{y}_i > 0$, the sequence is judged to be immunogenic; when $\hat{y}_i \le 0$, the sequence is judged to be non-immunogenic. The predictions were then compared with the reference results; the accuracy is shown in Table 8:

Table 8. Evaluation of the accuracy of the immunogenicity test results

The mainstream tool NetMHCpan 4.0 was also used to predict the immunogenicity of the same 6540 test sequences, and its predictions were compared with the reference results. The accuracy is shown in Table 9.

Table 9. Accuracy of the NetMHCpan immunogenicity prediction results

Comparing Table 8 with Table 9 shows that this method scores higher than NetMHCpan 4.0 on accuracy and the other metrics. This indicates that the method can be used to predict immunogenicity and that its prediction accuracy is better than that of the existing mainstream immunogenicity prediction method.

3) Evaluation of the prediction efficiency of this method.

Table 10 compares the time taken by this method and by NetMHCpan to predict immunogenicity in Experiments 1) and 2). As Table 10 shows, this method runs far more efficiently than NetMHCpan and can significantly improve efficiency in ADC drug research and development. In this efficiency experiment, all tests were run single-threaded on an Intel(R) Xeon(R) Silver 4214R CPU.

Table 10. Comparison of the running time of this method and NetMHCpan

Embodiment 2:

In a second aspect, the present invention provides a sequence immunogenicity prediction device. As shown in FIG. 5, the device comprises:

an immunogenicity prediction model building unit, configured to extract sequence features of peptide sequences with known immunogenicity and construct a feature matrix based on the sequence features; to establish a regression relationship between the feature matrix and immunogenicity and obtain the weights corresponding to the different sequence features based on the regression relationship; and to construct a sequence immunogenicity prediction model based on the weights; wherein the sequence features include k-mer features and k is a positive integer;

an immunogenicity prediction unit, configured to predict the immunogenicity of a sequence to be predicted based on the sequence features of that sequence and the sequence immunogenicity prediction model.

In the sequence immunogenicity prediction device provided in this embodiment, the immunogenicity prediction model building unit mines immunogenicity-related features from the information in an existing immunogenicity database, establishes a linear relationship between the sequence features of known peptides and their immunogenicity, and uses this relationship as the sequence immunogenicity prediction model. The immunogenicity prediction unit then uses the sequence features of the sequence to be predicted and the constructed model to predict the immunogenicity of that sequence. Compared with the prior art, the immunogenicity predictions of this embodiment are highly accurate and highly interpretable.

In some embodiments, the sequence to be predicted is selected from a polypeptide or an antibody.

Further, the antibody includes an ADC.

In some embodiments, the polypeptide includes peptides 8-13 amino acids in length.

In some embodiments, the immunogenicity prediction model building unit further includes a CPP feature extraction subunit and a thermodynamic feature extraction subunit. The CPP feature extraction subunit is configured to extract the CPP features of the peptide sequences, and the thermodynamic feature extraction subunit is configured to extract the thermodynamic information features of the peptide sequences.

In some embodiments, the thermodynamic information features of a peptide sequence include: the aliphatic index of the protein sequence, the BLOSUM62 indices of the protein sequence, the Boman index, the theoretical net charge of the protein sequence, the Cruciani properties of the protein sequence, the FASGAI (Factor Analysis Scales of Generalized Amino Acid Information) vectors of the protein sequence, the hydrophobic moment of the protein sequence, the hydrophobicity index of the protein sequence, the instability index of the protein sequence, the Kidera factors of the protein sequence, the isoelectric point (pI) of the protein sequence, the protFP descriptors of the protein sequence, the VHSE scales of the protein sequence, and the Z-scales of the protein sequence.

Further, in one embodiment, the thermodynamic information features of the peptide sequences are computed with the corresponding functions of the Peptides package for R.

In some embodiments, the regression relationship is a linear regression relationship.

Further, in some embodiments, obtaining the weights corresponding to the different sequence features based on the regression relationship includes: obtaining the weights corresponding to the different sequence features by elastic net regression.

Still further, in some embodiments, when the weights corresponding to the different sequence features are obtained by elastic net regression, the parameters are estimated by 5-fold cross-validation.

In some embodiments, the immunogenicity prediction model building unit includes a k-mer feature extraction subunit, and extracting the k-mer features with the k-mer feature extraction subunit includes:

defining the value of k, where k is the length of the peptide sequence fragments to be considered and k is a positive integer;

starting from one end of a given protein sequence, traversing all combinations of k consecutive amino acids in the protein sequence, and counting the types of these amino acid combinations and the number of occurrences of each.

Further, in some embodiments, k is any integer from 2 to 10.

Still further, in one embodiment, k = 2 or 3.

In some embodiments, extraction of the CPP features of the peptide sequence by the CPP feature extraction subunit includes:

globally aligning each peptide sequence with the TCR sequences, and using quantiles of the alignment scores as the CPP features.

Further, in some embodiments, each peptide sequence is globally aligned with the TCR sequences based on an amino acid scoring matrix.

Still further, in one embodiment, the 10% and 90% quantiles of the alignment scores are used as the CPP features.
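The CPP feature computation above can be sketched as a global (Needleman-Wunsch) alignment of one peptide against a set of TCR sequences, followed by taking the 10% and 90% quantiles of the resulting scores. The sketch makes several assumptions that are not in the text: a simple match/mismatch score stands in for the amino acid scoring matrix, the gap penalty is linear, the quantile uses linear interpolation, and all function names and parameter values are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def cpp_features(peptide, tcr_sequences):
    """Return the 10% and 90% quantiles of the peptide-vs-TCR alignment scores."""
    scores = [global_alignment_score(peptide, tcr) for tcr in tcr_sequences]
    return quantile(scores, 0.10), quantile(scores, 0.90)
```

To follow the embodiment that uses an amino acid scoring matrix, the `match`/`mismatch` lookup would be replaced by a lookup into that matrix. Using two quantiles rather than the maximum or mean summarizes the spread of alignment scores across the TCR set while being less sensitive to a single outlying sequence.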

It should be understood that the sequence immunogenicity prediction device provided in this embodiment corresponds to the sequence immunogenicity prediction method described above. For explanations, examples, and beneficial effects of the related content, refer to the corresponding parts of the sequence immunogenicity prediction method, which are not repeated here.

Embodiment 3:

In a third aspect, an embodiment of the present invention provides an electronic device. Referring to Figure 6, the electronic device includes a processor, a communication interface, a memory and a bus, wherein the processor, the communication interface and the memory communicate with one another through the bus, and the processor can call logic instructions in the memory to execute the steps of the method provided in the first aspect.

It should be understood that the sequence immunogenicity prediction electronic device provided in this embodiment corresponds to the sequence immunogenicity prediction method described above. For explanations, examples, and beneficial effects of the related content, refer to the corresponding parts of the sequence immunogenicity prediction method, which are not repeated here.

Embodiment 4:

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method provided in the first aspect.

It should be understood that the computer-readable storage medium for sequence immunogenicity prediction provided in this embodiment corresponds to the sequence immunogenicity prediction method described above. For explanations, examples, and beneficial effects of the related content, refer to the corresponding parts of the sequence immunogenicity prediction method, which are not repeated here.

In summary, compared with the prior art, the present invention has the following beneficial effects:

1. The sequence immunogenicity prediction method, device, electronic device and storage medium provided by the present invention mine immunogenicity-related features from the information in existing immunogenicity databases, establish a linear relationship between the sequence features of known peptide sequences and their immunogenicity, and use that relationship as the sequence immunogenicity prediction model. The sequence features of a sequence to be predicted are then combined with this model to predict its immunogenicity. Compared with the prior art, the immunogenicity prediction results of the present invention are highly interpretable and accurate.

2. When predicting the immunogenicity of a sequence, the sequence immunogenicity prediction method, device, electronic device and storage medium provided by the present invention use, in addition to commonly used physicochemical indices, a k-mer model to extract the patterns and features of the sequence itself. They also obtain the CPP features of the sequence by aligning peptides against TCR data, and these features reflect the structural information of the sequence in place of MHC crystal structures. Compared with the prior art, by mining and analyzing both the features of the peptide sequence itself and its structural features, the present invention can explain and trace the immunogenicity prediction results while maintaining prediction accuracy.

3. When predicting the immunogenicity of a sequence, the sequence immunogenicity prediction method, device, electronic device and storage medium provided by the present invention use a sparse linear regression model to assist immunogenicity prediction. Compared with the deep models of the prior art, this approach requires less computation, runs faster and yields highly interpretable results, and the model parameters can be flexibly adjusted to actual needs during research and development.
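The interpretability claimed for the sparse linear model can be illustrated with a short sketch: the predicted score is a weighted sum of feature values, so each feature's contribution can be read off directly and traced back to a k-mer, a CPP quantile, or a thermodynamic feature. The feature names, weights, and values below are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def predict_immunogenicity(features, weights, intercept=0.0):
    """Linear score: intercept plus the sum of weight * feature value."""
    return intercept + sum(weights.get(name, 0.0) * value
                           for name, value in features.items())


def top_contributions(features, weights, n=3):
    """Rank features by the absolute size of their contribution to the score."""
    contrib = {name: weights.get(name, 0.0) * value
               for name, value in features.items()}
    ranked = sorted(contrib.items(), key=lambda item: abs(item[1]), reverse=True)
    # Features with zero weight (dropped by the sparse model) are omitted.
    return [(name, c) for name, c in ranked if c != 0.0][:n]


# Hypothetical weights learned by the regression and features of one peptide.
weights = {"kmer_GI": 0.8, "kmer_FV": -0.3, "cpp_q90": 0.05}
features = {"kmer_GI": 1, "kmer_FV": 2, "kmer_TL": 1, "cpp_q90": 4.0}

score = predict_immunogenicity(features, weights)
for name, contribution in top_contributions(features, weights):
    print(name, round(contribution, 3))
```

Features absent from `weights` contribute nothing to the score, which mirrors how a sparse model discards uninformative features and leaves a short list of terms to inspect.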

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of predicting sequence immunogenicity, the method comprising:
extracting sequence features of peptide sequences with known immunogenicity, and constructing a feature matrix based on the sequence features, wherein the sequence features include k-mer features and k is a positive integer;
establishing a regression relationship between the feature matrix and immunogenicity, and obtaining weights corresponding to different sequence features based on the regression relationship;
constructing a sequence immunogenicity prediction model based on the weights;
predicting the immunogenicity of a sequence to be predicted based on sequence features of the sequence to be predicted and the sequence immunogenicity prediction model;
wherein the sequence features further comprise: CPP features of the peptide sequence and thermodynamic information features of the peptide sequence;
extracting the k-mer features comprises:
customizing the value of k, wherein k represents the length of the peptide sequence fragments to be considered, and k is a positive integer;
starting from one end of a given protein sequence, traversing all combinations of k amino acids in the protein sequence, and counting the types and numbers of all such amino acid combinations;
extracting the CPP features of the peptide sequence comprises:
globally aligning each peptide sequence with the TCR sequences, and using quantiles of the alignment scores as the CPP features.
2. The method of claim 1, wherein the regression relationship is a linear regression relationship.
3. The method of claim 2, wherein obtaining weights corresponding to different sequence features based on the regression relationship comprises: obtaining the weights corresponding to the different sequence features based on elastic net regression.
4. A sequence immunogenicity prediction device, the device comprising:
an immunogenicity prediction model modeling unit for extracting sequence features of peptide sequences with known immunogenicity and constructing a feature matrix based on the sequence features; establishing a regression relationship between the feature matrix and immunogenicity, and obtaining weights corresponding to different sequence features based on the regression relationship; and constructing a sequence immunogenicity prediction model based on the weights; wherein the sequence features comprise k-mer features and k is a positive integer;
an immunogenicity prediction unit for predicting the immunogenicity of a sequence to be predicted based on sequence features of the sequence to be predicted and the sequence immunogenicity prediction model;
wherein the immunogenicity prediction model modeling unit comprises a k-mer feature extraction subunit, a CPP feature extraction subunit and a thermodynamics-related feature extraction subunit; the k-mer feature extraction subunit is used for extracting k-mer features of the peptide sequence, the CPP feature extraction subunit is used for extracting CPP features of the peptide sequence, and the thermodynamics-related feature extraction subunit is used for extracting thermodynamic information features of the peptide sequence;
extraction of the k-mer features by the k-mer feature extraction subunit comprises:
customizing the value of k, wherein k represents the length of the peptide sequence fragments to be considered, and k is a positive integer;
starting from one end of a given protein sequence, traversing all combinations of k amino acids in the protein sequence, and counting the types and numbers of all such amino acid combinations;
extraction of the CPP features of the peptide sequence by the CPP feature extraction subunit comprises:
globally aligning each peptide sequence with the TCR sequences, and using quantiles of the alignment scores as the CPP features.
5. The device of claim 4, wherein the regression relationship is a linear regression relationship.
6. The device of claim 5, wherein obtaining weights corresponding to different sequence features based on the regression relationship comprises: obtaining the weights corresponding to the different sequence features based on elastic net regression.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the sequence immunogenicity prediction method according to any of claims 1 to 3.
8. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the sequence immunogenicity prediction method according to any of claims 1 to 3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411082472.6A CN118629514B (en) 2024-08-08 2024-08-08 Sequence immunogenicity prediction method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411082472.6A CN118629514B (en) 2024-08-08 2024-08-08 Sequence immunogenicity prediction method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN118629514A CN118629514A (en) 2024-09-10
CN118629514B true CN118629514B (en) 2024-10-29

Family

ID=92610703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411082472.6A Active CN118629514B (en) 2024-08-08 2024-08-08 Sequence immunogenicity prediction method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN118629514B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994643A (en) * 2023-07-14 2023-11-03 清华珠三角研究院 Method, system, device and storage medium for predicting immunogenic peptide presentation

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130267429A1 (en) * 2009-12-21 2013-10-10 Lawrence Livermore National Security, Llc Biological sample target classification, detection and selection methods, and related arrays and oligonucleotide probes
KR102505543B1 (en) * 2014-03-28 2023-03-02 옵코 다이어그노스틱스, 엘엘씨 Compositions and methods related to diagnosis of prostate cancer
KR102159921B1 (en) * 2020-03-24 2020-09-25 주식회사 테라젠바이오 Method for predicting neoantigen using a peptide sequence and hla allele sequence and computer program
WO2022072722A1 (en) * 2020-09-30 2022-04-07 The Board Of Regents Of The University Of Texas System Deep learning system for predicting the t cell receptor binding specificity of neoantigens
US20220136062A1 (en) * 2020-10-30 2022-05-05 Seekin, Inc. Method for predicting cancer risk value based on multi-omics and multidimensional plasma features and artificial intelligence
CN113762416B (en) * 2021-10-15 2023-05-30 南京澄实生物科技有限公司 Antigen immunogenicity prediction method and system based on multi-modal depth coding
CN115896242A (en) * 2022-11-25 2023-04-04 绵溢(河北雄安)生物科技有限公司 Intelligent cancer screening model and method based on peripheral blood immune characteristics
WO2024115638A1 (en) * 2022-12-01 2024-06-06 Immunewatch Bv Method to evaluate the conspicuousness of an epitope towards the repertoire of t-cell receptors
CN116959581A (en) * 2023-01-17 2023-10-27 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for immunogenicity prediction model
CN118212975A (en) * 2024-03-08 2024-06-18 中国人民解放军军事科学院军事医学研究院 Peptide, MHC and TCR binding prediction method and system based on multitasking learning

Also Published As

Publication number Publication date
CN118629514A (en) 2024-09-10

Similar Documents

Publication Publication Date Title
A. Schäffer et al. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices
Wei et al. Enhanced protein fold prediction method through a novel feature extraction technique
CN112885412B (en) Genome annotation method, apparatus, visualization platform and storage medium
WO2018232580A1 (en) Method and device for haplotype phasing of diploid genome based on third generation capture sequencing
CN112837743A (en) A Machine Learning-Based Drug Relocation Method
CN116580771A (en) Method and device for predicting tumor neoantigen
CN117037897B (en) Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
CN118351934A (en) Tumor neoantigen recognition method and system based on second-generation sequencing data
CN118629514B (en) Sequence immunogenicity prediction method, device, electronic device and storage medium
Jiang et al. Predicting MHC class I binder: existing approaches and a novel recurrent neural network solution
JP2023530719A (en) Machine learning techniques for predicting surface-displayed peptides
KR102563901B1 (en) Prediction method for property of pharmaceutical active ingredient
CN116525108A (en) SNP data-based prediction method, device, equipment and storage medium
CN111798919A (en) A kind of tumor neoantigen prediction method, prediction device and storage medium
Joyce et al. Navigating phylogenetic conflict and evolutionary inference in plants with target capture data
JP7104441B2 (en) rmsd Multi-feature-based method for analyzing drug molecular dynamics results
Cannon et al. Comparison of probability and likelihood models for peptide identification from tandem mass spectrometry data
Wojciechowska et al. Non-standard proteins in the lenses of AlphaFold3-case study of amyloids
Hu et al. Ensemble approaches for improving HLA class I-peptide binding prediction
Hu et al. The identification of metal ion ligand-binding residues by adding the reclassified relative solvent accessibility
WO2024217010A1 (en) Gene and disease association analysis method and apparatus, computer device and storage medium
Scott et al. Isling: a tool for detecting integration of wild-type viruses and clinical vectors
Li et al. Multidimensional scaling method for prediction of lysine glycation sites
CN117457079B (en) MHC prediction model construction method and system based on degeneracy coding and deep learning
CN113851187A (en) Glycopeptide epitope prediction method, storage medium and terminal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant