TW202228153A

TW202228153A - System and method for predicting and identifying the immunogenicity of a peptide based on machine learning

Info

Publication number: TW202228153A
Application number: TW110146065A
Authority: TW
Inventors: 王明傑; 崔輝; 溫婧
Original assignee: 大陸商江蘇恆瑞醫藥股份有限公司; 大陸商上海盛迪醫藥有限公司
Priority date: 2020-12-09
Filing date: 2021-12-09
Publication date: 2022-07-16
Also published as: WO2022121973A1; CN116583903A

Abstract

The present disclosure relates to a system and a method for training a machine-learning model that predicts and identifies HLA-peptide presentation. The system comprises an encoding module, a neural network training module, an ensemble learning module, and an immunogenicity prediction module. The system outperforms existing predictors trained on binding affinity.

Description

Machine learning-based peptide immunogenicity prediction and identification system and method

This application claims priority to the patent application filed on December 9, 2020 (application number CN202011450578.9) and the patent application filed on July 5, 2021 (application number CN202110756286.6).

The present disclosure belongs to the field of biomedicine and provides a method for predicting and/or identifying the likelihood that a peptide is presented by human leukocyte antigen (HLA) molecules, for use in preparing vaccines for tumor immunotherapy.

Immunotherapy is a type of tumor treatment that has emerged in recent years. Compared with surgical resection, conventional radiotherapy and chemotherapy, targeted therapy, and other methods, immunotherapy has a more pronounced effect, fewer side effects, and a longer period of patient benefit. The classic immunotherapy regimen injects immune checkpoint inhibitors to activate the patient's immune cells, but the activated immune cells often cannot attack tumor cells specifically and may instead attack normal cells. As a result, immunotherapy is suitable only for patients whose tumors differ substantially from normal tissue, which greatly limits its scope of application, safety, and efficacy. Building on this, methods such as tumor neoantigen vaccines and in vitro T cell priming and culture with tumor neoantigens have been developed to make the activated immune cells more specific to tumor cells, thereby improving the general applicability and safety of immunotherapy.

The core of methods such as tumor neoantigen vaccines and in vitro T cell priming with tumor neoantigens is predicting the immunogenicity of the antigen: only vaccines made from, or T cells primed with, highly immunogenic tumor neoantigens are effective. This step, however, remains a major challenge. ELISPOT assays can determine the immunogenicity of peptides accurately, but a single assay can test only a few dozen peptides, which does not meet clinical needs. Mass spectrometry can measure the immunogenicity of a large number of peptides at once, but the experimental cycle often lasts several months and the experimental conditions are not yet stable, which makes clinical application difficult.

With the rapid development of machine learning and the continuing convergence of medicine and artificial intelligence, computational methods have become a powerful tool for studying problems in biology and medicine. The first computational method for predicting HLA binding was SYFPEITHI, but for most HLA types SYFPEITHI can only make predictions for peptides of 9 or 10 amino acids. At present, the immunogenicity of tumor neoantigens is mostly assessed with public software such as NetMHCpan and MHCflurry. NetMHCpan 4.0 is trained on both binding affinity and mass spectrometry (MS) eluted ligand data, and achieves better predictions than training on either data type alone. When trained on affinity measurements, MHCflurry 1.2 outperforms the standard predictors NetMHC 4.0 and NetMHCpan 3.0 overall (see O'Donnell, Timothy J., et al. "MHCflurry: open-source class I MHC binding affinity prediction." Cell Systems 7.1 (2018): 129-132). However, the accuracy of all of these tools is below 10%, and the study by Bonsack et al. shows that NetMHCpan 4.0 and MHCflurry 1.2 do not clearly outperform methods such as PickPocket 1.1, IEDB SMM, IEDB SMMPMBEC, and SYFPEITHI (see Bonsack, Maria, et al. "Performance evaluation of MHC class-I binding prediction tools based on an experimentally validated MHC-peptide binding data set." Cancer Immunology Research 7.5 (2019): 719-736). It is therefore necessary to develop new methods for predicting the likelihood that an antigen is presented by human leukocyte antigen molecules, so as to advance tumor immunotherapy.

Ensemble learning is a machine learning approach that has developed rapidly in recent years. Its main principle is to combine multiple weak models into a single strong model. An ensemble learning method first trains a weak model, then trains the next weak model based on the results of the previous one, and iterates in this way many times; finally, the series of weak models is combined to obtain a single strong final model. The advantage of ensemble learning is that it adjusts its parameters adaptively and combines the results of many iterations, so its accuracy is often high.

Existing ensemble learning models include AdaBoost, XGBoost, and LightGBM. AdaBoost is the original ensemble learning model; it optimizes its parameters and combines models automatically during training, but it has no multithreading support and uses all of the data in every training round, so training often takes too long on large datasets. XGBoost builds on this with multithreading optimization and trains on a random subset of the data, which improves both training speed and model performance. LightGBM uses a different parameter self-optimization method and trains faster than XGBoost, but the resulting model is often slightly less accurate. To date, there have been no reports of ensemble learning models being applied to antigen immunogenicity prediction, and an accurate, effective prediction model for antigen immunogenicity is still lacking.

The present disclosure provides a machine learning-based system and method for predicting and/or identifying peptide immunogenicity. By training parameters on high-quality mass spectrometry data and fusing a neural network model with an ensemble learning model, it improves on the accuracy of existing prediction and/or identification methods and addresses the problem of low accuracy in immunogenicity detection, thereby improving the efficacy and safety of downstream immunotherapies.

The present disclosure provides a method for predicting and/or identifying the likelihood that a peptide is presented by HLA molecules, based on a fused neural network model and ensemble learning model. Each presentation likelihood represents the likelihood that the corresponding peptide is presented on the surface of the subject's tumor cells by one or more HLA alleles.

In some embodiments, the method includes a model building step and a prediction and/or identification application step.

In some embodiments, the model building step includes constructing a dataset, training a neural network model and an ensemble learning model, and fusing the models. In some embodiments, it includes training parameters on data provided by mass spectrometry experiments.

In some embodiments, the prediction and/or identification application step includes predicting and/or identifying the likelihood that a test peptide will be presented by an HLA allele.

The present disclosure obtains the data for constructing the dataset from mass spectrometry experiments in public databases. Suitable databases include, but are not limited to, sequences accessible in NCBI, EMBL, GenBank, RefSeq, the UCSC databases, and the like, as well as GenPept, EBI expression profiles (https://www.ebi.ac.uk/gxa/home), UniProtKB/Swiss-Prot (http://www.uniprot.org/), the Protein Information Resource (PIR) (http://pir.georgetown.edu/pirwww/index.shtml), COSMIC, MOWSE, DDBJ, PDB, EST, STS, GSS, HTGS, IEDB, SYFPEITHI, MassIVE, and other computerized databases known to those of ordinary skill in the art.

In some embodiments, the dataset includes data on positive peptides presented by HLA types that occur at high frequency in the population.

In some specific embodiments, peptides, immunogenicity data, and expression data obtained from mass spectrometry experiments are retrieved from the public databases IEDB, SYFPEITHI, and MassIVE.

The peptide sequences used in the training and test sets can be of any length. In some embodiments, the length is less than or equal to 50; in some embodiments, less than or equal to 40; in some embodiments, less than or equal to 30; and in some embodiments, less than or equal to 20 or less than or equal to 10.

In some specific embodiments, the peptide sequences used in the training and test sets are 5-15, 7-12, or 8-11 amino acids long, for example 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 amino acids.

In some specific embodiments, the training set includes positive peptides presented by HLA types that occur at high frequency in the population.

In some embodiments, the dataset includes a positive peptide set and a negative peptide set.

The positive peptide set includes entries for peptides identified or inferred from surface-bound or secreted HLA/peptide complexes encoded by one or more different HLA alleles. In some embodiments, the positive peptide set includes positive peptides presented by HLA types that occur at high frequency in the population.

The negative peptide set includes entries for peptides that were not identified or inferred from surface-bound or secreted HLA/peptide complexes.

In some embodiments, the dataset is divided into one or more training datasets (training sets) and one or more test datasets (test sets).

In some embodiments, the positive and negative peptide sets in the dataset are mixed at a ratio of 1:1 to 1:20000; in some embodiments, at 1:1 to 1:10000, 1:1 to 1:5000, 1:1 to 1:4000, 1:1 to 1:3000, 1:1 to 1:2000, 1:1 to 1:1000, or 1:1 to 1:500; for example 1:20000, 1:10000, 1:5000, 1:2000, 1:1000, 1:500, 1:300, 1:150, 1:100, 1:50, 1:10, or 1:1.
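The mixing of positive and negative peptide sets at a fixed ratio can be sketched as follows. This is a minimal illustration only: the function name `mix_dataset`, the random sampling of negatives, and the placeholder peptide strings are assumptions and do not come from the disclosure.

```python
import random

def mix_dataset(positives, negatives, neg_per_pos, seed=0):
    """Build a labelled dataset with `neg_per_pos` negatives per positive (ratio 1:neg_per_pos).

    Hypothetical helper: the disclosure states the ratios but not how negatives are drawn.
    """
    rng = random.Random(seed)
    n_neg = len(positives) * neg_per_pos
    if n_neg > len(negatives):
        raise ValueError("not enough negative peptides for the requested ratio")
    sampled = rng.sample(negatives, n_neg)  # draw negatives without replacement
    data = [(p, 1) for p in positives] + [(n, 0) for n in sampled]
    rng.shuffle(data)
    return data

# Placeholder sequences for illustration, not real presentation data.
positives = ["ACDEFGHIK", "LMNPQRSTV"]
negatives = ["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD", "EEEEEEEEE",
             "FFFFFFFFF", "GGGGGGGGG", "HHHHHHHHH"]
dataset = mix_dataset(positives, negatives, neg_per_pos=3)  # ratio 1:3
```

At the large ratios mentioned in the text (for example 1:1000), the negative pool must be correspondingly large, which is why the sketch fails loudly when it is not.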

In some specific embodiments, the dataset includes a monoallelic dataset and/or a multiallelic dataset.

In some specific embodiments, the model is trained on training data generated by expressing a single HLA allele.

In some specific embodiments, the model is trained on training data generated by expressing a single HLA allele, expressing multiple HLA alleles, or a combination thereof.

The model building step includes dataset preprocessing. In some embodiments, the data in the dataset are encoded before model training. In some embodiments, this includes numerically encoding the peptide, expression level, and HLA typing data in the dataset. For example, the HLA type and the amino acids of the peptide sequence are encoded and input into the model for training, together with the expression data and the ID of the protein family to which the peptide belongs.

In some embodiments, one-hot encoding, BLOMAP, PSSM, word2vec, or BLOSUM62 is used to encode the peptide and HLA typing data into data rows composed of multiple numbers.

In some specific embodiments, one-hot encoding is used, in which an N-bit state register encodes N states. The amino acids of the peptide sequences in the embodiments of the present disclosure have 21 states in total (20 amino acids plus 1 gap; the gap is denoted by Z, and the states are ordered alphabetically by one-letter amino acid code, A-Z). For the first state, amino acid A (alanine), the one-hot encoding is [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]; for the last state, the gap, the one-hot encoding is [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]; the other states follow by analogy.

In some specific embodiments, one-hot encoding is implemented using Scikit-learn 0.20.3.
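The 21-state one-hot scheme can be illustrated in plain Python. This sketch does not use Scikit-learn; it only reproduces the encoding described in the text. Padding short peptides with the gap symbol Z up to a fixed length (`max_len=11` here) is an assumption made so that every peptide yields a row of equal width.

```python
# 20 amino acids in alphabetical order of their one-letter codes, plus gap "Z" last.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(residue):
    """Return the 21-element one-hot vector for one residue or gap."""
    vec = [0] * len(ALPHABET)
    vec[INDEX[residue]] = 1
    return vec

def encode_peptide(peptide, max_len=11):
    """Flatten a peptide into one data row; right-padding with Z is an assumption."""
    if len(peptide) > max_len:
        raise ValueError("peptide longer than max_len")
    row = []
    for residue in peptide.ljust(max_len, "Z"):
        row.extend(one_hot(residue))
    return row

row = encode_peptide("ACDEFGHIK")  # placeholder 9-mer, padded to 11 positions
```

With these settings each peptide becomes a row of 11 x 21 = 231 binary values, exactly one of which is set per position.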

In some embodiments, the neural network model includes a fully connected neural network, a convolutional neural network, a long short-term memory neural network, and the like. In some specific embodiments, the neural network model is a fully connected neural network model.

In some embodiments, a fully connected neural network model is used to train parameters on mass spectrometry data, specifically:

Suppose there are m training peptides, each corresponding to n HLA types:

[Equation not preserved in the source text]

where the symbols of the equation, likewise not preserved, denote respectively the data matrix of the training peptides after data encoding (excluding HLA typing) and the positive/negative label of each training peptide in the mass spectrometry experiment, and β_target1 is the neural network coefficient matrix, whose coefficients are those that give the highest accuracy in cross-validation training.

In some specific embodiments, training uses 3-fold or 5-fold cross-validation.
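A k-fold split of the kind used in 3-fold or 5-fold cross-validation can be sketched as below. The helper `k_fold_indices` is hypothetical and splits contiguous index blocks; in practice the data would normally be shuffled first, and the disclosure does not say how folds were formed.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs covering all samples once."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds

folds = k_fold_indices(10, 3)  # 3-fold split of 10 samples
```

Each model configuration is trained on the `train` part of every fold and scored on the matching `validation` part; the coefficients kept are those with the best average score.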

In some embodiments, the fully connected neural network model has 3-5 layers, for example 3, 4, or 5 layers.

In some specific embodiments, the fully connected neural network model has 4 layers, with 256, 32, 16, and 1 neurons in successive layers.

In some embodiments, the activation function is sigmoid, tanh, ReLU, or softmax.

In some embodiments, the loss function is the negative Bernoulli log loss or cross-entropy. In some specific embodiments, the cross-entropy is binary cross-entropy (binary crossentropy).

In some embodiments, the neural network is initialized using Glorot initialization, Kaiming initialization, LeCun initialization, or batch normalization.
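The 4-layer architecture (256, 32, 16, 1 neurons) can be sketched as a forward pass in plain Python. Several choices here are assumptions, since the text lists options without fixing a combination: ReLU on the hidden layers, sigmoid on the output, Glorot uniform initialization, zero biases, and an input width of 231 (an 11-residue peptide in 21-state one-hot form). No training loop is shown.

```python
import math
import random

LAYER_SIZES = [256, 32, 16, 1]  # neurons per layer, as stated in the text

def glorot_uniform(n_in, n_out, rng):
    """Glorot (Xavier) uniform initialization of an n_in x n_out weight matrix."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

def init_network(n_features, rng):
    """Create (weights, biases) for each dense layer; biases start at zero."""
    layers, n_in = [], n_features
    for n_out in LAYER_SIZES:
        layers.append((glorot_uniform(n_in, n_out, rng), [0.0] * n_out))
        n_in = n_out
    return layers

def sigmoid(v):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))

def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the single output neuron (assumed choice)."""
    a = x
    for i, (weights, biases) in enumerate(layers):
        z = [sum(a[j] * weights[j][k] for j in range(len(a))) + biases[k]
             for k in range(len(biases))]
        is_last = i == len(layers) - 1
        a = [sigmoid(v) for v in z] if is_last else [max(0.0, v) for v in z]
    return a[0]

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Binary cross-entropy for one example, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

rng = random.Random(0)
network = init_network(231, rng)
score = forward(network, [0.0] * 231)  # presentation score in (0, 1)
```

The single sigmoid output is read as a presentation score, which is what makes binary cross-entropy a natural loss for the positive/negative labels.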

In some embodiments, the ensemble learning model includes bagging and boosting methods, for example random forest, AdaBoost, gradient boosting decision tree (GBDT), XGBoost, and LightGBM.

In some specific embodiments, the ensemble learning model is XGBoost.

In some specific embodiments, the XGBoost ensemble learning model is used to train parameters on mass spectrometry data, specifically:

[Equation not preserved in the source text]

where the symbols of the equation, likewise not preserved, denote respectively the data matrix of the training peptides after data encoding (including HLA typing) and the positive/negative label of each training peptide in the mass spectrometry experiment, and β_target2 is the ensemble learning coefficient matrix, whose coefficients are those that give the highest accuracy in cross-validation training.

In some embodiments, the model parameters and architecture are optimized with the positive predictive value (PPV) on the test set as the preferred objective.

The parameters optimized for the ensemble learning model include the maximum depth of a classification tree (max_depth), the minimum sum of instance weights required in a child node (min_child_weight), the subsampling ratio of columns when constructing each tree (colsample_bytree), the minimum loss reduction required to make a split on a leaf node (gamma), the maximum number of weak learners (n_estimators), the learning rate (learning_rate), and the subsample ratio of training instances (subsample).

In some specific embodiments, max_depth is selected from 3-10 or 4-8, for example 3, 4, 5, 6, 7, 8, 9, or 10.

In some specific embodiments, min_child_weight is selected from 2-10, 3-9, or 4-8, for example 2, 3, 4, 5, or 6.

In some specific embodiments, colsample_bytree is selected from 0.40-1.1, 0.5-0.90, 0.50-0.80, 0.50-0.60, 0.45-0.55, or 0.50-0.54.

In some specific embodiments, gamma is selected from 0.01-1.0, 0.05-1.0, 0.1-1.0, 0.2-0.9, 0.3-0.8, 0.4-0.7, or 0.5-0.6.

In some specific embodiments, n_estimators is selected from 100-2500, 500-2300, 1000-1800, 1500-1700, or 1550-1650.

In some specific embodiments, learning_rate is selected from 0.01-0.5, for example 0.02-0.5, 0.03-0.45, 0.04-0.40, 0.05-0.35, 0.06-0.30, or 0.07-0.25.

In some specific embodiments, subsample is selected from 0.5-1, 0.6-0.9, or 0.7-0.8.
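The hyperparameter ranges above can be collected into a configuration and checked before training. In this sketch the specific values in `XGB_PARAMS` are illustrative picks from inside the stated ranges, not tuned values from the disclosure, and `out_of_range` is a hypothetical helper. The upper bound for `colsample_bytree` is capped at 1.0 because XGBoost only accepts values in (0, 1].

```python
# Widest ranges stated in the text (colsample_bytree capped at XGBoost's limit of 1.0).
SOURCE_RANGES = {
    "max_depth": (3, 10),
    "min_child_weight": (2, 10),
    "colsample_bytree": (0.40, 1.0),
    "gamma": (0.01, 1.0),
    "n_estimators": (100, 2500),
    "learning_rate": (0.01, 0.5),
    "subsample": (0.5, 1.0),
}

# Illustrative values only; the disclosure does not report a final tuned set here.
XGB_PARAMS = {
    "max_depth": 6,
    "min_child_weight": 4,
    "colsample_bytree": 0.52,
    "gamma": 0.5,
    "n_estimators": 1600,
    "learning_rate": 0.1,
    "subsample": 0.8,
}

def out_of_range(params, ranges=SOURCE_RANGES):
    """Return the names of parameters that fall outside the stated ranges."""
    return [name for name, (low, high) in ranges.items()
            if not low <= params[name] <= high]

problems = out_of_range(XGB_PARAMS)
# With the xgboost package installed, the dict could be passed on as
# xgboost.XGBClassifier(**XGB_PARAMS); that call is not executed in this sketch.
```

Keeping the ranges in one table makes a grid or random search over them straightforward, with test-set PPV as the selection criterion described in the text.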

In some embodiments, predicting and/or identifying the likelihood that a peptide is presented by HLA molecules includes fusing the neural network model coefficient matrix and the ensemble learning model coefficient matrix.

In some specific embodiments, this includes fusing the neural network model and the ensemble learning model.

In some embodiments, strategies for fusing the neural network model and the ensemble learning model include averaging, voting, and learning methods, where averaging includes simple averaging or weighted averaging.

In some embodiments, fusion includes multiplying the score of the ensemble learning model's output in the neural network model by a weight coefficient to form the final model.

In some specific embodiments, the weight coefficient is optimized with the positive predictive value (PPV) as the preferred objective.

In some embodiments, the weight coefficient is selected from 0.1-0.00001, 0.005-0.0001, 0.05-0.0001, 0.02-0.001, or 0.01-0.005, for example 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0002, or 0.0001.
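The averaging and voting strategies named above can be sketched as follows. The text does not give the exact fusion formula, so the convex combination in `weighted_average` is only one plausible reading and is an assumption; the function names and the 0.5 voting cutoff are also hypothetical.

```python
def simple_average(nn_score, ensemble_score):
    """Simple averaging of the two model scores."""
    return (nn_score + ensemble_score) / 2.0

def weighted_average(nn_score, ensemble_score, weight):
    """Weighted averaging: the ensemble score enters with `weight`, the network with 1 - weight.

    Assumed form; the disclosure only says the ensemble output is multiplied by a weight.
    """
    return (1.0 - weight) * nn_score + weight * ensemble_score

def vote(nn_score, ensemble_score, cutoff=0.5):
    """Unanimous voting: positive only if both models score at or above the cutoff."""
    return int(nn_score >= cutoff and ensemble_score >= cutoff)

fused = weighted_average(0.8, 0.4, weight=0.01)  # weight taken from the stated examples
```

In practice the weight would be chosen by scanning the candidate values listed in the text and keeping the one with the best PPV on held-out data.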

In some embodiments, the method for predicting and/or identifying the likelihood that a peptide is presented by HLA molecules includes predicting and/or identifying, from a test sample, the likelihood that each peptide in the sample is presented by HLA molecules.

In some embodiments, this includes predicting and/or identifying the likelihood that each peptide in the test sample is presented by HLA molecules based on the peptide, expression level, and HLA typing data of the test sample.

In some embodiments, the test sample undergoes the same preprocessing as in the model building stage. The processed data are input into the model for prediction and/or identification and classified using a defined threshold, and the immunogenicity classification result for the data is output.

In some embodiments, the test sample data are encoded.

In some embodiments, one-hot encoding, BLOMAP, PSSM, word2vec, or BLOSUM62 is used to encode the peptide and HLA typing data of the test sample into data rows composed of multiple numbers.

In some specific embodiments, with one-hot encoding, the first state (amino acid A) is encoded as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]; the last state, the gap, is encoded as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]; the other states follow by analogy.

In some embodiments, the peptides of the sample to be predicted and/or identified undergo the same preprocessing as in the model building stage; the processed data are input into the model for prediction and/or identification and classified using a defined threshold, and the resulting likelihood of presentation by HLA molecules is output. Peptides in the test sample that score above the threshold are predicted and/or identified as being presented by HLA molecules.

In some embodiments, the dataset to be predicted and/or identified further includes at least one of: data related to a peptide-HLA binding affinity measurement for at least one of the peptides; and data related to a peptide-HLA binding stability measurement for at least one of the peptides.

In some embodiments, the method is used to predict and/or identify the immunogenicity of a test peptide, specifically:

[Equation not preserved in the source text]

where X is the data row obtained after each peptide is encoded by the data encoding module.

In some embodiments, the threshold is set either by taking the top-scoring 0.1% as positive (top 0.1%) or by the positive predictive value at 40% recall (recall 0.4); peptides in the test sample that score above the threshold are predicted and/or identified as being presented by HLA molecules.

In some specific embodiments, recall 0.4 is used as the threshold.
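The two thresholding criteria can be made concrete with a short sketch. Both helpers are hypothetical: `ppv_at_recall` walks down the ranked scores until the requested fraction of positives has been recovered and reports the PPV and score cutoff at that point, and `ppv_top_fraction` scores the top-ranked fraction of peptides. The toy scores and labels are invented for illustration.

```python
import math

def ppv_at_recall(scores, labels, recall=0.4):
    """Return (PPV, score cutoff) at the first rank where recall reaches the target."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positive labels")
    needed = math.ceil(recall * n_pos)  # positives that must be recovered
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    true_pos = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos >= needed:
            return true_pos / rank, score
    raise ValueError("recall target not reachable")

def ppv_top_fraction(scores, labels, fraction=0.001):
    """PPV among the top-scoring `fraction` of peptides (at least one peptide)."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(label for _, label in ranked[:k]) / k

# Invented toy data: 10 peptides, 5 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
ppv, cutoff = ppv_at_recall(scores, labels, recall=0.4)
```

On this toy data, two of the five positives must be recovered to reach 40% recall; that happens at the third-ranked peptide, so the cutoff is 0.7 and the PPV is 2/3.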

Obtaining the peptides of the sample to be predicted and/or identified in the present disclosure includes obtaining at least one of exome, transcriptome, or whole-genome tumor nucleotide sequencing data from the subject's tumor cells, where the tumor nucleotide sequencing data are used to obtain the peptide data.

In some embodiments, the sample to be predicted and/or identified includes at least one of: a cell line engineered to express a single HLA allele; a cell line engineered to express multiple HLA alleles; human cell lines obtained or derived from multiple patients; fresh or frozen tumor samples obtained from multiple patients; and fresh or frozen tissue samples obtained from multiple patients.

In some embodiments, the presentation likelihood is optionally further determined by the expression level of one or more HLA alleles in the subject, as measured, for example, by RNA-seq or mass spectrometry. In some specific embodiments, the test sample dataset is preprocessed in the same way as in model building. In some specific embodiments, one-hot encoding, BLOMAP, PSSM, word2vec, or BLOSUM62 is used to encode the peptide and HLA typing data of the test sample into data rows composed of multiple numbers.

In some embodiments, the prediction and/or identification is selected from single-HLA-allele prediction and/or identification and multi-allele prediction and/or identification.

In some embodiments, the fused neural network model and ensemble learning model of the present disclosure is selected from a monoallelic model and a multiallelic model.

In some specific embodiments, the monoallelic model is built on a single HLA allele and predicts and/or identifies the likelihood that a peptide will be presented by that HLA allele.

In some specific embodiments, the multiallelic model is built in a multiallelic setting in which two or more HLA alleles are present, and predicts and/or identifies the likelihood that a peptide will be presented by the multiple HLA alleles.

In some specific embodiments, the two or more HLA alleles include two or more different HLA alleles.

In some specific embodiments, the HLA alleles include class I HLA alleles.

In some specific embodiments, the HLA alleles include class II HLA alleles.

In some specific embodiments, the HLA alleles include HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, HLA-G, HLA-DQ, HLA-DR, and HLA-DP.

In some specific embodiments, the HLA allele is selected from A*01:01, A*02:01, A*02:03, A*02:04, A*02:07, A*03:01, A*24:02, A*29:02, A*31:01, A*68:02, B*35:01, B*44:02, B*44:03, B*51:01, B*54:01, B*57:01, C*03:02, C*03:04, C*04:01, C*05:01, C*06:02, C*08:01, C*08:02, C*12:02, C*14:02, C*14:03, C*15:02, and C*16:01. Those of ordinary skill in the art know that allelic variants of the above HLA types exist, and the present invention covers all of these allelic variants. A complete list of HLA alleles can be found at http://hla.alleles.org/alleles/. For example, a complete list of HLA class I alleles can be found at http://hla.alleles.org/alleles/class1.html.

In some specific embodiments, the present disclosure provides a method for predicting and/or identifying the likelihood that a peptide or combination of peptides is presented by HLA molecules, based on a fused neural network model and ensemble learning model, comprising:

Step 1: Data encoding. The peptide, expression level, HLA typing data, and the like are numerically encoded;

Step 2: Neural network model training. A fully connected neural network model is used to train parameters on the data provided by mass spectrometry experiments;

Step 3: Ensemble learning model training. An ensemble learning model is used to train parameters on the immunogenicity data provided by mass spectrometry experiments;

Step 4: Prediction and/or identification of the likelihood that peptides are presented by HLA molecules. From the peptide, expression level, and HLA typing data of the sample to be predicted and/or identified, the likelihood that each peptide in the sample is presented by HLA molecules is predicted and/or identified according to the neural network coefficient matrix and the ensemble learning coefficient matrix.

本揭露提供了一種基於融合神經網路模型和集成學習模型預測和/或鑑定肽由HLA分子呈遞可能性的方法，該方法包括：構建資料集和資料編碼；訓練、融合神經網路模型和集成學習模型；預測和/或鑑定肽由HLA分子呈遞可能性，其中，資料集、資料編碼、訓練、融合、神經網路模型、集成學習模型、預測和/或鑑定如本揭露所述。 The present disclosure provides a method for predicting and/or identifying the likelihood of peptide presentation by HLA molecules based on a fused neural network model and ensemble learning model. The method includes: constructing a data set and encoding the data; training and fusing the neural network model and the ensemble learning model; and predicting and/or identifying the likelihood of peptide presentation by HLA molecules, wherein the data set, data encoding, training, fusion, neural network model, ensemble learning model, prediction and/or identification are as described in the present disclosure.

本揭露提供了機器學習模型在製備mRNA、多肽疫苗、抗腫瘤藥物或腫瘤疫苗中的應用。一些具體實施方案中，機器學習模型為本揭露提供的融合的神經網路模型和集成學習模型。 The present disclosure provides the application of machine learning models in the preparation of mRNA, polypeptide vaccines, antitumor drugs or tumor vaccines. In some specific embodiments, the machine learning model is the fused neural network model and the ensemble learning model provided by the present disclosure.

本揭露提供了用於鑑別來自受試者的一個或多個腫瘤細胞的可能被一個或多個HLA等位基因呈遞在該腫瘤細胞表面上的至少一種肽的方法，包括藉由本揭露所述預測和/或鑑定肽由HLA分子呈遞可能性的方法。 The present disclosure provides a method for identifying at least one peptide, from one or more tumor cells of a subject, that is likely to be presented on the surface of the tumor cell by one or more HLA alleles, the method comprising applying the method of the present disclosure for predicting and/or identifying the likelihood of peptide presentation by HLA molecules.

本揭露提供了本揭露所述預測和/或鑑定肽由HLA分子呈遞可能性的方法在製備mRNA、多肽疫苗、抗腫瘤藥物或腫瘤疫苗中的應用。 The present disclosure provides the application of the method for predicting and/or identifying the possibility of peptide presentation by HLA molecules described in the present disclosure in the preparation of mRNA, polypeptide vaccines, anti-tumor drugs or tumor vaccines.

一些具體實施方案中，腫瘤選自由以下組成的組：肺癌、黑色素瘤、乳癌、卵巢癌、前列腺癌、腎癌、胃癌、結腸癌、睾丸癌、頭頸癌、胰腺癌、腦癌、B細胞淋巴瘤、急性骨髓性白血病、慢性骨髓性白血病、慢性淋巴細胞性白血病和T細胞淋巴細胞性白血病、非小細胞肺癌和小細胞肺癌。 In some embodiments, the tumor is selected from the group consisting of: lung cancer, melanoma, breast cancer, ovarian cancer, prostate cancer, kidney cancer, stomach cancer, colon cancer, testicular cancer, head and neck cancer, pancreatic cancer, brain cancer, B-cell lymphoma, acute myeloid leukemia, chronic myeloid leukemia, chronic lymphocytic leukemia, T-cell lymphocytic leukemia, non-small cell lung cancer and small cell lung cancer.

一些實施方案中，該方法進一步包括從被預測和/或鑑定為由HLA分子呈遞的肽或肽的組合產生用於構建個性化癌症疫苗的輸出。在這樣的實施方案中，個性化癌症疫苗的輸出可包括編碼該選定的肽或肽的組合的至少一個肽或至少一個多核苷酸。 In some embodiments, the method further comprises generating an output for constructing a personalized cancer vaccine from a peptide or combination of peptides predicted and/or identified to be presented by HLA molecules. In such embodiments, the output of the personalized cancer vaccine may include at least one peptide or at least one polynucleotide encoding the selected peptide or combination of peptides.

另一方面，提供了基於本揭露的基於融合的神經網路模型和集成學習模型，鑑別腫瘤中變體或等位基因突變的方法。 In another aspect, methods are provided for identifying variant or allelic mutations in tumors based on the fusion-based neural network models and ensemble learning models of the present disclosure.

一些實施方案中，包括鑑別在腫瘤細胞中具有突變的肽或由例如剪接位點突變、移碼突變、通讀突變或基因融合突變產生的突變多肽。 In some embodiments, identification of peptides with mutations in tumor cells or mutant polypeptides resulting from, for example, splice site mutations, frameshift mutations, readthrough mutations, or gene fusion mutations is included.

本揭露提供了一種肽或肽的組合，其包括一種或多種由本揭露該方法預測和/或鑑定的肽。 The present disclosure provides a peptide or combination of peptides comprising one or more peptides predicted and/or identified by the methods of the present disclosure.

本揭露提供了一種肽或肽的組合，其包括一種或多種由本揭露該方法預測和/或鑑定為由HLA分子呈遞的肽。 The present disclosure provides a peptide or combination of peptides comprising one or more peptides predicted and/or identified as presented by HLA molecules by the methods of the present disclosure.

一些實施方案中，肽是以高於野生型肽的親和力呈遞於HLA蛋白質上。 In some embodiments, the peptide is presented on the HLA protein with a higher affinity than the wild-type peptide.

一些具體實施方案中，肽的IC50值可以是至少低於5000nM、至少低於1000nM、至少低於500nM、至少低於250nM、至少低於200nM、至少低於150nM、至少低於100nM、至少低於50nM或更低。 In some embodiments, the peptide can have an IC50 value of at least below 5000 nM, at least below 1000 nM, at least below 500 nM, at least below 250 nM, at least below 200 nM, at least below 150 nM, at least below 100 nM, or at least below 50 nM or lower.

一些實施方案中，肽的長度包括但不限5、6、7、8、9、10、11、12、13、14、15、16、17、18、19、20、21、22、23、24、25、26、27、28、29、30、31、32、33、34、35、36、37、38、39、40、41、42、43、44、45、46、47、48、49、50、60、70、80、90、100、110、120或更多個胺基分子殘基，以及由其中可衍生的任何範圍。在特定實施方案中，肽的長度等於或少於50個胺基酸。 In some embodiments, the length of the peptide includes, but is not limited to, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 60, 70, 80, 90, 100, 110, 120 or more amino acid residues, and any range derivable therefrom. In certain embodiments, the peptide is 50 amino acids or fewer in length.

一些實施方案中，肽的組合包括至少兩個或更多個肽。 In some embodiments, the combination of peptides includes at least two or more peptides.

一些具體實施方案中，該組合含有至少兩個不同的肽。 In some embodiments, the combination contains at least two different peptides.

一些實施方案中，具有所希望的活性或特性的肽可以被修飾成用於提供某些所希望的屬性，例如改良的藥理學特徵，同時增加或至少保持未修飾肽的大體上所有生物活性以結合所希望的HLA分子並活化適當T細胞。 In some embodiments, a peptide with a desired activity or property can be modified to provide certain desired attributes, such as improved pharmacological characteristics, while increasing or at least maintaining substantially all of the biological activity of the unmodified peptide in binding the desired HLA molecule and activating the appropriate T cells.

本揭露提供了用於鑑別來自受試者的一個或多個腫瘤細胞的可能被一個或多個HLA等位基因呈遞在該腫瘤細胞表面上的至少一種肽的方法，該肽選自本揭露所述的方法預測和/或鑑定的肽。 The present disclosure provides a method for identifying at least one peptide, from one or more tumor cells of a subject, that is likely to be presented on the surface of the tumor cell by one or more HLA alleles, wherein the peptide is selected from peptides predicted and/or identified by the method described in the present disclosure.

本揭露提供了一種分離的T細胞，其對本揭露所述的肽或肽的組合具有特異性。 The present disclosure provides an isolated T cell specific for a peptide or combination of peptides described in the present disclosure.

本揭露提供了一種多核苷酸，其編碼本揭露所述的肽或肽的組合。 The present disclosure provides a polynucleotide encoding a peptide or combination of peptides described in the present disclosure.

一些實施方案中，多核苷酸可以是例如單鏈和/或雙鏈DNA、cDNA、PNA、CAN、RNA(例如mRNA)，或多核苷酸的天然或化學修飾形式，或其組合，並且該多核苷酸可以含有或可以不含內含子。 In some embodiments, the polynucleotide may be, for example, single- and/or double-stranded DNA, cDNA, PNA, CAN, RNA (e.g., mRNA), or a natural or chemically modified form of a polynucleotide, or a combination thereof, and the polynucleotide may or may not contain introns.

一些具體的實施方案中，多核苷酸選自mRNA。 In some specific embodiments, the polynucleotide is selected from mRNA.

一些實施方案中，多核苷酸被直接遞送。 In some embodiments, the polynucleotide is delivered directly.

一些實施方案中，多核苷酸藉由遞藥系統遞送。各種遞藥系統是已知的並且可以用於本揭露的多核苷酸，例如封裝在病毒載體、mRNA載體、DNA載體、脂質體中。 In some embodiments, the polynucleotide is delivered by a drug delivery system. Various delivery systems are known and can be used with the polynucleotides of the present disclosure, eg, encapsulation in viral vectors, mRNA vectors, DNA vectors, liposomes.

一些具體的實施方案中，多核苷酸與陽離子性化合物，如陽離子性脂質，形成複合物遞送。 In some specific embodiments, the polynucleotides are delivered in complexes with cationic compounds, such as cationic lipids.

本揭露提供了一種載體，其含有如本揭露所述多核苷酸，該載體為真核表達載體、原核表達載體或病毒載體。 The present disclosure provides a vector containing the polynucleotide described in the present disclosure, and the vector is a eukaryotic expression vector, a prokaryotic expression vector or a viral vector.

另一方面，本揭露提供了一種宿主細胞，其包含本揭露所述的載體；一些實施方案中，該宿主細胞為細菌、酵母、哺乳動物細胞，一些實施方案中，該宿主細胞為大腸桿菌、畢赤酵母、中國倉鼠卵巢細胞或人胚腎293細胞。 In another aspect, the present disclosure provides a host cell comprising the vector described in the present disclosure; in some embodiments, the host cell is bacteria, yeast, mammalian cells, and in some embodiments, the host cell is Escherichia coli, Pichia pastoris, Chinese hamster ovary cells or human embryonic kidney 293 cells.

本揭露提供了一種製備肽的方法，該肽由本揭露所述的方法預測和/或鑑定。 The present disclosure provides a method of making a peptide predicted and/or identified by the method described in the present disclosure.

一些實施方案中，該肽由本揭露所述的方法預測和/或鑑定為由HLA分子呈遞。 In some embodiments, the peptide is predicted and/or identified by the methods described herein to be presented by an HLA molecule.

一些實施方案中，該肽為本揭露所述的肽或肽的組合。 In some embodiments, the peptide is a peptide or combination of peptides described in this disclosure.

本揭露提供了一種製備肽的方法，包含本揭露所述的預測和/或鑑定肽由HLA分子呈遞可能性方法的步驟。 The present disclosure provides a method for preparing a peptide comprising the steps of the method described in the present disclosure for predicting and/or identifying the likelihood of peptide presentation by HLA molecules.

本揭露提供了一種製備多核苷酸的方法。 The present disclosure provides a method of making a polynucleotide.

一些實施方案中，該多核苷酸編碼由本揭露所述的方法預測和/或鑑定的由HLA分子呈遞的肽。 In some embodiments, the polynucleotide encodes a peptide presented by an HLA molecule predicted and/or identified by the methods described herein.

一些實施方案中，該多核苷酸編碼本揭露所述的肽或肽的組合。 In some embodiments, the polynucleotide encodes a peptide or combination of peptides described herein.

一些實施方案中，該多核苷酸為本揭露所述的多核苷酸。 In some embodiments, the polynucleotide is the polynucleotide of the disclosure.

本揭露提供了抗原，其包括本揭露所述的多核苷酸或肽。 The present disclosure provides antigens comprising the polynucleotides or peptides described in the present disclosure.

一些實施方案中，抗原包括了編碼肽或其部分的多核苷酸。該多核苷酸可以是例如單鏈和/或雙鏈DNA、cDNA、PNA、CAN、RNA(例如mRNA)，或多核苷酸的天然或化學修飾形式，或其組合，並且該多核苷酸可以含有或可以不含內含子。 In some embodiments, the antigen includes a polynucleotide encoding a peptide or portion thereof. The polynucleotide may be, for example, single- and/or double-stranded DNA, cDNA, PNA, CAN, RNA (e.g., mRNA), or a natural or chemically modified form of a polynucleotide, or a combination thereof, and the polynucleotide may or may not contain introns.

一些實施方案中，抗原包括藉由本揭露的方法預測和/或鑑定的腫瘤特異性突變的分離的肽、包含已知腫瘤特異性突變的肽，以及藉由本文所公開的方法預測和/或鑑定的肽或其片段。一些實施方案中，一種或多種抗原可以被呈遞在腫瘤表面上。 In some embodiments, antigens include isolated peptides carrying tumor-specific mutations predicted and/or identified by the methods of the present disclosure, peptides comprising known tumor-specific mutations, and peptides or fragments thereof predicted and/or identified by the methods disclosed herein. In some embodiments, one or more antigens can be presented on the tumor surface.

一些實施方案中，一種或多抗原可以在患腫瘤的受試者中具有免疫原性，例如能夠在該受試者體內引起T細胞應答或B細胞應答。 In some embodiments, one or more antigens may be immunogenic in a subject with a tumor, eg, capable of eliciting a T cell response or a B cell response in the subject.

一些實施方案中，在產生用於患腫瘤的受試者的疫苗的情況下，可以考慮排除在受試者體內誘導自體免疫應答的一種或多種抗原。 In some embodiments, where a vaccine is produced for a subject with a tumor, it may be contemplated to exclude one or more antigens that induce an autoimmune response in the subject.

一些實施方案中，提供了一種能夠表達肽或其部分的表達載體。 In some embodiments, an expression vector capable of expressing a peptide or portion thereof is provided.

本揭露的抗原可以使用本領域中已知的方法製造，包括在適於表達該抗原或肽或載體的條件下培養宿主細胞，其中該宿主細胞包含至少一個編碼抗原或肽或載體的多核苷酸；以及純化該抗原或肽或載體。標準純化方法包括色譜技術、電泳技術、免疫技術、沉澱、透析、過濾、濃縮和等電聚焦技術。 The antigens of the present disclosure can be produced using methods known in the art, including culturing a host cell under conditions suitable for expression of the antigen or peptide or vector, wherein the host cell comprises at least one polynucleotide encoding the antigen or peptide or vector; and purifying the antigen or peptide or vector. Standard purification methods include chromatographic techniques, electrophoretic techniques, immunological techniques, precipitation, dialysis, filtration, concentration and isoelectric focusing techniques.

另一方面，本揭露提供了一種能夠引起特異性免疫應答(例如腫瘤特異性免疫應答)的組成物，其包含多個使用本揭露所描述的方法預測和/或鑑定的肽或編碼肽或其部分的多核苷酸。 In another aspect, the present disclosure provides a composition capable of eliciting a specific immune response (e.g., a tumor-specific immune response), comprising a plurality of peptides predicted and/or identified using the methods described in the present disclosure, or polynucleotides encoding the peptides or portions thereof.

一些實施方案中，組成物為mRNA、多肽疫苗、抗腫瘤藥物或腫瘤疫苗。 In some embodiments, the composition is an mRNA, a polypeptide vaccine, an anti-tumor drug, or a tumor vaccine.

一些實施方案中，組成物能夠在該受試者體內引起T細胞應答或B細胞應答。 In some embodiments, the composition is capable of eliciting a T cell response or a B cell response in the subject.

一些實施方案中，組成物為免疫原性組成物例如疫苗組成物，或醫藥組成物。 In some embodiments, the composition is an immunogenic composition such as a vaccine composition, or a pharmaceutical composition.

一些實施方案中，疫苗組成物或醫藥組成物包括個數在1個與30個之間的肽，即2、3、4、5、6、7、8、9、10、11、12、13、14、15、16、17、18、19、20、21、22、23、24、25、26、27、28、29或30個不同的肽。 In some embodiments, the vaccine composition or pharmaceutical composition comprises between 1 and 30 peptides, i.e., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 different peptides.

一些實施方案中，可以藉由若干方式設計出更長的肽。在一種情況下，當預測和/或鑑定出或已知肽在HLA等位基因上呈遞的可能性時，較長的肽可以由以下任一種組成：(1)朝各相應基因產物的N末端和C末端延伸2-5個胺基酸的個別呈遞的肽；(2)一些或全部呈遞肽與各自的延伸序列的串接。在另一情況下，當測序披露在腫瘤中存在較長的(>10個殘基)新表位(例如由產生新穎肽的移碼、通讀或包括內含子引起)時，較長的肽將由以下組成：(3)由新穎腫瘤特異性胺基酸組成的整個延伸段，由此繞過了對基於計算或體外測試來選擇HLA呈遞最強的較短肽的需求。在兩種情況下，較長鏈的使用使患者細胞能夠進行內源性加工並且可以產生更有效的抗原呈遞和T細胞應答的誘導作用。 In some embodiments, longer peptides can be designed in several ways. In one case, when the likelihood of peptide presentation on HLA alleles has been predicted and/or identified or is known, a longer peptide may consist of either: (1) an individual presented peptide extended by 2-5 amino acids toward the N-terminus and the C-terminus of the corresponding gene product; or (2) a concatenation of some or all of the presented peptides together with their respective extension sequences. In another case, when sequencing reveals a longer (>10 residues) neo-epitope in the tumor (e.g., caused by a frameshift, read-through, or intron inclusion that generates a novel peptide), the longer peptide would consist of: (3) the entire stretch of novel tumor-specific amino acids, thereby bypassing the need to select the shorter peptide with the strongest HLA presentation by computation or in vitro testing. In both cases, the use of a longer chain allows endogenous processing by the patient's cells and can lead to more efficient antigen presentation and induction of T cell responses.
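Options (1) and (2) above can be sketched in a few lines. The example below is an illustrative sketch only: the function names `extend_peptide` and `concatenate_extended` are hypothetical, the source protein sequence is assumed to be available as a plain string, and extensions are simply clipped at the ends of the protein.

```python
def extend_peptide(protein: str, peptide: str, flank: int = 3) -> str:
    """Option (1): extend a presented peptide by `flank` residues (2-5)
    toward both the N-terminus and the C-terminus of its source protein."""
    if not 2 <= flank <= 5:
        raise ValueError("flank should be between 2 and 5 residues")
    start = protein.find(peptide)  # first occurrence only
    if start < 0:
        raise ValueError("peptide not found in protein sequence")
    left = max(0, start - flank)  # clip at the N-terminus
    right = min(len(protein), start + len(peptide) + flank)  # clip at the C-terminus
    return protein[left:right]


def concatenate_extended(pairs: list[tuple[str, str]], flank: int = 3) -> str:
    """Option (2): concatenate several presented peptides, each with its extensions.

    `pairs` holds (protein, peptide) tuples.
    """
    return "".join(extend_peptide(protein, peptide, flank) for protein, peptide in pairs)
```

With the made-up protein `MKTAYIAKQRQISFVKSHFSRQ` and presented peptide `AKQRQISFV`, a flank of 3 gives the 15-residue peptide `AYIAKQRQISFVKSH`.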

一些具體實施方案中，肽可以包括翻譯後修飾。 In some embodiments, the peptides may include post-translational modifications.

肽可以藉由所屬技術領域具有通常知識者已知的任何技術製備，包括藉由標準分子生物學技術表達多肽或肽、從天然來源分離肽，或化學合成肽。先前已公開對應於各種基因的多核苷酸和蛋白質、多肽和肽，並且可以見於所屬技術領域具有通常知識者已知的電腦化資料庫。一種此類資料庫是位於美國國家衛生研究院(National Institutes of Health)網站的國家生物技術資訊中心(National Center for Biotechnology Information)的Genbank和GenPept資料庫。已知基因的編碼區可以使用本文所公開或所屬技術領域具有通常知識者已知的技術擴增和/或表達。或者，所屬技術領域具有通常知識者已知肽和肽序列的各種市售製劑。 Peptides can be prepared by any technique known to those of ordinary skill in the art, including expression of polypeptides or peptides by standard molecular biology techniques, isolation of peptides from natural sources, or chemical synthesis of peptides. Polynucleotides and proteins, polypeptides and peptides corresponding to various genes have been previously disclosed and can be found in computerized databases known to those of ordinary skill in the art. One such resource is the Genbank and GenPept databases of the National Center for Biotechnology Information, located on the National Institutes of Health website. The coding regions of known genes can be amplified and/or expressed using techniques disclosed herein or known to those of ordinary skill in the art. Alternatively, various commercially available preparations of peptides and peptide sequences are known to those of ordinary skill in the art.

一些實施方案中，疫苗組成物或醫藥組成物包括個數在1個與100個之間或更多個多核苷酸序列，即2、3、4、5、6、7、8、9、10、11、12、13、14、15、16、17、18、19、20、21、22、23、24、25、26、27、28、29、30、31、32、33、34、35、36、37、38、39、40、41、42、43、44、45、46、47、48、49、50、51、52、53、54、55、56、57、58、59、60、61、62、63、64、65、66、67、68、69、70、71、72、73、74、75、76、77、78、79、80、81、82、83、84、85、86、87、88、89、90、91、92、93、94,95、96、97、98、99、100或更多個不同的多核苷酸序列。 In some embodiments, the vaccine composition or pharmaceutical composition comprises between 1 and 100 or more polynucleotide sequences, ie 2, 3, 4, 5, 6, 7, 8, 9, 10 , 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 , 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 , 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85 , 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 or more different polynucleotide sequences.

一些實施方案中，不同肽或編碼其的多核苷酸的選擇使得這些肽能夠與不同HLA分子，如不同的I類HLA分子和/或不同的II類HLA分子締合。 In some embodiments, the selection of different peptides or polynucleotides encoding the same enables these peptides to associate with different HLA molecules, such as different class I HLA molecules and/or different class II HLA molecules.

一些具體實施方案中，疫苗組成物或醫藥組成物包含能夠與最常出現的I類HLA分子和/或II類HLA分子締合的肽。例如，疫苗組成物可以包含能夠與至少2個、至少3個或至少4個I類HLA分子和/或II類HLA分子締合的不同片段。 In some embodiments, the vaccine composition or pharmaceutical composition comprises peptides capable of associating with the most commonly occurring class I HLA molecules and/or class II HLA molecules. For example, the vaccine composition may comprise different fragments capable of associating with at least 2, at least 3, or at least 4 class I HLA molecules and/or class II HLA molecules.

一些實施方案中，該疫苗組成物或醫藥組成物能夠引起特異性細胞毒性T細胞應答和/或特異性輔助T細胞應答。 In some embodiments, the vaccine composition or pharmaceutical composition is capable of eliciting a specific cytotoxic T cell response and/or a specific helper T cell response.

一些實施方案中，疫苗組成物還包括佐劑和/或載體。 In some embodiments, the vaccine composition further includes an adjuvant and/or carrier.

一些實施方案中，醫藥組成物還包括藥學上可接受的載體。 In some embodiments, the pharmaceutical composition further includes a pharmaceutically acceptable carrier.

一些實施方案中，組成物也可以被包括在基於病毒載體的疫苗或藥物平臺中，如牛痘、禽痘、自複製型α病毒、馬拉巴病毒、腺病毒或慢病毒。 In some embodiments, the composition may also be included in a viral vector-based vaccine or drug platform, such as vaccinia, fowlpox, self-replicating alphavirus, Maraba virus, adenovirus, or lentivirus.

一些實施方案中，組成物可以被包括在脂質體中。 In some embodiments, the composition can be included in liposomes.

一些實施方案中，組成物可以藉由腸胃外施用，例如靜脈內、皮下、皮內或肌肉內施用。例如可以製備供靜脈內注射、皮下注射、皮內注射、腹膜內注射、腹腔注射、肌肉內注射的組成物。 In some embodiments, the composition may be administered parenterally, e.g., intravenously, subcutaneously, intradermally, or intramuscularly. For example, compositions can be prepared for intravenous, subcutaneous, intradermal, intraperitoneal or intramuscular injection.

一些實施方案中，本揭露組成物包含肽或編碼肽或其部分的多核苷酸溶液並且被溶解或懸浮於可接受的載體，例如水性載體中。這些組成物可以藉由眾所周知的常規滅菌技術滅菌，或者可以經歷無菌過濾。由此得到的製劑可以被包裝起來按原樣使用，或者被凍乾；凍乾製劑在複溶後施用。 In some embodiments, a composition of the present disclosure comprises a solution of a peptide or polynucleotide encoding a peptide or portion thereof and is dissolved or suspended in an acceptable carrier, eg, an aqueous carrier. These compositions can be sterilized by well-known conventional sterilization techniques, or can be subjected to sterile filtration. The resulting formulation can be packaged for use as is, or lyophilized; the lyophilized formulation is administered after reconstitution.

另一方面，本揭露提供了一種藉由向受試者施用一種或多種抗原或肽，如使用本文所公開的方法預測和/或鑑定的多個抗原或肽來誘導受試者的腫瘤特異性免疫應答、針對腫瘤接種疫苗、治療和或緩解受試者的癌症症狀的方法。 In another aspect, the present disclosure provides a method of inducing a tumor-specific immune response in a subject, vaccinating against a tumor, or treating and/or alleviating cancer symptoms in a subject, by administering to the subject one or more antigens or peptides, such as a plurality of antigens or peptides predicted and/or identified using the methods disclosed herein.

一些實施方案中，腫瘤可以是任何實體腫瘤，如乳房腫瘤、卵巢腫瘤、前列腺腫瘤、肺腫瘤、腎腫瘤、胃腫瘤、結腸腫瘤、睾丸腫瘤、頭頸部腫瘤、胰腺腫瘤、腦腫瘤、黑素瘤及其它組織器官腫瘤；以及血液腫瘤，如淋巴瘤和白血病，包括急性骨髓性白血病、慢性骨髓性白血病、慢性淋巴細胞性白血病、T細胞淋巴細胞性白血病及B細胞淋巴瘤。 In some embodiments, the tumor can be any solid tumor, such as breast tumor, ovarian tumor, prostate tumor, lung tumor, kidney tumor, stomach tumor, colon tumor, testicular tumor, head and neck tumor, pancreatic tumor, brain tumor, melanoma and other tissue and organ tumors; and hematological tumors, such as lymphomas and leukemias, including acute myeloid leukemia, chronic myeloid leukemia, chronic lymphocytic leukemia, T-cell lymphocytic leukemia, and B-cell lymphoma.

一些實施方案中，抗原或肽可以單獨施用或與其它治療劑組合施用。 In some embodiments, the antigen or peptide can be administered alone or in combination with other therapeutic agents.

一些具體實施方案中，治療劑是例如化學治療劑、放射或免疫療法。針對特定癌症的任何適合的治療性治療都可以施用。 In some embodiments, the therapeutic agent is, for example, a chemotherapeutic agent, radiation or immunotherapy. Any suitable therapeutic treatment for a particular cancer can be administered.

另一方面，本揭露提供了一種製造腫瘤疫苗的方法，該方法包括執行本揭露的方法的各個步驟；及產生包含多個抗原或肽或該多個抗原或肽的子集的腫瘤疫苗。一些實施方案中，製造腫瘤疫苗的方法包括藉由鑑別來自受試者的一種或多種腫瘤細胞的可能呈遞於該腫瘤細胞表面上的一種或多種抗原或肽的步驟。 In another aspect, the present disclosure provides a method of manufacturing a tumor vaccine, the method comprising performing the steps of the method of the present disclosure; and producing a tumor vaccine comprising a plurality of antigens or peptides, or a subset of the plurality of antigens or peptides. In some embodiments, the method of manufacturing a tumor vaccine comprises the step of identifying one or more antigens or peptides, from one or more tumor cells of the subject, that are likely to be presented on the surface of the tumor cells.

一些實施方案中，製造腫瘤疫苗的方法包括以下步驟：從受試者的腫瘤細胞獲得外顯子組、轉錄組或全基因組腫瘤核苷酸測序數據中的至少一種，其中該腫瘤核苷酸測序數據用於獲得肽或肽的組合的資料；將每個肽的資料登錄到一個或多個機器學習系統中，以產生肽或肽的組合中的每一者在受試者腫瘤細胞的腫瘤細胞表面上由一個或多個HLA等位元基因遞呈的數值可能性集合，該數值可能性集合視需要地基於所接收的質譜資料進行鑑定；以及基於該數值可能性集合選擇該肽或肽的組合的子集，以產生經選擇的肽或肽的組合；以及產生或已產生包含該經選擇的肽或肽的組合的腫瘤疫苗。本揭露治療和/或預防方法中所用化合物或組成物的劑量通常將隨疾病的嚴重性、患者的體重和化合物的相對功效而改變。不過，作為一般性指導，合適的單位劑量可以是0.1~1000mg。 In some embodiments, the method of manufacturing a tumor vaccine comprises the steps of: obtaining at least one of exome, transcriptome, or whole-genome tumor nucleotide sequencing data from tumor cells of the subject, wherein the tumor nucleotide sequencing data is used to obtain data for a peptide or combination of peptides; inputting the data for each peptide into one or more machine learning systems to generate a set of numerical likelihoods that each of the peptides or combinations of peptides is presented by one or more HLA alleles on the surface of the subject's tumor cells, the set of numerical likelihoods optionally being identified based on received mass spectrometry data; selecting a subset of the peptides or combinations of peptides based on the set of numerical likelihoods to produce selected peptides or combinations of peptides; and producing, or having produced, a tumor vaccine comprising the selected peptides or combinations of peptides. The dosage of a compound or composition used in the treatment and/or prevention methods of the present disclosure will generally vary with the severity of the disease, the weight of the patient, and the relative efficacy of the compound. However, as a general guide, a suitable unit dose may be 0.1 to 1000 mg.
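The selection step of the manufacturing method described above (choosing a subset of peptides from the set of numerical likelihoods) could, under simple assumptions, be a ranked cut-off. The sketch below is illustrative only: the function name `select_peptides`, the default of 20 peptides and the optional score threshold are hypothetical and not taken from the disclosure.

```python
def select_peptides(likelihoods: dict[str, float], top_k: int = 20,
                    min_score: float = 0.0) -> list[str]:
    """Rank peptides by predicted presentation likelihood and keep the best ones.

    `likelihoods` maps each peptide sequence to its predicted likelihood.
    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(likelihoods.items(), key=lambda item: (-item[1], item[0]))
    return [peptide for peptide, score in ranked if score >= min_score][:top_k]
```

A real selection would likely also weigh factors mentioned elsewhere in the disclosure, such as HLA coverage and the exclusion of antigens that could induce an autoimmune response.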

如所屬技術領域具有通常知識者所熟知的，藥物的給藥劑量依賴於多種因素，包括但並非限定於以下因素：所用具體化合物的活性、患者的年齡、患者的體重、患者的健康狀況、患者的行為、患者的飲食、給藥時間、給藥方式、排泄的速率、藥物的組合等；另外，最佳的治療方式如治療的模式、通式化合物(I)的日用量或可藥用的鹽的種類可以根據傳統的治療方案來驗證。 As is well known to those of ordinary skill in the art, the dosage of a drug depends on a variety of factors, including but not limited to: the activity of the specific compound used, the patient's age, the patient's weight, the patient's health status, the patient's behavior, the patient's diet, the time of administration, the mode of administration, the rate of excretion, the combination of drugs, and so on. In addition, the optimal treatment, such as the mode of treatment, the daily dosage of the compound of general formula (I), or the type of pharmaceutically acceptable salt, can be verified according to conventional treatment regimens.

本揭露還提供了電腦系統，其包括電腦處理器和存儲電腦程式指令的記憶體，該電腦程式指令在被電腦處理器執行時使電腦處理器執行本揭露上述方法的實施方案。 The present disclosure also provides a computer system that includes a computer processor and a memory storing computer program instructions that, when executed by the computer processor, cause the computer processor to execute embodiments of the above-described methods of the present disclosure.

一些實施方案中，本揭露電腦系統，能夠基於融合神經網路模型和集成學習模型預測和/或鑑定肽由HLA分子呈遞可能性。 In some embodiments, the computer system of the present disclosure is capable of predicting and/or identifying the likelihood of peptide presentation by HLA molecules based on a fusion neural network model and an ensemble learning model.

一些具體實施方案中，本揭露系統包括： In some specific embodiments, the disclosed system includes:

資料編碼模組：用於將肽、表達量、HLA分型等資料進行數位化編碼； Data encoding module: used to digitize data such as peptides, expression levels, and HLA typing;

神經網路訓練模組：與資料編碼模組相連，利用全連接神經網路模型，在質譜實驗資料上進行參數訓練； Neural network training module: connected to the data encoding module; uses a fully connected neural network model to train parameters on mass spectrometry experimental data;

集成學習訓練模組：與資料編碼模組相連，利用集成學習模型在質譜實驗資料上進行參數訓練； Ensemble learning training module: connected to the data encoding module; uses the ensemble learning model to train parameters on mass spectrometry experimental data;

免疫原性預測和/或鑑定模組：與神經網路訓練模組和集成學習模組相連。基於待測樣本的肽、表達量和HLA分型資料，根據神經網路係數矩陣和集成學習係數矩陣，用於預測和/或鑑定HLA分子呈遞可能性。 Immunogenicity Prediction and/or Identification Module: Linked to Neural Network Training Module and Ensemble Learning Module. Based on the peptide, expression level and HLA typing data of the test sample, according to the neural network coefficient matrix and the ensemble learning coefficient matrix, it is used to predict and/or identify the possibility of HLA molecule presentation.
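The wiring of the four modules listed above can be pictured as follows. This is a structural sketch under stated assumptions, not the disclosed implementation: the class name `PresentationPredictor` is hypothetical, the two trained models are represented as plain scoring callables, and the fusion rule (a weighted average with `nn_weight`) is an assumption, because this passage does not specify how the two coefficient matrices are combined.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]
Scorer = Callable[[Features], float]


@dataclass
class PresentationPredictor:
    """Prediction module connected to an encoder and two trained models."""
    encode: Callable[[str, float, str], Features]  # data encoding module
    nn_score: Scorer        # stands in for the trained neural network
    ensemble_score: Scorer  # stands in for the trained ensemble learner
    nn_weight: float = 0.5  # assumed fusion weight, not from the disclosure

    def predict(self, peptide: str, expression: float, allele: str) -> float:
        """Return a fused presentation likelihood for one peptide."""
        features = self.encode(peptide, expression, allele)
        nn = self.nn_score(features)
        ens = self.ensemble_score(features)
        return self.nn_weight * nn + (1.0 - self.nn_weight) * ens
```

Keeping the encoder and both scorers as injected callables mirrors the "connected to" relationships in the module list and lets each module be trained or replaced independently.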

本揭露還提供了電腦可讀介質，其具有存儲於其上用於實現本揭露所述方法的電腦可執行指令。 The present disclosure also provides computer-readable media having computer-executable instructions stored thereon for implementing the methods described in the present disclosure.

一些實施方案中提供了電腦可讀介質，其具有存儲於其上用於實現本揭露所述預測和/或鑑定肽由HLA分子呈遞可能性方法的電腦可執行指令。 In some embodiments, a computer-readable medium is provided having computer-executable instructions stored thereon for implementing the methods of the present disclosure for predicting and/or identifying the likelihood of peptide presentation by HLA molecules.

本揭露還提供了一種裝置，包括用於存儲程式的記憶體以及用於執行該程式的處理器，以實現本揭露所述的方法。 The present disclosure also provides an apparatus including a memory for storing a program and a processor for executing the program, so as to implement the method described in the present disclosure.

一些實施方案中提供了一種預測和/或鑑定肽由HLA分子呈遞可能性的裝置，包括用於存儲程式的記憶體以及用於執行該程式的處理器，以實現本揭露所述的預測和/或鑑定方法。 In some embodiments, an apparatus for predicting and/or identifying the likelihood of peptide presentation by an HLA molecule is provided, comprising a memory for storing a program and a processor for executing the program to achieve the prediction and/or the present disclosure. or identification method.

本揭露還提供了一種HLA等位元基因特異性結合資訊資料庫，其包括本揭露所述的方法預測和/或鑑定的肽的資訊和/或編碼本揭露所述的方法預測和/或鑑定的肽的多核苷酸資訊。 The present disclosure also provides an HLA allele-specific binding information database, which includes information on peptides predicted and/or identified by the methods described in the present disclosure and/or information on polynucleotides encoding peptides predicted and/or identified by the methods described in the present disclosure.

本揭露還提供了一種HLA等位元基因特異性結合肽序列資料庫，包括根據本揭露方法預測和/或鑑定的肽序列資訊。 The present disclosure also provides an HLA allele-specific binding peptide sequence database, including peptide sequence information predicted and/or identified according to the method of the present disclosure.

圖1為基於融合神經網路模型和集成學習模型預測肽由HLA分子呈遞可能性的系統和方法的流程圖。 Figure 1 is a flowchart of a system and method for predicting the likelihood of peptide presentation by HLA molecules based on a fused neural network model and ensemble learning model.

圖2為神經網路模型召回率40%時的陽性預測值。 Figure 2 shows the positive predictive value of the neural network model when the recall rate is 40%.

圖3為實施例4中本揭露與質譜實驗資料的陽性肽特徵比較。 Figure 3 is a comparison of positive peptide characteristics between the present disclosure and mass spectrometry experimental data in Example 4.

圖4為實施例6擴充機器學習模型系統陽性肽集後，召回率40%時的陽性預測值。 Figure 4 shows the positive predictive value at a recall of 40% after the positive peptide set of the machine learning model system was expanded in Example 6.

圖5為質譜分析樣品蛋白免疫印跡實驗。 Figure 5 is a Western blot experiment of mass spectrometry samples.

圖6為質譜分析樣品銀染實驗。 Figure 6 is the silver staining experiment of mass spectrometry samples.
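Figures 2 and 4 report the positive predictive value (PPV) at a recall of 40%. As a reading aid, the sketch below shows one conventional way to compute that metric from model scores and binary labels: rank peptides by score, walk down the ranking until 40% of the true positives have been recovered, and report the fraction of called peptides that are true positives at that point. The function name `ppv_at_recall` and the handling of ties (input order after sorting) are assumptions; the disclosure's own evaluation protocol may differ.

```python
import math


def ppv_at_recall(scores: list[float], labels: list[int], recall: float = 0.4) -> float:
    """PPV at the score cut-off where `recall` of all positives is first reached.

    `labels` holds 1 for a presented (positive) peptide and 0 otherwise.
    """
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive examples")
    needed = math.ceil(recall * total_positives)  # true positives to recover
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    for n_called, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives >= needed:
            return true_positives / n_called
    raise RuntimeError("unreachable: all positives were ranked")
```

For instance, with five positives among ten ranked peptides, 40% recall requires two true positives; if they sit at ranks 1 and 3, the PPV is 2/3.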

術語： Terms:

為了更容易理解本申請，以下具體定義了某些技術和科學術語。除顯而易見在本檔中的它處另有明確定義，否則本文使用的所有其它技術和科學術語都具有本申請所屬技術領域具有通常知識者通常理解的含義。 For easier understanding of this application, certain technical and scientific terms are specifically defined below. Unless otherwise clearly defined elsewhere in this document, all other technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

本申請所用胺基酸三字母代碼和單字母代碼如J.Biol.Chem,243,p3558(1968)中該。“抗原”是誘導免疫應答的物質，包括與抗體或T淋巴細胞(T細胞)特異性反應的肽或蛋白質，本揭露抗原可以包括至少一個使其不同於相應野生型親本抗原的變化的抗原，例如，該變化是腫瘤細胞突變或腫瘤細胞特異性翻譯後修飾。抗原可以包括多肽或多核苷酸。突變可以包括移碼或非移碼插入缺失、錯義或無義取代、剪接位點變化、基因組重排或基因融合，或產生由突變或其它異常如剪接而產生的腫瘤特異性開放閱讀框的任何基因組或表達變化。突變還可以包括剪接變體。腫瘤細胞特異性翻譯後修飾可以包括異常磷酸化。腫瘤細胞特異性翻譯後修飾還可以包括蛋白酶體產生的剪接抗原。 The three-letter and one-letter codes for amino acids used in this application are as described in J. Biol. Chem, 243, p3558 (1968). An "antigen" is a substance that induces an immune response, including peptides or proteins that react specifically with antibodies or T lymphocytes (T cells). An antigen of the present disclosure may be an antigen having at least one change that makes it different from the corresponding wild-type parent antigen; for example, the change is a tumor cell mutation or a tumor cell-specific post-translational modification. Antigens can include polypeptides or polynucleotides. Mutations may include frameshift or non-frameshift indels, missense or nonsense substitutions, splice site changes, genomic rearrangements or gene fusions, or any genomic or expression change that gives rise to a tumor-specific open reading frame resulting from a mutation or other aberration such as splicing. Mutations can also include splice variants. Tumor cell-specific post-translational modifications can include aberrant phosphorylation. Tumor cell-specific post-translational modifications can also include proteasome-generated spliced antigens.

“腫瘤抗原”是存在於受試者的腫瘤細胞或組織中但不存在於受試者的相應正常細胞或組織中的抗原。 A "tumor antigen" is an antigen that is present in tumor cells or tissues in a subject but not in corresponding normal cells or tissues in the subject.

“誘導免疫應答”和“增強免疫應答”可互換使用，並指免疫應答對特定抗原的剌激(即，被動或適應性的)。在物件中誘導免疫應答之後，該物件被保護免於發生疾病(例如癌症疾病)或者藉由誘導免疫應答使疾病狀況得到改善。例如，對腫瘤所表達抗原的免疫應答可在患有癌症疾病的患者中或者在有風險發生癌症疾病的物件中被誘導。在這種情況下，誘導免疫應答可意味著物件的疾病狀況得到改善，物件沒有發生轉移，或者有風險發生癌症疾病的物件沒有發生癌症疾病。 "Inducing an immune response" and "enhancing an immune response" are used interchangeably and refer to the stimulation (i.e., passive or adaptive) of an immune response to a particular antigen. After an immune response is induced in a subject, the subject is protected from developing a disease (e.g., a cancer disease), or the disease condition is ameliorated by the induced immune response. For example, an immune response to an antigen expressed by a tumor can be induced in a patient suffering from a cancer disease or in a subject at risk of developing a cancer disease. In this case, inducing an immune response may mean that the subject's disease condition is ameliorated, that the subject does not develop metastases, or that a subject at risk of developing a cancer disease does not develop it.

“免疫原性”是例如藉由T細胞、B細胞或兩者引發免疫應答的能力。 "Immunogenicity" is the ability to elicit an immune response, eg, by T cells, B cells, or both.

“肽”、“肽序列”、“肽段”與“多肽”可互換使用，並指胺基酸殘基的鏈，通常具有確定的序列，包括肽的變體或片段，肽片段可以是單體或聚合的。“肽”包含藉由修飾或未修飾的肽鍵連接的至少兩個胺基酸的任何肽，L-胺基酸和D-胺基酸均可使用。肽可包含修飾的胺基酸，例如可藉由自然過程如轉錄後修飾或藉由化學過程進行修飾。這些修飾的一些實例是：乙醯化、醯化、ADP-核糖基化、醯胺化、與黃素的共價鍵合、與血紅素的共價鍵合、與核苷酸或核苷酸衍生物的共價鍵合、與修飾或未修飾的碳水化合物部分的共價鍵合、與脂質或脂質衍生物鍵合、與磷脂醯肌醇共價鍵合、交聯、環化、二硫鍵形成、去甲基化、半胱胺酸分子形成、焦谷胺酸形成、甲醯化、γ-羧化、羥基化、碘化、甲基化、氧化、磷酸化、外消旋化、羥基化等。 "Peptide", "peptide sequence", "peptide fragment" and "polypeptide" are used interchangeably and refer to a chain of amino acid residues, usually with a defined sequence, including variants or fragments of peptides; peptide fragments may be monomeric or polymeric. A "peptide" includes any peptide comprising at least two amino acids joined by modified or unmodified peptide bonds; both L-amino acids and D-amino acids can be used. Peptides may contain modified amino acids, which may be modified, for example, by natural processes such as post-transcriptional modification or by chemical processes. Some examples of these modifications are: acetylation, acylation, ADP-ribosylation, amidation, covalent bonding to flavin, covalent bonding to heme, covalent bonding to nucleotides or nucleotide derivatives, covalent bonding to modified or unmodified carbohydrate moieties, bonding to lipids or lipid derivatives, covalent bonding to phosphatidylinositol, cross-linking, cyclization, disulfide bond formation, demethylation, cysteine molecule formation, pyroglutamate formation, formylation, gamma-carboxylation, hydroxylation, iodination, methylation, oxidation, phosphorylation, racemization, hydroxylation, etc.

“多核苷酸”或“核酸”指任何長度的核苷酸鏈，包括DNA和RNA。核苷酸可為去氧核糖核苷酸、核糖核苷酸、經修飾的核苷酸或鹼基和/或其類似物、或者可藉由DNA或RNA聚合酶摻入鏈內的任何受質。多核苷酸可包含經修飾的核苷酸，例如甲基化的核苷酸及其類似物。如果存在的話，可在鏈組裝之前或鏈組裝之後賦予對核苷酸結構的修飾。多核苷酸還可含有本領域一般已知的核糖或去氧核糖糖的類似形式，包括例如2'-O-甲基-、2'-O-烯丙基、2'-氟-或2'-疊氮基-核糖、碳環糖類似物、α-或β-異頭糖、差向異構糖(如阿拉伯糖、木糖或來蘇糖、吡喃糖、呋喃糖、景天庚酮糖)、無環類似物和無鹼基核苷類似物如甲基核糖苷。 "Polynucleotide" or "nucleic acid" refers to a chain of nucleotides of any length, including DNA and RNA. Nucleotides can be deoxyribonucleotides, ribonucleotides, modified nucleotides or bases and/or their analogs, or any substrate that can be incorporated into the chain by a DNA or RNA polymerase. Polynucleotides may comprise modified nucleotides, such as methylated nucleotides and their analogs. Modifications to the nucleotide structure, if present, can be imparted before or after strand assembly. Polynucleotides may also contain analogous forms of ribose or deoxyribose sugars generally known in the art, including, for example, 2'-O-methyl-, 2'-O-allyl-, 2'-fluoro- or 2'-azido-ribose, carbocyclic sugar analogs, alpha- or beta-anomeric sugars, epimeric sugars (such as arabinose, xylose or lyxose, pyranose, furanose, sedoheptulose), acyclic analogs and abasic nucleoside analogs such as methyl riboside.

“抗原加工”或“加工”是指多肽或抗原到加工產物(為該多肽或抗原之片段)的降解(例如，多肽到肽的降解)以及這些片段中的一個或更多個與MHC分子的相關聯(例如，藉由結合)來被細胞(較佳抗原呈遞細胞)呈遞到特異性T細胞。 "Antigen processing" or "processing" refers to the degradation of a polypeptide or antigen into processing products that are fragments of that polypeptide or antigen (e.g., the degradation of a polypeptide into peptides), and the association of one or more of these fragments with MHC molecules (e.g., by binding) for presentation by cells, preferably antigen-presenting cells, to specific T cells.

“抗原呈遞細胞”(APC)是在其細胞表面展示與MHC分子相關聯之蛋白質抗原的肽片段的細胞。一些APC可活化抗原特異性T細胞。 "Antigen-presenting cells" (APCs) are cells that display peptide fragments of protein antigens associated with MHC molecules on their cell surface. Some APCs activate antigen-specific T cells.

“HLA親和力”是特定抗原與特定HLA等位基因之間的結合親和力。 "HLA affinity" is the binding affinity between a particular antigen and a particular HLA allele.

“疫苗”是指其在施用後誘導識別並攻擊病原體或病變細胞如癌症細胞的免疫應答，特別是細胞免疫應答。疫苗可用於預防或治療疾病。 A "vaccine" refers to a substance that, upon administration, induces an immune response, in particular a cellular immune response, that recognizes and attacks pathogens or diseased cells such as cancer cells. Vaccines can be used to prevent or treat disease.

“個體化癌症疫苗”係關於特定的癌症患者並且意指癌症疫苗適合於個體癌症患者的需要或特殊情況。 "Personalized cancer vaccine" refers to a particular cancer patient and means that the cancer vaccine is tailored to the needs or particular circumstances of the individual cancer patient.

“神經網路”是用於分類或回歸的機器學習模型，由多層線性變換，繼之以通常藉由隨機梯度下降和反向傳播訓練的逐元素非線性組成。 A "neural network" is a machine learning model for classification or regression that consists of multiple layers of linear transformations followed by element-wise nonlinearity, usually trained by stochastic gradient descent and backpropagation.

“集成學習”是指將多個學習模型進行組合，以獲得更好的預測效果，從而使組合後的模型具有更強的泛化能力，或者說具有更強的普適性。 "Ensemble learning" refers to combining multiple learning models to obtain better prediction results, so that the combined model has stronger generalization ability, or more universality.

“XGBoost”運用集成學習思想來進行結果/標籤的預測。XGBoost通常可以用於解決兩種問題，包括分類問題和回歸問題。 "XGBoost" uses the ensemble learning idea for result/label prediction. XGBoost can generally be used to solve two kinds of problems, including classification problems and regression problems.

“訓練集”是指用於訓練的樣本集合主要用來訓練學習模型中的參數。 "Training set" refers to the set of samples used for training, mainly for fitting the parameters of the learning model.

“驗證集”是指用於驗證模型性能的樣本集合，不同學習模型在訓練集上訓練結束後，藉由驗證集來比較判斷各個模型的性能。 "Validation set" refers to the sample set used to verify the performance of the model. After different learning models are trained on the training set, the performance of each model is compared and judged by the validation set.

“程式”或“電腦程式”通常是指符合特定程式設計語言規則的語法單位，其由聲明和語句或者指令組成，可以分成解決或執行某功能、任務或問題所需的“程式碼片段”。 "Program" or "computer program" generally refers to a syntactic unit conforming to the rules of a particular programming language, consisting of declarations and statements or instructions, which can be divided into "code fragments" required to solve or perform a certain function, task or problem.

“系統”或“電腦系統”通常是指一個或多個進行資料處理的電腦，外部設備以及軟體。“使用者”或“系統操作者”通常包括藉由“使用者裝置”(例如，電腦、無線裝置等)接入並使用電腦網路的人，其目的在於資料處理和資訊交換。 "System" or "computer system" generally refers to one or more computers, peripherals, and software that process data. A "user" or "system operator" generally includes a person who accesses and uses a computer network via a "user device" (eg, computer, wireless device, etc.) for the purpose of data processing and information exchange.

“電腦”通常是能夠進行實質性計算(substantial computation)，包括大量的沒有人為干預的算數運算和邏輯運算的功能單元。 A "computer" is usually a functional unit capable of performing substantial computation, including a large number of arithmetic and logical operations without human intervention.

“應用軟體”或“應用程式”通常是指特意解決應用問題的軟體或程式。 "Application software" or "application" generally refers to software or programs designed to solve application problems.

“受試者”涵蓋細胞、組織或生物體、人或非人，無論是體內、離體還是體外，雄性還是雌性的。術語受試者包括含人在內的哺乳動物。 "Subject" encompasses cells, tissues or organisms, human or non-human, whether in vivo, ex vivo or in vitro, male or female. The term subject includes mammals including humans.

“核酸”是去氧核糖核酸(DNA)或核糖核酸(RNA)。根據本揭露，核酸包括基因組DNA、cDNA、mRNA、重組產生的和化學合成的分子。根據本揭露，核酸可以以單鏈或雙鏈的以及線性或共價環狀閉合的分子存在。根據本揭露的核酸可以是分離的。根據本揭露，術語“分離的核酸”意指該核酸為(i)體外擴增的，例如藉由聚合酶鏈反應(PCR)，(ii)藉由選殖重組產生的，(iii)純化的，例如藉由切割和經凝膠電泳分離，或(iv)合成的，例如藉由化學合成。可用核酸引入(即轉染)細胞，尤其是，可藉由體外轉錄從DNA範本製備的RNA形式。RNA還可在應用之前藉由穩定序列、加帽和聚腺苷酸化進行修飾。 "Nucleic acid" is deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). According to the present disclosure, nucleic acid includes genomic DNA, cDNA, mRNA, recombinantly produced and chemically synthesized molecules. According to the present disclosure, nucleic acids can exist as single- or double-stranded and linear or covalently circularly closed molecules. Nucleic acids according to the present disclosure may be isolated. According to the present disclosure, the term "isolated nucleic acid" means that the nucleic acid is (i) amplified in vitro, eg, by polymerase chain reaction (PCR), (ii) produced by clonal recombination, (iii) purified , eg by cleavage and separation by gel electrophoresis, or (iv) synthesized eg by chemical synthesis. Nucleic acids can be introduced into (ie, transfected) cells, in particular, RNA forms that can be prepared from DNA templates by in vitro transcription. RNA can also be modified by stabilizing sequences, capping and polyadenylation prior to application.

“製備”包含大規模生產過程中的製備和實驗室製備。 "Preparation" includes preparation in large-scale production processes and laboratory preparations.

“預測”意指前瞻性的確定或鑑別。 "Prediction" means a prospective determination or identification.

“給予”和“處理”當應用於動物、人、實驗受試者、細胞、組織、器官或生物流體時，是指外源性藥物、治療劑、診斷劑或組成物與動物、人、受試者、細胞、組織、器官或生物流體的接觸。“給予”和“處理”可以指例如治療、藥物代謝動力學、診斷、研究和實驗方法。細胞的處理包括試劑與細胞的接觸，以及試劑與流體的接觸，其中該流體與細胞接觸。“給予”和“處理”還意指藉由試劑、診斷、結合組成物或藉由另一種細胞體外和離體處理例如細胞。“處理”當應用於人、獸醫學或研究受試者時，是指治療處理、預防或預防性措施，研究和診斷應用。 "Administering" and "treating", when applied to an animal, human, experimental subject, cell, tissue, organ or biological fluid, refer to contact of an exogenous drug, therapeutic agent, diagnostic agent or composition with the animal, human, subject, cell, tissue, organ or biological fluid. "Administering" and "treating" can refer, for example, to therapeutic, pharmacokinetic, diagnostic, research and experimental methods. Treatment of a cell includes contact of a reagent with the cell, as well as contact of a reagent with a fluid, where the fluid is in contact with the cell. "Administering" and "treating" also mean in vitro and ex vivo treatment of, for example, a cell by a reagent, diagnostic or binding composition, or by another cell. "Treating", when applied to a human, veterinary or research subject, refers to therapeutic treatment, prophylactic or preventive measures, and research and diagnostic applications.

“治療”意指給予患者內用或外用治療劑，諸如包含本文的任一種結合化合物的組成物，該患者具有一種或多種疾病症狀，而已知該治療劑對這些症狀具有治療作用。通常，在受治療患者或群體中以有效緩解一種或多種疾病症狀的量給予治療劑，無論是藉由誘導這類症狀退化還是抑制這類症狀發展到任何臨床可測量的程度。有效緩解任何具體疾病症狀的治療劑的量(也稱作“治療有效量”)可根據多種因素變化，例如患者的疾病狀態、年齡和體重，以及藥物在患者產生需要療效的能力。藉由醫生或其它專業衛生保健人士通常用於評價該症狀的嚴重性或進展狀況的任何臨床檢測方法，可評價疾病症狀是否已被減輕。儘管本本的實施方案(例如治療方法或製品)在緩解單個患者的目標疾病症狀方面可能無效，但是根據本領域已知的任何統計學檢驗方法如Student t檢驗、卡方檢驗、依據Mann和Whitney的U檢驗、Kruskal-Wallis檢驗(H檢驗)、Jonckheere-Terpstra檢驗和Wilcoxon檢驗確定，其在統計學顯著數目的患者中應當減輕目標疾病症狀。 "Treatment" means administering an internal or external therapeutic agent, such as a composition comprising any of the binding compounds herein, to a patient who has one or more disease symptoms for which the therapeutic agent is known to have a therapeutic effect. Typically, the therapeutic agent is administered to the treated patient or population in an amount effective to alleviate one or more disease symptoms, whether by inducing regression of such symptoms or by inhibiting their progression to any clinically measurable degree. The amount of a therapeutic agent that is effective to alleviate any particular disease symptom (also referred to as a "therapeutically effective amount") can vary with factors such as the patient's disease state, age and weight, and the ability of the drug to produce the desired effect in the patient. Whether a disease symptom has been alleviated can be assessed by any clinical test commonly used by physicians or other health care professionals to assess the severity or progression of that symptom. Although embodiments of the present disclosure (e.g., methods of treatment or articles of manufacture) may be ineffective in alleviating the target disease symptom in an individual patient, they should alleviate the target disease symptom in a statistically significant number of patients, as determined by any statistical test known in the art, such as Student's t-test, the chi-square test, the Mann-Whitney U test, the Kruskal-Wallis test (H test), the Jonckheere-Terpstra test and the Wilcoxon test.

“保守修飾”或“保守置換或取代”是指具有類似特徵(例如電荷、側鏈大小、疏水性/親水性、主鏈構象和剛性等)的其它胺基酸置換蛋白中的胺基酸，使得可頻繁進行改變而不改變蛋白的生物學活性。所屬技術領域具有通常知識者知曉，一般而言，多肽的非必需區域中的單個胺基酸置換基本上不改變生物學活性(參見例如Watson等(1987)Molecular Biology of the Gene，The Benjamin/Cummings Pub.Co.，第224頁，(第4版))。另外，結構或功能類似的胺基酸的置換不大可能破環生物學活性。 "Conservative modification" or "conservative substitution or substitution" refers to the replacement of amino acids in proteins by other amino acids with similar characteristics (eg, charge, side chain size, hydrophobicity/hydrophilicity, backbone conformation and rigidity, etc.), This allows frequent changes without altering the biological activity of the protein. It is known to those of ordinary skill in the art that, in general, single amino acid substitutions in non-essential regions of polypeptides do not substantially alter biological activity (see, eg, Watson et al. (1987) Molecular Biology of the Gene, The Benjamin/Cummings Pub. Co., p. 224, (4th ed.). In addition, substitution of structurally or functionally similar amino acids is unlikely to disrupt biological activity.

應用於某個物件的術語“天然存在的”是指這樣的事實，即該物件可在自然界中發現。例如存在於可從自然界來源分離得到的生物體(包括病毒)、且未經人工在實驗室中有意修飾的肽或多核苷酸即是天然存在的。 The term "naturally occurring" as applied to an item refers to the fact that the item can be found in nature. For example, peptides or polynucleotides that are present in organisms (including viruses) that can be isolated from natural sources and have not been intentionally modified artificially in the laboratory are naturally occurring.

“有效量”包含足以改善或預防醫學病症的症狀或病症的量。有效量還意指足以允許或促進診斷的量。用於特定患者或獸醫學受試者的有效量可依據以下因素而變化：如待治療的病症、患者的總體健康情況、給藥的方法途徑和劑量以及副作用嚴重性。有效量可以是避免顯著副作用或毒性作用的最大劑量或給藥方案。 An "effective amount" includes an amount sufficient to ameliorate or prevent a symptom or sign of a medical condition. An effective amount also means an amount sufficient to allow or facilitate diagnosis. The effective amount for a particular patient or veterinary subject may vary depending on factors such as the condition being treated, the general health of the patient, the method, route and dosage of administration, and the severity of side effects. An effective amount can be the maximum dose or dosing regimen that avoids significant side effects or toxic effects.

“外源性”指根據背景在生物、細胞或人體外產生的物質。“內源性”指根據背景在細胞、生物或人體內產生的物質。 "Exogenous" refers to a substance produced outside an organism, cell, or human body depending on the context. "Endogenous" refers to a substance produced in a cell, organism or human body depending on the context.

“同源性”或“同一性”是指兩個多核苷酸序列之間或兩個多肽之間的序列相似性。當兩個比較序列中的位置均被相同鹼基或胺基酸單體亞基佔據時，例如如果兩個DNA分子的每一個位置都被腺嘌呤佔據時，那麼該分子在該位置是同源的。兩個序列之間的同源性百分率是兩個序列共有的匹配或同源位置數除以比較的位置數×100%的函數。例如，在序列最佳比對時，如果兩個序列中的10個位置有6個匹配或同源，那麼兩個序列為60%同源。一般而言，當比對兩個序列而得到最大的同源性百分率時進行比較。本文該“至少85%序列同一性”是指變體與親本序列相比，兩序列具有至少85%同源，在一些方案中，其具有至少86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%、或99%序列同源；在一些具體的方案中，其具有90%、95%或99%以上；在另一些具體的方案中，其具有至少95%序列同源。該具有至少85%序列同一性的胺基酸序列包括藉由對親本序列進行一個或者多個胺基酸缺失、插入或替換突變獲得。 "Homology" or "identity" refers to the sequence similarity between two polynucleotide sequences or between two polypeptides. When a position in both of the compared sequences is occupied by the same base or amino acid monomer subunit, for example if a position in each of two DNA molecules is occupied by adenine, then the molecules are homologous at that position. The percent homology between two sequences is the number of matching or homologous positions shared by the two sequences divided by the number of positions compared, multiplied by 100%. For example, when the sequences are optimally aligned, two sequences are 60% homologous if 6 of 10 positions match or are homologous. In general, the comparison is made when the two sequences are aligned to give the maximum percent homology. As used herein, "at least 85% sequence identity" means that the variant and the parent sequence are at least 85% homologous; in some embodiments they have at least 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% sequence homology; in some specific embodiments they have 90%, 95% or 99% or more; in other specific embodiments they have at least 95% sequence homology. Amino acid sequences having at least 85% sequence identity include those obtained by one or more amino acid deletion, insertion or substitution mutations of the parent sequence.
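
The percent-homology arithmetic above can be illustrated with a minimal Python sketch. It assumes the two sequences have already been optimally aligned to equal length and contain no gap characters; the function name `percent_identity` and the toy sequences are illustrative only and do not appear in the disclosure.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: alignment has already been done elsewhere; this function
    only counts matching positions and divides by the positions compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Toy example mirroring the text: 6 of 10 aligned positions match.
print(percent_identity("ACDEFGHIKL", "ACDEFGAAAA"))  # 60.0
```

A variant would satisfy the "at least 85% sequence identity" criterion when this value, computed on the optimal alignment, is 85.0 or higher.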

本文使用的表述“細胞”、“細胞系”和“細胞培養物”可互換使用，並且所有這類名稱都包括其後代。因此，單詞“轉化體”和“轉化細胞”包括原代受試細胞和由其衍生的培養物，而不考慮轉移數目。還應當理解的是，由於故意或非有意的突變，所有後代在DNA含量方面不可能精確相同。包括具有與最初轉化細胞中篩選的相同的功能或生物學活性的突變後代。 As used herein, the expressions "cell", "cell line" and "cell culture" are used interchangeably and all such designations include progeny thereof. Thus, the words "transformants" and "transformed cells" include primary test cells and cultures derived therefrom, regardless of the number of transfers. It should also be understood that, due to deliberate or unintentional mutations, all progeny may not be exactly the same in terms of DNA content. Mutant progeny that have the same function or biological activity as screened in the original transformed cell are included.

“視需要”或“視需要地”意味著隨後所描述地事件或環境可以但不必發生，該說明包括該事件或環境發生或不發生地場合。例如，“視需要包含1-3個抗體重鏈可變區”意味著特定序列的抗體重鏈可變區可以但不必須存在。 "Optional" or "optionally" means that the subsequently described event or circumstance can but need not occur, and that the description includes instances where the event or circumstance occurs or does not occur. For example, "optionally comprising 1-3 antibody heavy chain variable regions" means that an antibody heavy chain variable region of a particular sequence may, but need not, be present.

“醫藥組成物”表示含有一種或多種本文所述化合物或其生理學上/可藥用的鹽或前體藥物與其他化學組分的混合物，以及其他組分例如生理學/可藥用的載體和賦形劑。醫藥組成物的目的是促進對生物體的給藥，利於活性成分的吸收進而發揮生物活性。 "Pharmaceutical composition" means a mixture containing one or more of the compounds described herein, or a physiologically/pharmaceutically acceptable salt or prodrug thereof, with other chemical components, and other components such as a physiological/pharmaceutically acceptable carrier and excipients. The purpose of the pharmaceutical composition is to facilitate the administration to the organism, facilitate the absorption of the active ingredient and then exert the biological activity.

術語“載體”是指能夠運輸已與其連接的另一個核酸的核酸分子。在一個實施方案中，載體是“質粒”，其是指可將另外的DNA區段連接至其中的環狀雙鏈DNA環。在另一個實施方案中，載體是病毒載體，其中可將另外的DNA區段連接至病毒基因組中。本文中公開的載體能夠在已引入它們的宿主細胞中自主複製(例如，具有細菌的複製起點的細菌載體和附加型哺乳動物載體)或可在引入宿主細胞後整合入宿主細胞的基因組，從而隨宿主基因組一起複製(例如，非附加型哺乳動物載體)。 The term "vector" refers to a nucleic acid molecule capable of transporting another nucleic acid to which it has been linked. In one embodiment, the vector is a "plasmid," which refers to a circular double-stranded DNA loop into which additional DNA segments can be ligated. In another embodiment, the vector is a viral vector in which additional DNA segments can be ligated into the viral genome. The vectors disclosed herein are capable of autonomous replication in the host cells into which they have been introduced (eg, bacterial vectors and episomal mammalian vectors with a bacterial origin of replication) or may integrate into the host cell's genome after introduction into the host cell, thereby following The host genome replicates together (eg, a non-episomal mammalian vector).

以下結合實施例進一步描述本揭露，但這些實施例並非限制著本揭露的範圍。本揭露實施例中未註明具體條件的實驗方法，通常按照常規條件，如冷泉港的抗體技術實驗手冊，分子選殖手冊；或按照原料或商品製造廠商所建議的條件。未註明具體來源的試劑，為市場購買的常規試劑。 The present disclosure is further described below with reference to the embodiments, but these embodiments do not limit the scope of the present disclosure. The experimental methods that do not specify specific conditions in the embodiments of the present disclosure generally follow conventional conditions, such as Cold Spring Harbor Antibody Technology Experiment Manual, Molecular Cloning Manual; or conditions suggested by raw material or commodity manufacturers. Reagents with no specific source indicated are conventional reagents purchased in the market.

實施例1. 機器學習模型系統構建 Example 1. Construction of the machine learning model system

資料集構建過程： Dataset construction process:

首先，從公共資料庫IEDB、SYFPEITHI和MassIVE等獲取質譜實驗得到的肽的免疫原性資料與表達量資料，僅保留長度為8-11個胺基酸且有表達的肽，總共得到154,103條質譜肽陽性資料，作為陽性肽集。 First, immunogenicity data and expression data for peptides identified in mass spectrometry experiments were obtained from public databases such as IEDB, SYFPEITHI and MassIVE. Only expressed peptides 8-11 amino acids in length were retained, giving a total of 154,103 mass-spectrometry-positive peptide records, which served as the positive peptide set.

然後，從UCSC資料庫中查找人類參考基因組hg38所有有表達的蛋白資料，並窮舉切割為8-11個胺基酸長度，作為陰性肽集。同時使用UCSC資料庫獲取各肽的註釋資訊，使用BioMart網站將註釋資訊轉化為各肽所屬蛋白家族ID(即PANTHER family ID欄位)。 Then, find all the expressed protein data of the human reference genome hg38 from the UCSC database, and exhaustively cut it into 8-11 amino acid lengths as a negative peptide set. At the same time, the UCSC database was used to obtain the annotation information of each peptide, and the BioMart website was used to convert the annotation information into the protein family ID (ie the PANTHER family ID column) to which each peptide belongs.
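
The exhaustive cutting step can be sketched as a sliding window over each protein sequence. This is a minimal illustration only: the function name `enumerate_peptides` and the toy sequence are hypothetical, and the disclosure does not describe how its own pipeline implements the cut.

```python
def enumerate_peptides(protein: str, min_len: int = 8, max_len: int = 11) -> list:
    """Return every contiguous window of min_len..max_len residues."""
    peptides = []
    for k in range(min_len, max_len + 1):
        # A protein of length n has n - k + 1 windows of length k.
        for start in range(len(protein) - k + 1):
            peptides.append(protein[start:start + k])
    return peptides


# Toy 12-residue sequence (not a real protein): 5 + 4 + 3 + 2 = 14 windows.
toy_protein = "MKTAYIAKQRQI"
candidates = enumerate_peptides(toy_protein)
print(len(candidates))
print(candidates[0])
```

Applied to every expressed protein in a proteome, this produces the very large candidate pool from which the negative set is drawn; de-duplicating across proteins would be a further step not shown here.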

最後，將陽性肽集與陰性肽集混合，再切分為兩類，一類為測試資料集(測試集)，包含多等位基因測試集和單等位基因測試集，兩個測試集分別計分，用於檢測模型效果；一類為訓練資料集(訓練集)，用於進行模型訓練。多等位基因測試集包含5,000,500條肽，單等位基因測試集包含32,003,200條肽，訓練集包含105,724,241條肽。 Finally, the positive and negative peptide sets were mixed and then split into two parts. One is the test data set (test set), comprising a multi-allele test set and a mono-allele test set, which are scored separately and used to assess model performance; the other is the training data set (training set), used for model training. The multi-allele test set contains 5,000,500 peptides, the mono-allele test set contains 32,003,200 peptides, and the training set contains 105,724,241 peptides.

資料集構建過程在神經網路模型訓練和集成學習模型訓練中一致。 The dataset construction process is consistent in neural network model training and ensemble learning model training.

模型訓練優化過程： Model training optimization process:

經過大量實驗，本實施例選擇神經網路模型與集成學習模型相融合組成最終的較佳模型，並劃定閾值使回歸模型具有分類能力。神經網路模型同在後實施例2構建，以relu作為啟動函數和Lecun初始化的全連接神經網路模型。 After extensive experiments, this example fuses a neural network model with an ensemble learning model to form the final preferred model, and sets a threshold so that the regression model can also classify. The neural network model is constructed as in Example 2 below: a fully connected neural network with ReLU as the activation function and LeCun initialization.

模型訓練之前需要將資料集中的資料進行預處理，將HLA分型和肽序列胺基酸轉化為獨熱編碼，連同表達量資料與肽所屬蛋白家族ID輸入模型之中進行訓練。 Before model training, the data in the data set needs to be preprocessed, the HLA typing and peptide sequence amino acids are converted into one-hot encoding, and the expression data and the protein family ID to which the peptide belongs are input into the model for training.
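
A minimal sketch of the one-hot preprocessing is given below. Several details are assumptions made for illustration and are not stated in the disclosure: the 20-letter residue alphabet and its ordering, zero-padding of shorter peptides to 11 positions, and the short list of known HLA types (taken from alleles mentioned elsewhere in this document).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and ordering
MAX_LEN = 11                          # longest peptide kept in the data set


def one_hot_peptide(peptide: str) -> list:
    """Encode a peptide of 8-11 residues as a flat 11 x 20 one-hot vector.

    Positions beyond the peptide's length stay all-zero (assumed padding).
    """
    if not 8 <= len(peptide) <= MAX_LEN:
        raise ValueError("peptide length must be 8-11")
    vector = []
    for pos in range(MAX_LEN):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(peptide):
            row[AMINO_ACIDS.index(peptide[pos])] = 1  # ValueError if unknown
        vector.extend(row)
    return vector


def one_hot_allele(allele: str, known_alleles: list) -> list:
    """Encode an HLA type as a one-hot vector over a fixed allele list."""
    if allele not in known_alleles:
        raise ValueError("unknown HLA type: " + allele)
    return [1 if allele == known else 0 for known in known_alleles]


alleles = ["A*02:01", "A*02:07", "A*24:02"]
features = one_hot_peptide("ACDEFGHIK") + one_hot_allele("A*02:07", alleles)
print(len(features))  # 220 peptide entries + 3 allele entries
```

In the described pipeline the expression value and the protein family ID would be appended to this feature vector; how those two fields are encoded is not specified, so they are left out here.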

訓練過程中使用CentOS 7.0作業系統，使用python 2.17.15和Perl 5.10作為程式設計語言。使用Keras 2.2.4和TensorFlow 1.15.0構建的全連接網路作為神經網路架構，使用XGBoost 0.82作為集成學習器進行訓練。其中全連接網路架構為4層，每層神經元個數依次為256、32、16和1，使用二元交叉熵(binary crossentropy)作為損失函數，使用Adam作為優化器進行參數優化。XGBoost使用貝葉斯(Bayesian)法對分類樹的最大深度(max_depth)、一個子節點所需的最小實例權重總和(min_child_weight)、構造每棵樹時列的子採樣比率(colsample_bytree)、葉節點上劃分所需的最小損失減少(gamma)、最大的弱學習器的個數(n_estimators)、學習率(learning_rate)和訓練實例的子樣本比率(subsample)等參數進行優化搜索。 Training was carried out on the CentOS 7.0 operating system, with python 2.17.15 and Perl 5.10 as the programming languages. A fully connected network built with Keras 2.2.4 and TensorFlow 1.15.0 served as the neural network architecture, and XGBoost 0.82 served as the ensemble learner. The fully connected network has 4 layers with 256, 32, 16 and 1 neurons respectively; binary cross-entropy is used as the loss function and Adam as the optimizer. For XGBoost, a Bayesian method was used to search for optimal values of the maximum tree depth (max_depth), the minimum sum of instance weights required in a child node (min_child_weight), the column subsampling ratio when constructing each tree (colsample_bytree), the minimum loss reduction required to split a leaf node (gamma), the maximum number of weak learners (n_estimators), the learning rate (learning_rate) and the subsampling ratio of training instances (subsample).
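
To make the 256-32-16-1 layout concrete, the following dependency-free Python sketch builds such a network and runs one forward pass. It is not the Keras implementation described above. The input width of 220, the sigmoid output (chosen because the loss is binary cross-entropy), the zero biases and the use of an untruncated Gaussian with standard deviation sqrt(1 / fan_in) for LeCun-style initialization are all assumptions made for illustration.

```python
import math
import random

LAYER_SIZES = [256, 32, 16, 1]  # neurons per layer, as listed in the text


def lecun_normal_layer(fan_in, fan_out, rng):
    """Weights ~ N(0, 1 / fan_in), biases zero (LeCun-style initialization)."""
    std = math.sqrt(1.0 / fan_in)
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    biases = [0.0] * fan_out
    return weights, biases


def build_network(input_dim, rng):
    layers, fan_in = [], input_dim
    for fan_out in LAYER_SIZES:
        layers.append(lecun_normal_layer(fan_in, fan_out, rng))
        fan_in = fan_out
    return layers


def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the single output unit (assumed)."""
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, value) for value in z]
        else:
            x = [1.0 / (1.0 + math.exp(-value)) for value in z]
    return x[0]


rng = random.Random(0)
network = build_network(input_dim=220, rng=rng)  # 220 is an assumed width
example_input = [1.0 if i % 20 == 0 else 0.0 for i in range(220)]
score = forward(network, example_input)
print(0.0 < score < 1.0)
```

The untrained score is meaningless as a prediction; the sketch only shows the data flow through the four layers. Training with Adam and binary cross-entropy, as the text describes, would be handled by the deep learning framework.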

訓練過程中對訓練集取出10%用於驗證當次訓練模型表現，重複十次以使驗證資料覆蓋所有訓練集。十次訓練完成後使用十個模型對多等位元基因測試集進行預測，以測試集召回率(recall)=0.4劃定閾值，以測試集上的PPV作為較佳目標，優化模型參數與架構。 During training, 10% of the training set was held out to validate the model trained in that round, and this was repeated ten times so that the validation data covered the entire training set. After the ten rounds of training, the ten models were used to predict the multi-allele test set; the threshold was set at a test-set recall of 0.4, and the PPV on the test set was used as the optimization target for tuning the model parameters and architecture.
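
One way to set a threshold at a fixed recall and then read off the PPV is sketched below. The function names, the toy scores and the choice to treat a score equal to the threshold as positive are assumptions for illustration; the disclosure does not give its exact procedure.

```python
import math


def threshold_at_recall(scores, labels, recall=0.4):
    """Highest score cut-off at which at least `recall` of positives pass."""
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positive_scores:
        raise ValueError("no positive examples")
    # Small epsilon guards against floating-point overshoot in the product.
    needed = max(1, math.ceil(recall * len(positive_scores) - 1e-9))
    return positive_scores[needed - 1]


def ppv_and_recall(scores, labels, threshold):
    """PPV = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return ppv, rec


# Toy data: five positives (label 1) and five negatives (label 0).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
cutoff = threshold_at_recall(toy_scores, toy_labels, recall=0.4)
ppv, rec = ppv_and_recall(toy_scores, toy_labels, cutoff)
print(cutoff, ppv, rec)
```

In the toy data two of the five positives score at or above the cut-off of 0.7, together with one negative, so recall is 0.4 and PPV is 2/3. Comparing models by PPV at a shared recall level is what makes the numbers in the later examples comparable.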

經過訓練與優化，各實施例使用1：1000作為訓練集中陽性陰性資料混合比例，使用relu作為神經網路啟動函數，使用Lecun初始化器作為神經網路初始化器，將集成學習模型的輸出結果在神經網路模型中的得分乘以權重係數0.01，組成最終較佳模型為本揭露融合模型。 After training and optimization, each example uses a positive-to-negative mixing ratio of 1:1000 in the training set, ReLU as the neural network activation function, and the LeCun initializer as the neural network initializer. The score of the ensemble learning model's output in the neural network model is multiplied by a weighting coefficient of 0.01, and the resulting final preferred model is the fusion model of the present disclosure.
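
The 1:1000 positive-to-negative mixing ratio can be illustrated with a small sampling sketch. Uniform random sampling of negatives, the fixed seed, the function name `mix_training_set` and the placeholder strings are assumptions; the disclosure states only the ratio.

```python
import random


def mix_training_set(positives, negative_pool, neg_per_pos=1000, seed=0):
    """Label positives 1, sample neg_per_pos negatives per positive, shuffle."""
    n_negatives = neg_per_pos * len(positives)
    if n_negatives > len(negative_pool):
        raise ValueError("negative pool too small for the requested ratio")
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, n_negatives)
    mixed = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(mixed)
    return mixed


# Placeholder identifiers stand in for real peptide sequences.
toy_positives = ["POS_A", "POS_B"]
toy_negative_pool = ["NEG_%05d" % i for i in range(5000)]
training_rows = mix_training_set(toy_positives, toy_negative_pool)
n_pos = sum(label for _, label in training_rows)
print(n_pos, len(training_rows) - n_pos)  # 2 2000
```

With classes this imbalanced, accuracy alone is uninformative, which is consistent with the text's use of PPV at a fixed recall as the tuning target.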

針對預測應用階段，首先同模型構建階段對待預測資料集進行預處理，將轉化後的待預測資料登錄至模型中預測，並使用劃定閾值進行分類，輸出待預測資料的免疫原性分類結果。高於閾值的肽預測和/或鑑定為由HLA分子呈遞。 For the prediction stage, the data set to be predicted is first preprocessed in the same way as in the model construction stage; the converted data are then input into the model for prediction and classified using the set threshold, and the immunogenicity classification of the data is output. Peptides scoring above the threshold are predicted and/or identified as being presented by HLA molecules.

實施例2. 神經網路模型構建 Example 2. Construction of the neural network model

使用來自公共資料庫IEDB、SYFPEITHI和MassIVE等中HLA分型為A*02：07的五組陽性肽，平均每組586.8條，按照1：10,000比例添加陰性對照肽。 Five groups of positive peptides with HLA type A*02:07 from public databases IEDB, SYFPEITHI and MassIVE were used, with an average of 586.8 in each group, and the negative control peptide was added at a ratio of 1:10,000.

使用本發明中以relu作為啟動函數和Lecun初始化的全連接神經網路模型，和未使用relu作為啟動函數(使用tanh作為啟動函數)且未使用Lecun初始化(不進行初始化)的全連接神經網路模型對測試資料集進行HLA分子呈遞可能性預測。樣本檢測結果如下表1，結果顯示，使用relu作為啟動函數和Lecun初始化的全連接神經網路在五組測試資料的平均檢測準確度提升了1倍以上(如圖2)。其中，真陽性表示預測為陽性且資料庫中為陽性的肽；假陰性表示預測為陰性且資料庫中為陽性的肽；假陽性表示預測為陽性且資料庫中為陰性的肽；真陰性表示預測為陰性且資料庫中為陰性的肽。 The fully connected neural network model of the present invention, with ReLU as the activation function and LeCun initialization, and a fully connected neural network model that uses tanh instead of ReLU as the activation function and no LeCun initialization (no initialization), were each used to predict the likelihood of HLA molecule presentation on the test data set. The results are shown in Table 1 below. They show that the average detection accuracy of the fully connected neural network with ReLU activation and LeCun initialization over the five groups of test data more than doubled (see Figure 2). Here, true positives are peptides predicted to be positive that are positive in the database; false negatives are peptides predicted to be negative that are positive in the database; false positives are peptides predicted to be positive that are negative in the database; and true negatives are peptides predicted to be negative that are negative in the database.
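
The four outcome categories defined above can be tallied as in the following sketch. The function name `confusion_counts` and the toy data are illustrative; "actual" here stands for the database label.

```python
def confusion_counts(predicted, actual):
    """Count TP, FN, FP and TN from parallel lists of booleans."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for is_predicted_positive, is_database_positive in zip(predicted, actual):
        if is_predicted_positive and is_database_positive:
            counts["TP"] += 1   # predicted positive, positive in database
        elif not is_predicted_positive and is_database_positive:
            counts["FN"] += 1   # predicted negative, positive in database
        elif is_predicted_positive and not is_database_positive:
            counts["FP"] += 1   # predicted positive, negative in database
        else:
            counts["TN"] += 1   # predicted negative, negative in database
    return counts


toy_predicted = [True, False, True, False, True]
toy_actual = [True, True, False, False, True]
print(confusion_counts(toy_predicted, toy_actual))
```

From these counts, the positive predictive value is TP / (TP + FP) and the recall is TP / (TP + FN).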

實施例3. 融合模型單HLA等位元基因分型的HLA分子呈遞可能性預測 Example 3. Prediction of the likelihood of HLA molecule presentation by the fusion model for a single HLA allele type

使用公開軟體netMHCpan和MHCflurry以及實施例一獲得的融合模型對測試資料集進行HLA分子呈遞可能性預測。樣本檢測結果如下表2，結果顯示，相對於公開軟體結果，實施例一獲得的融合模型在五組測試資料的平均檢測準確度提升了10倍以上。其中，真陽性表示預測為陽性且資料庫中為陽性的肽；假陰性表示預測為陰性且資料庫中為陽性的肽；假陽性表示預測為陽性且資料庫中為陰性的肽；真陰性表示預測為陰性且資料庫中為陰性的肽。 The public software netMHCpan and MHCflurry and the fusion model obtained in Example 1 were used to predict the probability of HLA molecule presentation on the test data set. The sample detection results are shown in Table 2 below. The results show that, compared with the published software results, the average detection accuracy of the five sets of test data for the fusion model obtained in Example 1 is improved by more than 10 times. Among them, true positives represent peptides predicted to be positive and positive in the database; false negatives represent peptides predicted to be negative and positive in the database; false positives represent peptides predicted to be positive and negative in the database; true negatives represent Peptides predicted to be negative and negative in the database.

部分檢測結果示例如表3所示，第二列為對應肽經實施例一獲得的融合模型預測的在A*02：07分型下的免疫原性，閾值為0.232，大於該閾值則表明融合模型該肽免疫原性為陽性。 Examples of some of the test results are shown in Table 3. The second column gives the immunogenicity of each peptide under the A*02:07 type as predicted by the fusion model obtained in Example 1. The threshold is 0.232; a score above this threshold indicates that the fusion model predicts the peptide to be immunogenicity-positive.

實施例4. 融合模型16組單HLA等位基因分型的HLA分子呈遞可能性預測 Example 4. Prediction of the likelihood of HLA molecule presentation by the fusion model for 16 single HLA allele types

使用來自MassIVE資料庫的16組不同HLA分型的每組200條(共計3200條)質譜陽性肽資料，按照1：10,000比例添加陰性對照肽。使用實施例一獲得的融合模型對測試資料集進行HLA分子呈遞可能性預測。樣本檢測結果如下表4，使用召回率40%時的陽性預測值作為評價標準，本揭露準確度提升為52.52%，相較於公開軟體準確度顯著提升。 The negative control peptides were added at a ratio of 1:10,000 using 200 mass spectrometry positive peptide data in each group (3200 in total) of 16 groups of different HLA types from the MassIVE database. Use the fusion model obtained in Example 1 to predict the possibility of HLA molecule presentation on the test data set. The sample detection results are shown in Table 4 below. Using the positive predictive value when the recall rate is 40% as the evaluation standard, the accuracy of this disclosure is increased to 52.52%, which is significantly improved compared to the accuracy of the public software.

使用其中2組不同HLA分型(A*24：02、A*02：01)的質譜陽性肽資料和本發明預測得到的陽性肽資料，分別繪製各HLA分型上陽性肽特徵圖，結果如圖3所示，本實施例中學習到的陽性肽特徵與質譜資料相似，說明學習結果較好。 Using the mass spectrometry positive peptide data of 2 groups of different HLA types (A*24:02, A*02:01) and the positive peptide data predicted by the present invention, respectively draw the positive peptide characteristic map on each HLA type, and the results are as follows As shown in FIG. 3 , the characteristics of the positive peptides learned in this example are similar to the mass spectrometry data, indicating that the learning results are good.

實施例5. 多HLA等位元基因分型質譜資料進行HLA分子呈遞可能性預測 Example 5. Prediction of the likelihood of HLA molecule presentation on multi-HLA-allele mass spectrometry data

本實例藉由以下方法鑑定陽性樣本和負對照樣本的肽，每組分別兩個重複： This example identifies peptides from positive and negative control samples, with two replicates each, by the following methods:

利用2.5X10⁸的A375細胞，用細胞裂解液(20mM Tris 8.0,1mM EDTA,100mM NaCl,1% Triton X-100)5ml，4℃冷庫搖床裂解1小時。再將蛋白裂解液在4℃預冷離心機進行14000轉每分鐘，30分鐘。將蛋白裂解液進行BCA蛋白定量後，按照4mg總蛋白量對應30μL蛋白A/G磁珠(Protein A/G beads)，6μg抗體(陽性樣本)或者6μg抗體(負對照) 進行免疫共沉澱樣品混合物(co-IP樣品，HLA相關複合物)的製備，並將從co-IP樣品放置4℃冷庫旋轉，均勻混合過夜。隨後，用co-IP清洗液(20mM Tris 8.0,1mM EDTA,100mM NaCl)對HLA相關複合物沉積的磁珠(beads)進行4次清洗。最後，用濃度為1-2mol/L的醋酸與HLA相關複合物沉積的beads室溫孵育10min，將HLA複合物從beads上沖提下來，離心1000rpm,1min。收集上清即為用於MS分析的樣品。 2.5 × 10⁸ A375 cells were lysed with 5 ml of cell lysis buffer (20 mM Tris 8.0, 1 mM EDTA, 100 mM NaCl, 1% Triton X-100) for 1 hour on a shaker in a 4°C cold room. The protein lysate was then centrifuged in a pre-cooled 4°C centrifuge at 14,000 rpm for 30 minutes. After BCA protein quantification of the lysate, co-immunoprecipitation sample mixtures (co-IP samples, HLA-associated complexes) were prepared using 30 μL of Protein A/G magnetic beads and 6 μg of antibody (positive sample) or 6 μg of antibody (negative control) per 4 mg of total protein, and the co-IP samples were rotated in a 4°C cold room and mixed evenly overnight. The magnetic beads carrying the HLA-associated complexes were then washed 4 times with co-IP wash buffer (20 mM Tris 8.0, 1 mM EDTA, 100 mM NaCl). Finally, the beads carrying the HLA-associated complexes were incubated with 1-2 mol/L acetic acid at room temperature for 10 min to elute the HLA complexes from the beads, followed by centrifugation at 1000 rpm for 1 min. The collected supernatant was the sample used for MS analysis.

The mass spectrometry samples were analyzed by liquid chromatography-mass spectrometry (LC-MS) to identify the HLA-presented antigenic peptides. The LC-MS data were analyzed with Mascot (V 2.3.0). The search parameters were set as follows: database, SwissProt Homo_sapiens; enzyme, None; missed cleavages, 0; peptide ion score required to be higher than 20. The main instruments used were a Thermo Scientific EASY-nLC 1200 nanoliter ultra-high-performance liquid chromatograph, a Thermo Scientific Q Exactive HF-X quadrupole-Orbitrap mass spectrometer and a Thermo Scientific C18 column (1.9 μm, 250 mm × 100 μm i.d.).
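The ion-score cutoff above amounts to a simple post-search filter. The sketch below assumes the search results have already been exported to a list of records with hypothetical `peptide` and `ion_score` fields; the records themselves are made up for illustration and this is not Mascot's own API.

```python
def filter_peptides(psms, min_score=20.0):
    """Keep unique peptide sequences whose ion score is strictly above min_score."""
    kept = []
    seen = set()
    for psm in psms:
        if psm["ion_score"] > min_score and psm["peptide"] not in seen:
            seen.add(psm["peptide"])
            kept.append(psm["peptide"])
    return kept


# Illustrative peptide-spectrum matches (not experimental data).
psms = [
    {"peptide": "SLYNTVATL", "ion_score": 35.2},
    {"peptide": "GILGFVFTL", "ion_score": 18.9},   # below the cutoff
    {"peptide": "SLYNTVATL", "ion_score": 27.0},   # duplicate sequence
    {"peptide": "NLVPMVATV", "ion_score": 20.0},   # not strictly higher than 20
]
print(filter_peptides(psms))
```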

The multi-HLA-type positive peptide data from the above mass spectrometry experiment and the negative-control peptide data from the negative control were each scored with the fusion model obtained in Example 1 to predict antigen immunogenicity, and immunogenicity was called using the built-in threshold. The built-in threshold corresponds to a recall of 40% on the training set, so the predictions on the positive data were likewise evaluated by recall. The results are shown in Table 5 below: recall in the test samples was greater than 40%, whereas the corresponding rate in the negative control samples (the fraction of peptides called positive) was far below 40%, indicating that the present invention performs well on multi-HLA typing data.
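The evaluation logic in the preceding paragraph can be sketched as follows: pick the score cutoff at which 40% of the training positives are recovered, then measure the fraction of test positives and negative-control peptides at or above that cutoff. All scores below are made-up numbers and the function names are assumptions for illustration; how the actual built-in threshold was derived is not specified beyond the 40% recall figure.

```python
import math


def threshold_at_recall(positive_scores, recall=0.40):
    """Score cutoff at which the given fraction of positives scores at or above it."""
    ranked = sorted(positive_scores, reverse=True)
    # Small epsilon guards against floating-point noise before rounding up.
    k = max(1, math.ceil(recall * len(ranked) - 1e-9))
    return ranked[k - 1]


def fraction_called(scores, threshold):
    """Fraction of peptides whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Illustrative model scores (not experimental data).
train_pos = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
test_pos = [0.99, 0.85, 0.75, 0.65, 0.20]
neg_ctrl = [0.72, 0.30, 0.10, 0.05, 0.60, 0.20, 0.15, 0.40, 0.33, 0.01]

cutoff = threshold_at_recall(train_pos)
print("cutoff:", cutoff)
print("test recall:", fraction_called(test_pos, cutoff))
print("negative-control call rate:", fraction_called(neg_ctrl, cutoff))
```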

The test results and statistics show that the machine-learning-based antigen immunogenicity prediction method of the present disclosure accurately predicts antigen immunogenicity, with a positive predictive value significantly higher than that of currently available public software.

Example 6. Mass spectrometry analysis of HLA types that are frequent in the population, to expand the positive peptide set of the machine learning model system

In this example, the following method was used to identify positive polypeptides that bind HLA types occurring at high frequency in the population and that have not been reported in public databases. The example comprises three main steps: construction of single-HLA-type cell lines, preparation and quantification of HLA mass spectrometry samples, and mass spectrometry analysis.

Stably transduced cell lines each expressing a single HLA type were constructed by lentiviral infection of K562 cells. The number of cells used for mass spectrometry sample preparation was adjusted according to the HLA expression level of each cell line: 1×10⁸ to 5×10⁸ cells were lysed in 5-20 ml of cell lysis buffer (20 mM Tris 8.0, 1 mM EDTA, 100 mM NaCl, 1% Triton X-100) at 4°C for 1 hour and then centrifuged at 14,000 rpm for 30 minutes in a centrifuge pre-cooled to 4°C. After BCA protein quantification of the lysate, the co-immunoprecipitation sample mixture (co-IP sample, HLA-associated complexes) was prepared at a ratio of 30 μL Protein A/G beads and 6 μg MHC-I antibody (positive sample) or 6 μg normal mouse IgG antibody (negative control) per 4 mg of total protein. The co-IP samples were mixed evenly at 4°C overnight, and the co-IP beads were then washed 4 times with wash buffer (20 mM Tris 8.0, 1 mM EDTA, 100 mM NaCl). Finally, the co-IP beads were incubated with 1-2 mol/L acetic acid for 10 min to elute the HLA complexes, followed by centrifugation at 1000 rpm for 1 min. The collected supernatant is the sample for mass spectrometry analysis (MS sample).

5% of each mass spectrometry sample was taken for western blotting and silver staining, to qualitatively evaluate the specificity of the sample and to quantify the total amount of HLA-associated protein complexes, respectively. The sample was first premixed with SDS-PAGE protein loading buffer (5X) and boiled at 100°C for 5 minutes, then divided into two portions for SDS-PAGE electrophoresis. One portion was transferred to a PVDF membrane and incubated with a rabbit anti-human MHC-I antibody followed by an HRP-conjugated anti-rabbit IgG secondary antibody for western blotting. As shown in Figure 5, comparing the normal mouse IgG group (negative control) with the MHC-I antibody group (positive sample), the MHC-I signal appears in the positive sample group and is absent from the negative control group (Figure 5, lanes #6 and #7), showing that the sample preparation process is specific. By comparing the HLA protein content of the cell lysate before and after co-IP (Figure 5, lanes #1 and #4), the co-IP enrichment efficiency can be calculated. In Figure 5, lane #1 is the protein lysate before co-IP, lane #2 is the marker, lane #3 is the protein lysate after IgG co-IP, lane #4 is the protein lysate after HLA antibody co-IP, lane #5 is empty, lane #6 is the normal mouse IgG group (negative control), and lane #7 is the MHC-I antibody group (positive sample). The results show that the current experimental system enriches more than 90% of the HLA protein in the protein lysate, indicating that the sample preparation process is efficient.
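The enrichment efficiency derived from lanes #1 and #4 can be read as the fraction of HLA signal depleted from the lysate by the co-IP. A minimal sketch, assuming band intensities have already been obtained by densitometry; the function name and the intensity values are hypothetical and are not readings from Figure 5.

```python
def enrichment_efficiency(hla_before, hla_after):
    """Fraction of the HLA signal removed from the lysate by the co-IP.

    hla_before: band intensity in the lysate before co-IP (lane #1)
    hla_after:  band intensity in the lysate after HLA-antibody co-IP (lane #4)
    """
    if hla_before <= 0:
        raise ValueError("pre-co-IP signal must be positive")
    return (hla_before - hla_after) / hla_before


# Hypothetical densitometry readings in arbitrary units.
eff = enrichment_efficiency(hla_before=1000.0, hla_after=80.0)
print(f"{eff:.0%}")
```

An efficiency above 0.90 under this definition would correspond to the "more than 90%" enrichment reported in the text.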

The other SDS-PAGE gel was silver-stained directly. By comparison with the signal values of 1 ng, 5 ng and 10 ng of the protein standard BSA, the total protein amount of the mass spectrometry sample can be calculated to support the mass spectrometry analysis. In Figure 6, lane #1 is the marker, lane #2 is the negative control of mass spectrometry sample 1, lane #3 is the positive control of mass spectrometry sample 1, lane #4 is the negative control of mass spectrometry sample 2, lane #5 is the positive control of mass spectrometry sample 2, lane #6 is the marker, lane #7 is 1 ng protein standard (BSA), lane #8 is 5 ng protein standard (BSA), and lane #9 is 10 ng protein standard (BSA). Arrows indicate proteins specifically enriched in the MHC-I antibody group (positive sample); the band near 45 kDa was identified as HLA protein (arrow labeled HLA). A total protein amount of 1.0-12.4 μg was obtained in this experiment.
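Quantification against the 1, 5 and 10 ng BSA standards can be sketched as a linear standard curve. The sketch assumes the silver-stain signal is approximately linear in protein amount over this range, which the source does not state; the signal values are made up, and readings outside the 1-10 ng range would be extrapolations.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# BSA standards loaded on the gel (ng) and hypothetical band signals.
bsa_ng = [1.0, 5.0, 10.0]
bsa_signal = [210.0, 1010.0, 2010.0]
slope, intercept = fit_line(bsa_ng, bsa_signal)


def signal_to_ng(signal):
    """Convert a band signal to nanograms using the BSA standard curve."""
    return (signal - intercept) / slope


print(round(signal_to_ng(1410.0), 3))
```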

As shown in Figure 4, the fusion model retrained with these positive peptides performs better than the model trained without them, with an accuracy improvement of about 15%, and its positive predictive value is significantly higher than that of currently available public software.

The above are only specific application examples of the present invention and do not limit its scope of protection in any way. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any similar technical solution formed by equivalent transformation or equivalent replacement falls within the scope of the claims of the present invention.

Claims

A method of predicting and/or identifying the likelihood of peptide presentation by HLA molecules, comprising:

Step 1: constructing a data set and encoding the data, including numerically encoding the peptides, expression levels and HLA typing data in the data set;

Step 2: training a neural network model and an ensemble learning model, including parameter training on data provided by mass spectrometry experiments;

Step 3: predicting and/or identifying the likelihood of peptide presentation by HLA molecules, including fusing the neural network model coefficient matrix and the ensemble learning model coefficient matrix, and predicting and/or identifying, from the peptides, expression levels and HLA typing data of the sample to be tested, the likelihood that each peptide of the sample is presented by HLA molecules.

The method of claim 1, wherein the neural network model is selected from a fully connected neural network model, a convolutional neural network model, and a long short-term memory neural network model, preferably a fully connected neural network model.

The method according to claim 1 or 2, wherein the neural network model architecture is 3-5 layers; preferably, the neural network model architecture is 4 layers, and the number of neurons in each layer is 256, 32, 16 and 1.

The method of any one of claims 1 to 3, wherein the ensemble learning model is selected from random forest, AdaBoost, XGBoost and LightGBM, preferably XGBoost.

The method of any one of claims 1 to 4, wherein the data set includes a positive peptide set and a negative peptide set, divided into one or more training data sets and one or more test data sets.

The method of any one of claims 1 to 5, wherein the length of the peptides in the data set is 5-15 amino acids, preferably 7-12 amino acids, most preferably 8-11 amino acids.

The method of any one of claims 1 to 6, wherein the data encoding is selected from one-hot encoding, BLOMAP, PSSM, word2vec and BLOSUM62, preferably one-hot encoding.

The method of any one of claims 1 to 7, wherein the neural network model performs parameter training on mass spectrometry experimental data:

Suppose there are m training peptides, each corresponding to n HLA types:

wherein

the input matrix is the data matrix, without HLA typing, produced by the data coding module for the training peptides, each row of the matrix being the encoded data row of one peptide;

the label vector is the positive/negative label of each training peptide in the mass spectrometry experiment; and β_{target 1} is the neural network coefficient matrix, obtained through cross-validation training as the coefficients at which the accuracy is largest.
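The formula for this claim did not survive in the text, so the following is only one possible reading, written under stated assumptions: the symbols X_1 (encoded peptide matrix without HLA typing, with m rows and an assumed d_1 columns), Y (label vector), f_NN (the network) and Acc_CV (cross-validated accuracy) are introduced here for illustration and are not taken from the source.

```latex
\beta_{\text{target 1}}
  = \arg\max_{\beta}\;
    \operatorname{Acc}_{\mathrm{CV}}\!\bigl( f_{\mathrm{NN}}(X_1;\beta),\; Y \bigr),
\qquad
X_1 \in \mathbb{R}^{m \times d_1},\quad
Y \in \{0,1\}^{m}
```

How the n HLA types enter the formula cannot be recovered from the surviving text and is left open here.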

The method of any one of claims 1 to 8, wherein the activation function of the neural network model is relu.

The method of any one of claims 1 to 9, wherein the initializer of the neural network model is a Lecun initializer.
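The preferred architecture in the claims (four fully connected layers of 256, 32, 16 and 1 neurons, relu activation, LeCun initializer) can be sketched as a plain forward pass. The input width of 220, the LeCun-style normal initialization with variance 1/fan_in, the zero biases and the sigmoid on the output neuron are assumptions of this sketch; the weights are random, so the score is meaningless until the network is trained.

```python
import math
import random

LAYER_SIZES = [256, 32, 16, 1]  # neurons per layer, as in the claims


def init_layers(input_dim, sizes=LAYER_SIZES, seed=0):
    """LeCun-style normal initialization: std = sqrt(1 / fan_in), zero biases."""
    rng = random.Random(seed)
    layers = []
    fan_in = input_dim
    for fan_out in sizes:
        std = math.sqrt(1.0 / fan_in)
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def forward(x, layers):
    """relu on hidden layers; sigmoid on the single output neuron (assumed)."""
    activation = x
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [max(0.0, v) for v in z]
        else:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return activation[0]


layers = init_layers(input_dim=220)
x = [1.0 if i % 20 == 0 else 0.0 for i in range(220)]  # toy one-hot-like input
score = forward(x, layers)
print(score)
```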

The method of any one of claims 1 to 10, wherein the ensemble learning model performs parameter training on mass spectrometry experimental data:

Suppose there are m training peptides, each corresponding to n HLA types:

wherein

the input matrix is the data matrix, including HLA typing, produced by the data encoding for the training peptides, each row of the matrix being the encoded data row of one peptide;

the label vector is the positive/negative label of each training peptide in the mass spectrometry experiment; and β_{target 2} is the ensemble learning coefficient matrix, obtained through cross-validation training as the coefficients at which the accuracy is largest.

The method according to any one of claims 1 to 11, wherein the strategy for fusing the neural network model and the ensemble learning model is selected from an averaging method, a voting method and a learning method; preferably, the fusion comprises combining the output score of the ensemble learning model with the output score of the neural network model, a score being multiplied by a weight coefficient.

The method according to claim 12, wherein the weight coefficient is 0.1-0.00001, preferably 0.05-0.0001, and optimally 0.01.
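A weighted fusion with the most preferred coefficient of 0.01 might look like the sketch below. The claim text does not make clear which model's score carries the weight or how the scores are combined; adding the weighted ensemble score to the neural network score is purely an assumption made for illustration, as are the function name and the input values.

```python
def fuse_scores(nn_score, ensemble_score, weight=0.01):
    """Weighted combination of the two model outputs.

    Assumption: the ensemble score is scaled by `weight` and added to the
    neural network score. The source does not confirm this exact form.
    """
    return nn_score + weight * ensemble_score


# Hypothetical per-peptide scores from the two models.
fused = fuse_scores(nn_score=0.80, ensemble_score=0.50)
print(round(fused, 4))
```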

The method of any one of claims 1 to 13, wherein the prediction and/or identification is selected from single HLA allele prediction and/or identification, multi-allele prediction and/or identification.

The method of any one of claims 1 to 14, wherein the predicting and/or identifying computes an expected value of the probability that the peptide is presented by HLA molecules, specifically:

wherein X is the data row obtained after each peptide is encoded by the data encoding.

The method according to any one of claims 1 to 15, wherein either the highest-scoring 0.1% of peptides are taken as positive, or the positive predictive value at 40% recall is used as the threshold, and the peptides in the sample to be tested that score above the threshold are predicted and/or identified as being presented by HLA molecules.
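The first thresholding option, taking the highest-scoring 0.1% as positive, can be sketched as a rank-based cutoff. The scores are synthetic, the function name is an assumption, and treating peptides at the cutoff score as positive is a choice of this sketch rather than something the claim specifies.

```python
import math


def top_fraction_cutoff(scores, fraction=0.001):
    """Score of the last peptide inside the top `fraction` of the ranking."""
    ranked = sorted(scores, reverse=True)
    # Small epsilon guards against floating-point noise before rounding up.
    k = max(1, math.ceil(fraction * len(ranked) - 1e-9))
    return ranked[k - 1]


# Synthetic scores for 10,000 candidate peptides: 0.0000 to 0.9999.
scores = [i / 10000 for i in range(10000)]
cutoff = top_fraction_cutoff(scores)
positives = [s for s in scores if s >= cutoff]
print(cutoff, len(positives))
```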

A method of preparing a peptide predicted and/or identified by a method as in any one of claims 1 to 16.

A method of preparing a polynucleotide encoding a peptide prepared by the method of claim 17.

Use of the method of any one of claims 1 to 18 in the preparation of an mRNA, a polypeptide vaccine, an antitumor drug or a tumor vaccine, wherein, preferably, the tumor is selected from the group consisting of: lung cancer, melanoma, breast cancer, ovarian cancer, prostate cancer, kidney cancer, stomach cancer, colon cancer, testicular cancer, head and neck cancer, pancreatic cancer, brain cancer, B-cell lymphoma, acute myeloid leukemia, chronic myeloid leukemia, chronic lymphocytic leukemia, T-cell lymphocytic leukemia, non-small cell lung cancer, and small cell lung cancer.

A device for predicting and/or identifying the likelihood of peptide presentation by HLA molecules, characterized in that the device comprises a memory for storing a program and a processor for executing the program to implement the method of any one of claims 1 to 16.

A computer-readable storage medium, characterized by comprising a program which, when executed by a processor, performs the method according to any one of claims 1 to 16.

A method for predicting and/or identifying the likelihood of one or more peptides being presented by HLA molecules, comprising the steps of: constructing a data set and encoding the data; training and fusing a neural network model and an ensemble learning model; and predicting and/or identifying the likelihood of peptide presentation by HLA molecules.

A database of HLA allele-specific binding information, comprising information on, or encodings of, peptides predicted and/or identified by the method of any one of claims 1 to 16, or polynucleotide data encoding peptides predicted and/or identified by the method of any one of claims 1 to 16.