Disclosure of Invention
The invention mainly aims to provide a method and a device for detecting tumor neoantigen polypeptide, so as to solve the problems of incomplete and inaccurate detection result of tumor neoantigen in the prior art.
To achieve the above object, according to one aspect of the present invention, there is provided a method for detecting a tumor neoantigen polypeptide, the method comprising: obtaining somatic mutations and germline mutations of a tumor tissue; performing HLA typing using sequencing data of a tumor control blood cell sample to obtain an HLA typing result; performing neoantigen polypeptide prediction on the somatic mutations and the germline mutations using the HLA typing result to obtain candidate neoantigen polypeptides; and scoring and ranking the candidate neoantigen polypeptides, wherein the polypeptide with the highest score is the neoantigen polypeptide.
Further, before neoantigen polypeptide prediction is performed on the somatic and germline mutations using the HLA typing result, the method further comprises: merging the somatic-specific mutations and the germline mutations and annotating them with VEP to obtain, for each mutation, the gene, the transcript and the altered polypeptide fragment.
Further, performing neoantigen polypeptide prediction on the somatic and germline mutations comprises: performing MHC affinity tests on, and scoring, the polypeptide fragments altered by each of the somatic and germline mutations to obtain a sum score for each polypeptide fragment; wherein the MHC affinity test comprises: (1) using polypeptide fragments 9-11 amino acids in length; (2) using polypeptide fragments in which the mutation occupies a plurality of different positions; and (3) using a plurality of affinity test methods, the test methods comprising at least one of: MHCflurry, MHCnuggetsI, NNalign and NetMHC.
Further, the MHC affinity test further comprises at least one of: (4) using polypeptide fragments from a plurality of transcripts; (5) using a plurality of different HLA types.
Further, scoring and ranking the candidate neoantigen polypeptides comprises: ranking the candidate neoantigen polypeptides by the sum score of each polypeptide fragment, wherein the polypeptide with the highest score is the neoantigen polypeptide.
Further, performing HLA typing using the sequencing data of the tumor control blood cell sample to obtain the HLA typing result comprises: aligning the sequencing data of the tumor control blood cell sample to known HLA alleles in the IMGT/HLA database to obtain an alignment matrix; merging, deleting and collating the alignment matrix to obtain a collated matrix; and processing the collated matrix with an optimization algorithm to obtain the HLA typing result.
Further, the alignment matrix comprises a plurality of columns and a plurality of rows, the columns being all the HLA types and the rows being all the reads, and merging, deleting and collating the alignment matrix comprises: merging identical rows to obtain a weight row; and, for any two columns a and b, deleting column a if column b contains all the reads of column a and also contains other reads not in column a.
Further, an ILP (integer linear programming) optimization algorithm is used to solve the collated matrix to obtain the HLA typing result.
Further, OptiType software is used to align the sequencing data of the tumor control blood cell sample to known HLA alleles in the IMGT/HLA database to obtain the alignment matrix.
Further, obtaining the somatic and germline mutations of the tumor tissue comprises: performing paired somatic mutation detection using the tumor control blood cell sample and the tumor tissue sample to obtain the somatic mutations of the tumor tissue; and performing germline mutation detection on the tumor control blood cell sample using GATK to obtain the germline mutations.
To achieve the above object, according to another aspect of the present invention, there is provided a device for detecting a tumor neoantigen polypeptide, the device comprising: an acquisition module, an HLA typing module, a candidate neoantigen prediction module and a neoantigen prediction module; the acquisition module is used for obtaining somatic mutations and germline mutations of the tumor tissue; the HLA typing module is used for performing HLA typing using the sequencing data of the tumor control blood cell sample to obtain an HLA typing result; the candidate neoantigen prediction module is used for performing neoantigen polypeptide prediction on the somatic and germline mutations using the HLA typing result to obtain candidate neoantigen polypeptides; and the neoantigen prediction module is used for scoring and ranking the candidate neoantigen polypeptides and marking the polypeptide with the highest score as the neoantigen polypeptide.
Further, the device further comprises a mutation merging and annotation module, which is used for merging the somatic-specific mutations and the germline mutations and annotating them with VEP to obtain, for each mutation, the gene, the transcript and the altered polypeptide fragment.
Further, the candidate neoantigen prediction module comprises an MHC affinity testing module, which is used for performing MHC affinity tests on, and scoring, the polypeptide fragments altered by each of the somatic and germline mutations to obtain a sum score for each polypeptide fragment; wherein the MHC affinity test comprises: (1) using polypeptide fragments 9-11 amino acids in length; (2) using polypeptide fragments in which the mutation occupies a plurality of different positions; and (3) using a plurality of affinity test tools, the test tools comprising at least one of: MHCflurry, MHCnuggetsI, NNalign and NetMHC.
Further, the MHC affinity test further comprises at least one of: (4) using polypeptide fragments from a plurality of transcripts; (5) using a plurality of different HLA types.
Further, the neoantigen prediction module is used for ranking the candidate neoantigen polypeptides by the sum score of each polypeptide fragment, wherein the polypeptide with the highest score is the neoantigen polypeptide.
Further, the HLA typing module comprises an alignment unit, a merging and deleting unit and an HLA typing unit, wherein the alignment unit is used for aligning the sequencing data of the tumor control blood cell sample to known HLA alleles in the IMGT/HLA database to obtain an alignment matrix; the merging and deleting unit is used for merging, deleting and collating the alignment matrix to obtain a collated matrix; and the HLA typing unit is used for processing the collated matrix with an optimization algorithm to obtain the HLA typing result.
Further, the alignment matrix comprises a plurality of columns and a plurality of rows, the columns being all the HLA types and the rows being all the reads, and the merging and deleting unit comprises a merging subunit and a deleting subunit, wherein the merging subunit is used for merging identical rows (identical here meaning completely identical) to obtain a weight row (the weight row records the number of repeated reads); and the deleting subunit is used for, for any two columns a and b, deleting column a if column b contains all the reads of column a and also contains other reads not in column a.
Further, the HLA typing unit comprises an ILP optimization module, which is used for solving the collated matrix with an ILP (integer linear programming) optimization algorithm to obtain the HLA typing result.
Further, the alignment unit is OptiType software.
Further, the acquisition module comprises a somatic mutation acquisition unit and a germline mutation acquisition unit, wherein the somatic mutation acquisition unit is used for performing paired somatic mutation detection using the tumor control blood cell sample and the tumor tissue sample to obtain the somatic mutations of the tumor tissue; and the germline mutation acquisition unit is used for performing germline mutation detection on the tumor control blood cell sample using GATK to obtain the germline mutations.
According to another aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform any of the above-described methods for detecting a tumor neoantigen polypeptide.
According to another aspect of the present invention, there is provided a processor for executing a program, wherein the program is executed for performing any of the above-mentioned methods for detecting a tumor neoantigen polypeptide.
By applying the technical solution of the invention, neoantigen polypeptides are predicted from the combined set of mutations from two sources, namely the somatic mutations and the germline mutations of the tumor tissue, so the mutation sources are more comprehensive and the prediction result is correspondingly more accurate. In addition, the method scores and ranks the predicted candidate neoantigen polypeptides and takes the candidate with the highest score as the neoantigen polypeptide, which makes it easier to obtain a more accurate neoantigen polypeptide and further improves its value in guiding subsequent immunotherapy medication.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
HLA: human leukocyte antigen, the expression product of the human Major Histocompatibility Complex (MHC), is an alloantigen with high polymorphism.
The IMGT/HLA database mainly contains sequence information on alleles of the HLA class I and class II genes, and also contains alleles of some non-HLA genes.
The HLA class I genes mainly comprise the classical HLA-A, HLA-B and HLA-C genes, among others, and also include some pseudogenes.
The HLA class II genes mainly comprise the DR, DQ, DP, DO and DM series of genes.
For each allele, in addition to the standard name specified by the WHO nomenclature committee, an HLA number is also provided. For example: HLA00001, A*01:01:01:01; HLA02169, A*01:01:01:02N; HLA01244, A*01:01:02. Taking A*01:01:01:02N as an example, the HLA allele name is divided into four fields followed by a modification suffix, for a total of 5 fields.
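By way of illustration only, the field structure of an allele name described above can be sketched as follows. The helper name `parse_allele` and the regular expression are assumptions introduced for this sketch and are not part of the IMGT/HLA database or of any tool named in this application.

```python
import re

# Hypothetical pattern: gene name, "*", one to four colon-separated numeric
# fields, then an optional single-letter modification suffix (e.g. "N").
ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})([A-Z]?)$")


def parse_allele(name):
    """Split an allele name such as 'A*01:01:01:02N' into gene, fields, suffix."""
    match = ALLELE_RE.match(name.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognised allele name: {name!r}")
    gene, digits, suffix = match.groups()
    return {"gene": gene, "fields": digits.split(":"), "suffix": suffix or None}


parsed = parse_allele("A*01:01:01:02N")
print(parsed)
```

For A*01:01:01:02N the sketch yields four numeric fields plus the suffix N, matching the 5 fields counted in the text; for A*01:01:02 it yields three fields and no suffix.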
The database provides various formats such as fasta, msf, pir and the like, and provides sequences of DNA, RNA and protein at three different levels.
Example 1
In a preferred embodiment of the present application, a method for detecting a tumor neoantigen polypeptide is provided, and fig. 1 is a flowchart of a method for detecting a tumor neoantigen polypeptide according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step S101, obtaining somatic mutation and germ line mutation of tumor tissues;
step S102, performing HLA typing by using the sequencing data of the tumor control blood cell sample to obtain an HLA typing result;
step S103, utilizing HLA typing results to carry out neoantigen polypeptide prediction on somatic mutation and germ line mutation to obtain candidate neoantigen polypeptides;
step S104, scoring and ranking the candidate neoantigen polypeptides, wherein the polypeptide with the highest score is the neoantigen polypeptide.
According to the method, neoantigen polypeptides are predicted from the combined set of mutations from two sources, namely the somatic mutations and the germline mutations of the tumor tissue, so the mutation sources are more comprehensive and the prediction result is correspondingly more accurate. In addition, the method scores and ranks the predicted candidate neoantigen polypeptides and takes the candidate with the highest score as the neoantigen polypeptide, which makes it easier to obtain a more accurate neoantigen polypeptide and further improves its value in guiding subsequent immunotherapy medication.
Before neoantigen polypeptide prediction is performed on the somatic and germline mutations using the HLA typing result, it is also necessary to determine the polypeptide fragment affected by each mutation and whether that fragment is changed. This determination may be made using existing methods. In a preferred embodiment, the method further comprises: merging the somatic-specific mutations and the germline mutations and annotating them with VEP to obtain, for each mutation, the gene, the transcript and the altered polypeptide fragment.
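By way of illustration only, the merging of the two mutation sets before annotation can be sketched as below. This is a minimal sketch and not the merging tool used in the examples; the record layout (`chrom`, `pos`, `ref`, `alt`), the function name `merge_mutations` and the example records are all hypothetical.

```python
def merge_mutations(somatic, germline):
    """Merge two mutation lists keyed by (chrom, pos, ref, alt), tagging origin."""
    merged = {}
    for source, records in (("somatic", somatic), ("germline", germline)):
        for rec in records:
            key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
            # Create the entry on first sight, then record every source seen.
            entry = merged.setdefault(key, {**rec, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return sorted(merged.values(), key=lambda r: (r["chrom"], r["pos"]))


# Made-up example records, for illustration only.
somatic = [{"chrom": "1", "pos": 1000, "ref": "A", "alt": "T"}]
germline = [
    {"chrom": "1", "pos": 500, "ref": "G", "alt": "C"},
    {"chrom": "2", "pos": 42, "ref": "C", "alt": "A"},
]
merged = merge_mutations(somatic, germline)
for rec in merged:
    print(rec)
```

Keeping a `sources` tag on each merged record is a design choice of this sketch; it lets later steps tell whether a candidate polypeptide came from a somatic or a germline mutation.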
In a preferred embodiment, performing neoantigen polypeptide prediction on the somatic and germline mutations comprises: performing MHC affinity tests on, and scoring, the polypeptide fragments altered by each of the somatic and germline mutations to obtain a sum score for each polypeptide fragment; wherein the MHC affinity test comprises: (1) using polypeptide fragments 9-11 amino acids in length; (2) using polypeptide fragments in which the mutation occupies a plurality of different positions; and (3) using a plurality of affinity test methods, the test methods comprising at least one of: MHCflurry, MHCnuggetsI, NNalign and NetMHC.
In a further preferred embodiment, the MHC affinity test further comprises at least one of: (4) using polypeptide fragments from a plurality of transcripts; (5) using a plurality of different HLA types.
For the polypeptide fragment formed by each mutation, MHC affinity is tested using fragments of multiple lengths, generally 9-11 amino acids. In addition to the variation in length, for a single mutation site the mutation may occupy different positions within the polypeptide fragment; if the mutation lies near an alternative splice site, the polypeptide fragments also differ from transcript to transcript. In addition, different HLA types have different MHC affinities for a given polypeptide fragment.
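By way of illustration only, enumerating fragments of several lengths with the mutation at different positions can be sketched as a sliding window over the mutated protein sequence. The function name `mutant_windows`, the 0-based index convention and the example sequence are assumptions made for this sketch.

```python
def mutant_windows(protein, mut_index, lengths=(9, 10, 11)):
    """Return (start, peptide) for every window that covers mut_index (0-based)."""
    windows = []
    for k in lengths:
        first = max(0, mut_index - k + 1)          # leftmost start still covering the site
        last = min(mut_index, len(protein) - k)    # rightmost start that fits in the protein
        for start in range(first, last + 1):
            windows.append((start, protein[start:start + k]))
    return windows


# Made-up 30-residue sequence with a hypothetical mutated residue at index 15.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"
windows = mutant_windows(protein, 15)
print(len(windows))
```

Away from the ends of the protein, a fragment of length k admits k placements of the mutated residue, so lengths 9, 10 and 11 together give 30 fragments for one mutation in one transcript.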
These polypeptide fragments all derive from one mutation; although their sequences are similar, their MHC affinities differ, so analyzing MHC affinity across polypeptide fragments chosen along multiple dimensions gives a more comprehensive picture of whether a mutation may give rise to a neoantigen polypeptide. In addition, the MHC affinity test uses a plurality of methods, including MHCflurry, MHCnuggetsI, NNalign and NetMHC; testing the above polypeptides from one mutation yields a plurality of affinity scores, which are summed for each polypeptide. Using multiple algorithms for affinity analysis improves the robustness of the analysis and the reliability of its results.
In a preferred embodiment, scoring and ranking the candidate neoantigen polypeptides comprises: ranking the candidate neoantigen polypeptides by the sum score of each polypeptide fragment, wherein the polypeptide with the highest score is the neoantigen polypeptide.
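By way of illustration only, the summing and ranking described above can be sketched as follows. The sketch does not call any of the named prediction tools; it assumes their per-peptide scores have already been computed and oriented so that a higher score is better. The function name `rank_by_sum_score`, the peptide strings and all numeric values are made up.

```python
def rank_by_sum_score(scores):
    """scores: {peptide: {tool_name: score}} -> [(peptide, total)], best first."""
    totals = {pep: sum(per_tool.values()) for pep, per_tool in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, pre-computed scores (assumption: higher means stronger binding).
scores = {
    "PEPTIDEAA": {"MHCflurry": 0.50, "NetMHC": 0.25},
    "PEPTIDEBB": {"MHCflurry": 0.75, "NetMHC": 0.50},
    "PEPTIDECC": {"MHCflurry": 0.25, "NetMHC": 0.25},
}
ranking = rank_by_sum_score(scores)
top_peptide, top_score = ranking[0]
print(top_peptide, top_score)
```

If a tool reports a quantity where lower is better, it would need to be transformed before summing; that transformation is outside this sketch.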
HLA typing using the sequencing data of the tumor control blood cell sample can be performed by existing methods, for example by typing the tumor control blood cell sample with both BWA-HLA and Polysolver software and then taking the intersection of the two typing results. In a preferred embodiment, performing HLA typing using the sequencing data of the tumor control blood cell sample to obtain the HLA typing result comprises: aligning the sequencing data of the tumor control blood cell sample to known HLA alleles in the IMGT/HLA database to obtain an alignment matrix; merging, deleting and collating the alignment matrix to obtain a collated matrix; and processing the collated matrix with an optimization algorithm to obtain the HLA typing result.
In a preferred embodiment, the alignment matrix comprises a plurality of columns and a plurality of rows, the columns being all the HLA types and the rows being all the reads, and merging, deleting and collating the alignment matrix comprises: merging identical rows (identical here meaning completely identical) to obtain a weight row (the weight row records the number of repeated reads); and, for any two columns a and b, deleting column a if column b contains all the reads of column a and also contains other reads not in column a.
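By way of illustration only, the merging and deleting operations can be sketched on a small binary matrix in which entry 1 means that a read aligns to an allele. This is a minimal sketch and not the implementation of any named software; the function name `reduce_matrix` and the example matrix are assumptions.

```python
from collections import Counter


def reduce_matrix(matrix, alleles):
    """Merge identical read rows into weights, then drop dominated allele columns."""
    # Merge identical rows; the count of each distinct row is its weight.
    row_counts = Counter(tuple(row) for row in matrix)
    rows = list(row_counts)
    weights = [row_counts[r] for r in rows]
    # Set of (merged) reads hitting each allele column.
    col_reads = {a: {i for i, r in enumerate(rows) if r[j]}
                 for j, a in enumerate(alleles)}
    # Delete column a when some column b holds all of a's reads plus others.
    kept = [a for a in alleles
            if not any(col_reads[a] < col_reads[b] for b in alleles if b != a)]
    idx = [alleles.index(a) for a in kept]
    reduced = [[r[j] for j in idx] for r in rows]
    return reduced, weights, kept


alleles = ["A*01:01", "A*02:01", "A*24:03"]
matrix = [
    [1, 0, 1],  # read 1
    [1, 0, 1],  # read 2, identical to read 1
    [0, 1, 0],  # read 3
    [0, 0, 1],  # read 4
]
reduced, weights, kept = reduce_matrix(matrix, alleles)
print(kept, weights, reduced)
```

In this made-up example the two identical reads collapse into one row of weight 2, and the column for A*01:01 is deleted because every read it contains is also contained in the A*24:03 column, which has one further read.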
In a preferred embodiment, an ILP (integer linear programming) optimization algorithm is used to solve the collated matrix to obtain the HLA typing result.
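By way of illustration only, the selection problem can be sketched for a single locus. The sketch assumes, as one possible formulation that the text does not spell out, that the objective is to choose at most two alleles so that the weighted number of explained reads is maximal. It replaces the ILP solver with exhaustive enumeration, which is only practical for very small matrices; the function name `best_pair` and the data are made up.

```python
from itertools import combinations_with_replacement


def best_pair(rows, weights, alleles):
    """Exhaustively pick the allele pair (possibly homozygous) explaining most reads."""
    best, best_score = None, -1
    for i, j in combinations_with_replacement(range(len(alleles)), 2):
        # A merged read row counts, with its weight, if either chosen allele hits it.
        score = sum(w for row, w in zip(rows, weights) if row[i] or row[j])
        if score > best_score:
            best, best_score = (alleles[i], alleles[j]), score
    return best, best_score


# Hypothetical collated matrix: rows are merged reads, columns are alleles.
alleles = ["A*02:01", "A*24:03"]
rows = [[0, 1], [1, 0], [0, 1]]
weights = [2, 1, 1]
pair, explained = best_pair(rows, weights, alleles)
print(pair, explained)
```

An ILP solver reaches the same kind of answer without enumerating every combination, which matters once all loci and thousands of alleles are considered together.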
In a preferred embodiment, the sequencing data of the tumor control blood cell sample are aligned to known HLA alleles in the IMGT/HLA database using OptiType software to obtain the alignment matrix.
BWA-HLA is a rather dated method with poor results, and taking the intersection of the BWA-HLA and Polysolver results is likely to miss the correct HLA typing result. By contrast, the typing accuracy of OptiType is significantly higher than that of Polysolver. Specifically, HLA typing is performed using OptiType (v2.1.0) on the sequencing fastq files of the blood cell sample of the tumor patient to obtain the typing result.
The somatic and germline mutations may be obtained by existing methods. In a preferred embodiment, obtaining the somatic and germline mutations of the tumor tissue comprises: performing paired somatic mutation detection using the tumor control blood cell sample and the tumor tissue sample to obtain the somatic mutations of the tumor tissue; and performing germline mutation detection on the tumor control blood cell sample using GATK to obtain the germline mutations.
Specifically, paired somatic mutation detection is performed on the tumor tissue and blood cells of the tumor patient using the Mutect2 module of GATK (v4.0.5.1) to obtain a somatic mutation vcf file, and the mutations that pass the filter criteria are selected for subsequent use. Germline mutation detection is performed on the blood cells of the tumor patient using the HaplotypeCaller module of GATK (v4.0.5.1) to obtain a germline mutation vcf file, and the mutations that pass the filter criteria are selected for subsequent use. The filter criteria are typically a mutation frequency greater than 10% and a supporting read count greater than 100.
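By way of illustration only, the filter criteria stated above can be sketched as a simple predicate over already-parsed mutation records. The sketch does not read vcf files; the field names `vaf` and `supporting_reads`, the function name `passes_filter` and the example records are hypothetical.

```python
def passes_filter(record, min_vaf=0.10, min_reads=100):
    """Keep a mutation only if frequency > 10% and supporting reads > 100."""
    return record["vaf"] > min_vaf and record["supporting_reads"] > min_reads


# Made-up records, for illustration only.
records = [
    {"id": "m1", "vaf": 0.35, "supporting_reads": 240},  # passes both criteria
    {"id": "m2", "vaf": 0.08, "supporting_reads": 500},  # frequency too low
    {"id": "m3", "vaf": 0.40, "supporting_reads": 100},  # not strictly above 100
]
selected = [r["id"] for r in records if passes_filter(r)]
print(selected)
```

Both comparisons are strict, following the wording "greater than"; the thresholds are exposed as parameters because the text describes them as typical values rather than fixed ones.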
In a preferred embodiment of the present application, as shown in fig. 2, there is provided a method for detecting tumor immunity neoantigen polypeptide, comprising:
step A, using the fastq file of the sequencing result of the tumor control blood cell sample, aligning the reads in the fastq file to the known HLA class I and class II sequences in the IMGT/HLA database to obtain an alignment matrix;
step B, after merging, deleting and collating the alignment matrix, obtaining the HLA typing result with an optimization algorithm;
step C, performing paired somatic mutation detection using the tumor control blood cell sample and the cancer tissue sample to obtain the somatic mutations of the tumor tissue;
step D, performing germline mutation detection on the tumor control blood cell sample using GATK to obtain the germline mutations of the patient;
step E, merging the somatic-specific mutations and the germline mutations and annotating them with VEP to obtain the polypeptide fragments corresponding to the mutations;
step F, performing neoantigen prediction using the HLA typing result and the merged mutations to obtain an initial neoantigen list (namely the candidate neoantigens);
step G, scoring and ranking the predicted polypeptides in the initial neoantigen list using the mutation frequency and the affinity between the polypeptide fragments and MHC molecules to obtain a final, scored and ranked neoantigen list, and taking the neoantigen with the highest score as the neoantigen.
Example 2
Objective: to detect tumor neoantigens in a tumor patient (sample ID 180504502TT1).
The method comprises the following steps:
1. HLA typing was performed using OptiType (v2.1.0) on the sequencing fastq files of the blood cell sample of the tumor patient to obtain the typing result (see Table 1).
2. Paired somatic mutation detection was performed on the tumor tissue and blood cells of the tumor patient using the Mutect2 module of GATK (v4.0.5.1) to obtain a somatic mutation vcf file, and the mutations that passed the filter criteria were selected for subsequent use.
3. Germline mutation detection was performed on the blood cells of the tumor patient using the HaplotypeCaller module of GATK (v4.0.5.1) to obtain a germline mutation vcf file, and the mutations that passed the filter criteria were selected for subsequent use.
4. The somatic mutation file was merged with the germline mutation file using the CombineVariants module of GATK (v4.0.5.1).
5. The merged file was annotated with VEP (v94.5) to obtain, for each mutation, the gene, the transcript and the altered polypeptide fragment.
6. Neoantigen detection and score ranking were performed on the HLA typing result and the annotated mutation file using the pVACseq module of pVACtools (v1.3.4); the ranking results are shown in Table 2.
Table 1: HLA typing results
A | B | C
A*24:03 | B*46:01 | C*01:02
Table 2: ranking results of neoantigen scoring
As can be seen from table 2, the most highly scored neoantigen polypeptide is a mutated antigen sequence on chromosome 1.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) or a processor, and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above method, the present application also provides a device for detecting a tumor neoantigen polypeptide, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 3
In this embodiment, there is also provided a device for detecting a tumor neoantigen polypeptide. As shown in fig. 3, the device comprises: an acquisition module 10, an HLA typing module 20, a candidate neoantigen prediction module 30 and a neoantigen prediction module 40, wherein the acquisition module 10 is used for obtaining somatic mutations and germline mutations of the tumor tissue; the HLA typing module 20 is used for performing HLA typing using the sequencing data of the tumor control blood cell sample to obtain an HLA typing result; the candidate neoantigen prediction module 30 is used for performing neoantigen polypeptide prediction on the somatic and germline mutations using the HLA typing result to obtain candidate neoantigen polypeptides; and the neoantigen prediction module 40 is used for scoring and ranking the candidate neoantigen polypeptides and marking the polypeptide with the highest score as the neoantigen polypeptide.
The device predicts neoantigen polypeptides from the combined set of mutations from two sources, namely the somatic mutations and the germline mutations of the tumor tissue, so the mutation sources are more comprehensive and the prediction result is correspondingly more accurate. In addition, the device scores and ranks the predicted candidate neoantigen polypeptides and takes the candidate with the highest score as the neoantigen polypeptide, which makes it easier to obtain a more accurate neoantigen polypeptide and further improves its value in guiding subsequent immunotherapy medication.
Optionally, the device further comprises a mutation merging and annotation module, which is used for merging the somatic-specific mutations and the germline mutations and annotating them with VEP to obtain, for each mutation, the gene, the transcript and the altered polypeptide fragment.
Optionally, the candidate neoantigen prediction module comprises an MHC affinity testing module, which is used for performing MHC affinity tests on, and scoring, the polypeptide fragments altered by each of the somatic and germline mutations to obtain a sum score for each polypeptide fragment; wherein the MHC affinity test comprises: (1) using polypeptide fragments 9-11 amino acids in length; (2) using polypeptide fragments in which the mutation occupies a plurality of different positions; and (3) using a plurality of affinity test tools, the test tools comprising at least one of: MHCflurry, MHCnuggetsI, NNalign and NetMHC.
Optionally, the MHC affinity test further comprises at least one of: (4) using polypeptide fragments from a plurality of transcripts; (5) using a plurality of different HLA types.
Optionally, the neoantigen prediction module is used for ranking the candidate neoantigen polypeptides by the sum score of each polypeptide fragment, wherein the polypeptide with the highest score is the neoantigen polypeptide.
Optionally, the HLA typing module comprises an alignment unit, a merging and deleting unit and an HLA typing unit, wherein the alignment unit is used for aligning the sequencing data of the tumor control blood cell sample to known HLA alleles in the IMGT/HLA database to obtain an alignment matrix; the merging and deleting unit is used for merging, deleting and collating the alignment matrix to obtain a collated matrix; and the HLA typing unit is used for processing the collated matrix with an optimization algorithm to obtain the HLA typing result.
Optionally, the alignment matrix comprises a plurality of columns and a plurality of rows, the columns being all the HLA types and the rows being all the reads, and the merging and deleting unit comprises a merging subunit and a deleting subunit, wherein the merging subunit is used for merging identical rows (identical here meaning completely identical) to obtain a weight row (the weight row records the number of repeated reads); and the deleting subunit is used for, for any two columns a and b, deleting column a if column b contains all the reads of column a and also contains other reads not in column a.
Optionally, the HLA typing unit comprises an ILP optimization module, which is used for solving the collated matrix with an ILP (integer linear programming) optimization algorithm to obtain the HLA typing result.
Optionally, the alignment unit is OptiType software.
Optionally, the acquisition module comprises a somatic mutation acquisition unit and a germline mutation acquisition unit, wherein the somatic mutation acquisition unit is used for performing paired somatic mutation detection using the tumor control blood cell sample and the tumor tissue sample to obtain the somatic mutations of the tumor tissue; and the germline mutation acquisition unit is used for performing germline mutation detection on the tumor control blood cell sample using GATK to obtain the germline mutations.
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects: neoantigen polypeptides are predicted from the combined set of mutations from two sources, namely the somatic mutations and the germline mutations of the tumor tissue, so the mutation sources are more comprehensive and the prediction result is correspondingly more accurate. In addition, the predicted candidate neoantigen polypeptides are scored and ranked and the candidate with the highest score is taken as the neoantigen polypeptide, which makes it easier to obtain a more accurate neoantigen polypeptide and further improves its value in guiding subsequent immunotherapy medication.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.