Tumour neoantigen identification method based on transcript profile
Technical field
The present invention relates to a kind of biometric information authentication methods of tumor neogenetic antigen.
Background technique
Conventional genetic mutation detection technique includes Sanger sequencing, pyrosequencing, amplification retardance discrimination system at present
(ARMS), fluorescence was hybridization technique and allele specific pcr etc. originally.These universal flux of genetic test means are lower, and take
With height, time-consuming.
With the appearance of new-generation sequencing technology (Next-Generation Sequencing, NGS), extensive parallel survey
Sequence is possibly realized.Compared to first generation sequencing technologies, NGS technology detection speed is fast, and accuracy rate is high, and at low cost, coverage is wide.Benefit
The sequence of resurveying of Oncogenome sequence can be carried out with NGS technology, then (point is prominent to the different type variation of tumor-related gene
Become, insertion, missing etc.), copy number variation and gene associations and polymorphism identified.Tumour NGS detection at present can answer
For hereditary tumor screening and mutation identification.
In immunotherapy of tumors, the neoantigen (neoantigen) that identification tumor tissue cell generates is to determine that downstream is faced
The committed step of bed treatment.Normal cell is in Carcinogenesis, since inhereditary material changes, leads to its DNA sequence dna and its
His normal cell is variant, therefore can generate tumor associated antigen (expression of tumour cell height) or tumour specific antigen and (only exist
It is expressed in tumour cell).These antigens can theoretically be resisted due to having the specific epitope of mark tumour cell
Original then in conjunction with TCR, and then activates T cell in human leukocyte antigen (HLA) identification on delivery cell, starts immune anti-
It answers, is the potential target spot of immunotherapy of tumors.Pass through the analysis to patient's tumor tissues and normal tissue NGS data, Ke Yijian
Make mutation of the tissue, such as point mutation, insertion and deletion mutation, structure variation, Gene Fusion etc..It can from these catastrophic events
To predict variation that tumour cell occurs in downstream transcription and expression process, and then neoantigen that may be present is deduced, transcribed
Group sequencing data can more directly reflect the variation of inhereditary material, can not only identify gene for genome
Presentation situation of the horizontal mutation of group in central dogma, can also confirm the reliability of gene order-checking data analysis result, be
Tumor neogenetic Antigen Identification provides more foundations and evidence, and then guiding clinical treatment.
Summary of the invention
The purpose of the present invention is to provide one kind to identify individual specimen from tumor patient transcript profile NGS data
The method of tumour specific antigen.
Tumor antigen identification method based on transcript profile, comprising the following steps:
S1: obtaining the RNA sample of specimens, builds library and amplification to RNA sample, obtains the RNA sample of tumor tissues
This sequencing result;
S2: the short read of RNA sample sequencing result being compared to the mankind and refers to genome, obtains RNA comparison result;
S3, gene expression amount is calculated according to RNA comparison result, abrupt climatic change is carried out according to RNA comparison result, according to RNA
Comparison result predicts fusion event;Transcript profile HLA parting is predicted according to comparison result;Calculate gene expression amount, mutation inspection
It surveys and prediction fusion event carries out in a designated order, or synchronous progress;
S3A: calculating gene expression amount includes: the short read of introne in removal RNA comparison result is formed only by exon
The RNA comparison result of short read composition, calculates gene expression amount;
S3B: abrupt climatic change includes:
S3B1: from the mutation detected in RNA comparison result in RNA sample, remove in the full exon sample of normal tissue
The mutation having, somatic mutation of the remaining mutation as rna level;Functional annotation is carried out to the mutation of RNA;
S3B2: it calculates the depth in mutational site: for each mutational site, calculating current mutational site and surveyed in full exon
Depth in sequence sample, full exon sample include the full exon sample of tumor tissues and the full exon sample of normal tissue
This;
S3B3:HLA parting affinity prediction: according to the annotation of mutation as a result, corresponding for each catastrophic event editor
Transcript sequence, obtain one include all mutation polypeptide sequence, removing can be matched in wild type protein sequence
Peptide fragment forms the distinctive nascent polypeptide sequence of transcript profile, and the distinctive polypeptide sequence of transcript profile is combined needs according to HLA parting
Length is intercepted into small peptide, and it is pre- that the patient's HLA parting identified in small peptide and the RNA sample sequencing result of step S1 is done affinity
It surveys;
Wherein, the depth for calculating mutational site carries out in a designated order with the prediction of HLA parting affinity or synchronous progress;
S3C, prediction fusion event: predicting the fusion event of transcript profile according to the RNA comparison result of step S2,
After being screened out from it believable fusion, protein sequence is translated into, protein sequence is combined to the length needed according to HLA parting
Intercept into small peptide;The patient's HLA parting identified in the RNA sample sequencing result of small peptide and step S1 is done into affinity prediction;
S4: by the gene expression amount of transcript profile sample, depth of the transcript profile mutational site in complete outer sequencing sample is newborn
The binding force of small peptide and patient HLA parting gives downstream analysis personnel as analysis result.
Further, it in step S3A, calculates gene expression amount and comprises the steps of:
S3A1, removal build the duplicate short read sequence generated during library by PCR amplification;
S3A2, the short read compared to introne is removed, obtains the RNA comparison result being made of the short read of exon;
S3A3, each short read number of exon is calculated, with following formula scales at the expression quantity of gene:
Wherein total exon reads indicates to compare the total short read number for arriving some exon region, mapped reads
(millions) it indicates to compare the short read number for arriving the region in every 1000000 short reads, exon length (KB) indicates this
The length of section exon.
Further, after step S2 obtains gene expression amount, Quality Control is carried out to RNA comparison result, the content of Quality Control includes: whole
The coverage of a capture region, average sequencing depth, comparison rate, the specific gravity of repetitive sequence and unique specific gravity for comparing reads;If
Comparison result meets Quality Control requirement, then enters S3;If comparison result does not meet Quality Control requirement, sample is reacquired.
Further, step S3B2 include all mutation polypeptide sequence acquisition methods are as follows: according to transcript number from
Corresponding CDS Region Nucleotide sequence is found out in ensembl database, is spliced 3 ' terminal sequence of downstream (3 ' UTR), according to mutation
The nucleotide of corresponding position on nucleotide sequence is made modification by functional annotation, obtains the nucleotides sequence comprising all mutation
Column;That provide such as functional annotation is G100A, then the 100th bases G for splicing nucleotide sequence is revised as A;Institute will be included
There is the transcript nucleotide sequence of mutation to translate into polypeptide sequence.
Further, the method for small peptide is intercepted in step S3B and S3C are as follows:
SI, sequence obtain sporting in polypeptide sequence and work as premutation, centered on being currently mutated, intercept n ammonia forward
M amino acid acquisition polypeptide sequence of base acid and backward interception, the maximum length that n can present for HLA parting, m is HLA parting institute
The maximum length or m that can be presented are from the length when premutation to first terminator codon;
SII, successively intercepted length is the small peptide of N in polypeptide sequence, and N is that HLA parting combines the length needed;Obtain N
Item includes the small peptide when premutation.Such as: it is 8, N=8 that HLA parting, which combines the length needed,;It is cut forward since mutated site
It takes 7 amino acid, form the small peptide that a length is 8 amino acid with the mutated site;Intercept 6 forward since mutated site
A amino acid intercepts 1 amino acid backward, and it is the short of 8 amino acid that this 7 amino acid, which form the 2nd article of length with mutated site,
Peptide, in this way, 8 small peptides containing the mutation can be obtained altogether.
Further, step SI intercepts the rule of polypeptide sequence containing mutation sites are as follows:
A, for point mutation: centered on the position of mutation, intercepting n amino acid forward, intercept m amino backward
Acid, the longest peptide fragment that n=m=HLA parting can present, if when leading portion or back segment curtailment, how many cuts how many;If
Point mutation belongs to stop loss (terminator codon loss), then m is from the length when premutation to first terminator codon;
B, for non-frameshift mutation: non-frameshit insertion will intercept forward n amino from first amino acid of insetion sequence
Acid;M amino acid, the longest peptide fragment that n=m=HLA parting can present are intercepted backward from the last one amino acid of insetion sequence;
Centered on insetion sequence, n amino acid before insetion sequence, insetion sequence and m amino acid after insetion sequence and
Collectively constitute nascent polypeptide sequence;
Non- frameshift deletion then centered on deletion segment, respectively forwardly intercepts n amino acid, intercepts m amino acid backward,
The longest peptide fragment that n=m=HLA parting can present;
C, for frameshift mutation: centered on first amino acid for starting frameshift mutation, intercepting n amino acid, n forward
The longest peptide fragment that=HLA parting can present;M amino acid is intercepted backward, and m is from when premutation to first terminator codon
Length, i.e., intercept backward to first terminator codon.
The present invention has the advantages that
1. entire qualification process, since fastq file, user is without preparing other input files.
2. all authentication steps have corresponding Quality Control step, the accuracy of result is improved.
The case where 3. identification for mutational site is more comprehensive, not only considers individual cells, it is also considered that monolith tissue
Mutation distribution.
4. there is the confirmation of multiple groups qualification result.
5. committed step is all carried out there are many algorithm simultaneously, it can both be mutually authenticated, and also reduce the false negative of result.
6. committed step all optimizes algorithm, and introduces parallel computation, the analysis time of single sample is shortened.
Specific embodiment
Tumor antigen identification method based on transcript profile, comprising the following steps:
S1: obtaining the RNA sample of specimens, builds library and amplification to RNA sample, obtains the RNA sample of tumor tissues
This sequencing result.
S2: the short read of RNA sample sequencing result being compared to the mankind and refers to genome, obtains RNA comparison result.
S3, according to RNA comparison result and gene expression amount is calculated, abrupt climatic change is carried out according to RNA comparison result, and according to
RNA comparison result predicts fusion event;HLA parting is predicted according to comparison result;Calculate gene expression amount, abrupt climatic change and
Prediction fusion event carries out in a designated order, or synchronous progress.
S4: by the gene expression amount of transcript profile sample;Depth of the transcript profile mutational site in complete outer sequencing sample;It is newborn
The binding force of small peptide and patient HLA parting gives downstream analysis personnel as analysis result.
Calculate gene expression amount
The short read of introne in RNA comparison result is removed, is formed only by the RNA comparison result of the short read of exon, meter
Calculate gene expression amount.
Gene expression amount is calculated to comprise the steps of:
S3A1, removal build the duplicate short read sequence generated during library by PCR amplification;The library stage is built in sequencing, with
The nucleotides sequence that machine interrupts, which is listed in flowcell, can expand cluster, and the sequence of cluster is the same, referred to as repetitive sequence, main
If repetitive sequence is removed in this step in order to increase probability and confidence level that base is measured, capture sequencing when
It waits, if the repetitive sequence of target area does not remove, can not just know the true coverage in the region and depth, mutation is identified
Interference can be all generated with expression quantity calculating.
S3A2, the short read of introne that will will be transcribed on RNA using the split_N_cigar function in GATK kit
Segment excision obtains the RNA comparison result being only made of the short read of exon;Although what our sequencings obtained is mature rna,
Some short reads, which can compare, includes subregion, because short read only has the length of 150bp, there is a strong possibility can compare reality
On be not belonging to its place, it is therefore desirable to the short read that subregion is included in transcript profile identification is removed.
S3A3, the short read number that each gene region compares is calculated using HTseq, with following formula scales at gene
Expression quantity:
Wherein total exon reads indicates to compare the total short read number for arriving some exon region, mapped reads
(millions) it indicates to compare the short read number for arriving the region in every 1000000 short reads, exon length (KB) indicates this
The length of section exon.
Calculate mutational site depth
Depth of the catastrophe point in full exon sample is calculated to comprise the steps of:
S3B.1: from the mutation detected in RNA comparison result in RNA sample, remove in the full exon sample of normal tissue
The mutation having, somatic mutation of the remaining mutation as rna level;Functional annotation is carried out to the mutation of RNA;
S3B.2: it calculates the depth in mutational site: for each mutational site, calculating current mutational site in full exon
The depth in sample is sequenced, full exon sample includes the full exon sample of tumor tissues and the full exon sample of normal tissue
This, the tools such as Bam-recount can be used to calculate depth of the transcript profile mutational site in full exon sample, Bam-
Recount can only calculate the depth of SNV, and javakit can calculate the depth of indel, and the two result is incorporated as final knot
Fruit.Full exon sample includes the full exon sample of tumor tissues and the full exon sample of normal tissue.
HLA parting affinity predicts S3C.1: from the mutation detected in RNA sample in RNA comparison result, removing normal group
Identified mutation in the full exon sample knitted, mutation of the remaining mutation as rna level;Function is carried out to the mutation of RNA
It can annotation;
S3C.2: according to the annotation of mutation as a result, being directed to the corresponding transcript sequence of each catastrophic event editor, one is obtained
Item includes the nucleotide sequence of all mutation, then translates into polypeptide sequence, remove in polypeptide sequence wild-type protein can be with
The peptide fragment being matched to forms the distinctive nascent polypeptide sequence of transcript profile.By the distinctive polypeptide sequence of transcript profile according to HLA parting knot
The length interception needed is closed into small peptide, the patient's HLA parting identified in the RNA sample sequencing result of small peptide and step S1 is done into parent
It is predicted with power.
The preparation method of transcript nucleotide sequence are as follows: found out from ensembl database according to transcript number corresponding
CDS Region Nucleotide sequence, splice 3 ' terminal sequence of downstream, according to the functional annotation of mutation by corresponding position on nucleotide sequence
Nucleotide make modification, obtain one include all mutation nucleotide sequence.That provide such as functional annotation is G100A, then
The 100th bases G for splicing nucleotide sequence is revised as A;By the transcript nucleotide sequence translation comprising all mutation
At polypeptide sequence.
Using the HLA parting of multiple HLA Classification Identification tools identification transcript profile sample, such as SOAP-HLA, by each calculation
The result comprehensive consideration that method obtains is as final result.
Affinity prediction is carried out using multiple HLA parting affinity forecasting softwares respectively, is such as directed to HLA I type
(netMHC4.0 etc.), for HLA II type (netMHCII 2.2 etc.);In prediction result, retain the judgement of at least one software
To there is the result of affinity.
Predict fusion event
According to the RNA comparison result of step S2 predict transcript profile fusion event, by SOAPfuse (v1.27),
The tools such as STAR-Fusion are predicted.After being screened out from it believable fusion, protein sequence is translated into, by albumen sequence
Column combine the length needed to intercept into small peptide according to HLA parting;By what is identified in the RNA sample sequencing result of small peptide and step S1
Patient's HLA parting does affinity prediction;
After step S2 obtains gene expression amount, Quality Control is carried out to RNA comparison result, the content of Quality Control includes: entire capture
The coverage in region, average sequencing depth, comparison rate, the specific gravity of repetitive sequence and unique specific gravity for comparing reads;If comparing knot
Fruit meets Quality Control requirement, then enters S4;If comparison result does not meet Quality Control requirement, sample is reacquired.
Intercept small peptide
The method for intercepting small peptide are as follows:
SI, sequence obtain sporting in polypeptide sequence and work as premutation, centered on being currently mutated, intercept n ammonia forward
M amino acid acquisition polypeptide sequence of base acid and backward interception, the maximum length that n can present for HLA parting, m is HLA parting institute
The maximum length or m that can be presented are from the length when premutation to first terminator codon;
SII, in the polypeptide sequence small peptide that successively intercepted length is N, N is that HLA parting combines the length needed;Obtain N item
Small peptide comprising working as premutation.Such as: it is 8, N=8 that HLA parting, which combines the length needed,;Intercept 7 forward since mutated site
A amino acid forms the small peptide that a length is 8 amino acid with the mutated site;Intercept 6 forward since mutated site
Amino acid intercepts 1 amino acid backward, this 7 amino acid and mutated site form the small peptide that the 2nd article of length is 8 amino acid,
In this way, 8 small peptides containing the mutation can be obtained altogether.
For different types of mutation, the rule of interception small peptide are as follows:
A, for point mutation: centered on the position of mutation, intercepting n amino acid forward, intercept m amino backward
Acid, the longest peptide fragment that n=m=HLA parting can present, if when leading portion or back segment curtailment, how many cuts how many;If
Point mutation belongs to stop loss (terminator codon loss), then m is from the length when premutation to first terminator codon;
B, for non-frameshift mutation: non-frameshit insertion will intercept forward n amino from first amino acid of insetion sequence
Acid;M amino acid, the longest peptide fragment that n=m=HLA parting can present are intercepted backward from the last one amino acid of insetion sequence;
Centered on insetion sequence, n amino acid before insetion sequence, insetion sequence and m amino acid after insetion sequence and
Collectively constitute polypeptide sequence;
Non- frameshift deletion then centered on deletion segment, respectively forwardly intercepts n amino acid, intercepts m amino acid backward,
The longest peptide fragment that n=m=HLA parting can present;
C, for frameshift mutation: centered on first amino acid for starting frameshift mutation, intercepting n amino acid, n forward
The longest peptide fragment that=HLA parting can present;M amino acid is intercepted backward, and m is from when premutation to first terminator codon
Length, i.e., intercept backward to first terminator codon.
Specific embodiment is that invention is further explained, but of the invention is not limited to these specific embodiment parties
Formula.