CN110684830A - RNA analysis method for paraffin section tissue

Info

Publication number: CN110684830A
Application number: CN201910962113.2A
Authority: CN
Inventors: 黄毅; 易鑫; 吴玲清; 刘久成; 王长希; 李俊
Original assignee: Shenzhen Guiinga Medical Laboratory
Current assignee: Shenzhen Guiinga Medical Laboratory
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2020-01-14

Abstract

The invention provides an RNA analysis method for paraffin section tissue, comprising the following steps: degrading DNA in the paraffin section tissue and extracting sample RNA; preparing a nucleic acid library from the paraffin section sample and sequencing the sample RNA; performing quality control on the sample data obtained by sequencing; aligning the quality-controlled sample data to a reference genome and performing quality control on the alignment result; and performing transcriptome assembly and transcript quantification on the sample data that passed alignment quality control, followed by quantitative analysis of gene expression, gene differential expression analysis and fusion gene analysis. The invention provides a complete set of indices and a detection method for evaluating the RNA quality of paraffin section tissue, which allow the RNA to be evaluated comprehensively, give accurate and effective evaluation results, and provide a reliable reference for the accuracy of subsequent analysis.

Description

RNA analysis method for paraffin section tissue

Technical Field

The invention belongs to the field of second-generation high-throughput sequencing analysis, and particularly relates to a paraffin section tissue RNA analysis method.

Background

RNA sequencing (RNA-seq) is a sensitive and accurate method of quantifying gene expression. Second-generation high-throughput sequencing (next-generation sequencing, NGS) has opened a new era for RNA-seq transcriptome analysis. Designing a broadly applicable RNA-seq workflow involves the sequencing technology, the sample type, the analysis requirements for the genome, and the available computational resources. An analysis workflow is evaluated on its accuracy, computation speed and analysis cost.

The gene expression profile of a tumor sample is a powerful biomarker for prognosis and prediction. To date, transcriptomic profiling has been performed on a large number of frozen cancer tissue samples. However, because fresh frozen tumor tissue from clinical patients is difficult to collect and to store for long-term follow-up, formalin-fixed paraffin-embedded (FFPE) tissue is the more widely used biomaterial in the medical field. Genome-wide gene expression profiling of tumor samples is essential for cancer research and also facilitates extensive retrospective clinical genomic studies.

FFPE tissue undergoes fixation, paraffin embedding, sectioning and staining to prevent degradation of the cellular tissue, but these preparation steps and the subsequent storage have significant negative effects on DNA and RNA quality. FFPE samples generally show severe degradation, chemical modification, cross-linking between nucleic acids and proteins, and variability in tissue handling and processing. These molecular changes directly affect data quality: sample degradation leads to lower alignment quality, more soft-clipped sequences and more repeated sequences, and formaldehyde fixation causes random C-to-T conversions in nucleic acids, which makes nucleic acids isolated from FFPE tissue poorly compatible with downstream high-throughput molecular techniques.

Besides increasing the sequencing depth to compensate for nucleic acid degradation, a complete set of indices and a detection method for evaluating the RNA quality of paraffin tissue are urgently needed to ensure the reliability of subsequent analysis. A complete RNA analysis workflow for paraffin sections is also needed to study differences in gene expression levels of organisms in different environments or physiological states, so that the response mechanisms of the body can be understood and intracellular regulatory networks can be constructed. In addition, fusion genes, formed when all or part of two genes are joined in tandem through chromosomal translocation or reverse splicing, play an important role in studying the cause and development of many cancer types.

Disclosure of Invention

In order to solve the technical problems, the invention provides a paraffin section tissue RNA analysis method.

A method for analyzing RNA of paraffin section tissue, the method comprising the steps of:

carrying out DNA degradation on the paraffin section tissue, and extracting sample RNA;

preparing a paraffin section sample nucleic acid library and sequencing the sample RNA based on the library;

performing quality control on the sample data obtained by sequencing to remove rRNA data;

aligning the quality-controlled sample data to a reference genome, and performing quality control on the alignment result;

carrying out transcriptome assembly and transcript quantification on the sample data that passed alignment quality control, and carrying out quantitative analysis of gene expression;

performing gene differential expression analysis based on the transcript quantification results.

Further, the analysis method can also perform fusion gene analysis;

the fusion gene analysis is performed with one or more software tools selected from JAFFA, STAR-Fusion, TopHat-Fusion, FusionCatcher, or SOAPfuse.

Further, the preparation of the paraffin section sample nucleic acid library comprises the following steps:

extracting nucleic acid in the paraffin section sample;

synthesizing a single-stranded cDNA based on the nucleic acid;

synthesizing a double-stranded cDNA based on the single-stranded cDNA;

repairing the double-stranded cDNA ends;

ligating adapters to the double-stranded cDNA, and performing PCR amplification on the adapter-ligated DNA to obtain a nucleic acid library of the paraffin section sample.

Further, the quality control of the sample data obtained by sequencing comprises:

removing sequencing adapter sequences, low-quality sequences and sequences consisting of N bases;

screening the filtered data by the number of bases after adapter removal, the percentage of bases with quality greater than 20, the percentage of bases with quality greater than 30, GC content, N content, average read length, rRNA alignment rate and the number of reads after filtering;

and selecting the data and samples meeting the set threshold value for subsequent analysis.

Further, aligning the quality-controlled sample data to the reference genome comprises:

aligning the obtained sample sequences containing the nucleic acid sequence information to a reference genome;

obtaining the sample sequences aligned to the reference genome.

Further, when aligning the sample sequences containing the nucleic acid sequence information to the reference genome, one or more of TopHat, STAR, or HISAT2 is selected to align the sample sequences to the reference genome.

Further, the quality control of the alignment result comprises:

carrying out quality evaluation on the alignment result file of the paraffin section tissue;

removing duplicate sequences.

Further, the quality evaluation of the alignment result file of the paraffin section tissue comprises:

evaluating one or more of the duplication rate, alignment rate, unique alignment rate, exon alignment rate, intron alignment rate, intergenic region alignment rate, expression efficiency, number of detected transcripts, number of detected genes, or sequence coverage uniformity.

Further, the quantitative analysis of gene expression is performed with one or more software tools selected from RSEM, eXpress, HTSeq, Cufflinks, StringTie, Sailfish, Salmon, quasi-mapping, or Kallisto.

Further, the gene differential expression analysis is performed with one or more software tools selected from DESeq, limma, edgeR, Cuffdiff, Ballgown, DESeq2, or sleuth.

The invention provides a complete set of indices and a detection method for evaluating the RNA quality of paraffin section tissue. The method can be used to study differences in gene expression levels of organisms under different environments or physiological states, so that the response mechanisms of the organism can be understood and intracellular regulatory networks can be constructed. The invention can also carry out fusion gene analysis; detecting and analyzing fusion genes plays an important role in studying the cause and development of many cancer types.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 shows a flow chart of one embodiment of the RNA analysis method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in FIG. 1, a method for analyzing RNA of paraffin section tissue comprises the following steps: degrading DNA in the paraffin section tissue and extracting sample RNA; preparing a paraffin section sample nucleic acid library and sequencing the sample RNA based on the library; performing quality control on the sample data obtained by sequencing to remove rRNA data; aligning the quality-controlled sample data to a reference genome, and performing quality control on the alignment result; carrying out transcriptome assembly and transcript quantification on the sample data that passed alignment quality control, and carrying out quantitative analysis of gene expression; and performing gene differential expression analysis based on the transcript quantification result. Fusion gene analysis was also performed using the method described in this example. The software for each step and its result files are detailed in Table 1.

TABLE 1 result file presentation of paraffin section RNA analysis method

Sample preparation: 3 gastric cancer paraffin section tissues (Beijing Jiyin medical examination laboratory, sample numbers 199003859T, 199003855T and 199003848T) and 3 paracancerous paraffin section tissues from gastric cancer patients (Beijing Gionee plus medical laboratory, sample numbers 199003859N, 199003848N and 199003855N) were selected. DNA was degraded in nuclease-free, RNase-free water, and purified total RNA was obtained with an RNA extraction kit for FFPE samples (MagMAX™ FFPE DNA/RNA Ultra kit); rRNA was then removed with the Ribo-Zero™ ribosomal RNA (rRNA) removal kit.

Library construction and sequencing: the preparation of the nucleic acid library of the paraffin section sample comprises the following steps: extracting nucleic acid from the paraffin section sample; synthesizing single-stranded cDNA based on the nucleic acid; synthesizing double-stranded cDNA based on the single-stranded cDNA; repairing the ends of the double-stranded cDNA; ligating adapters to the double-stranded cDNA, and performing PCR amplification on the adapter-ligated DNA to obtain a nucleic acid library of the paraffin section sample. Specifically, a kit that gives high-quality library yields from only 10 ng to 1 μg of RNA (Ultra™ RNA library preparation kit) was used, and the DNA library was precisely quantified with a Qubit fluorometer in order to obtain high-quality sequencing results. The fragment length distribution of the DNA library was checked with an Agilent 2100 Bioanalyzer; the library showed a narrow peak at 300 bp. RNA sequencing (RNA-Seq) was performed on a second-generation high-throughput sequencing platform (Illumina HiSeq X Ten).

The quality control of the sample data obtained by sequencing comprises the following steps: removing sequencing adapter sequences, low-quality sequences and sequences consisting of N bases; screening the filtered data by the number of bases after adapter removal, the percentage of bases with quality greater than 20, the percentage of bases with quality greater than 30, GC content, N content, average read length, rRNA alignment rate and the number of reads after filtering; and selecting the data and samples meeting the set thresholds for subsequent analysis. Specifically, the fast software (a data quality control software) is used for quality control; the bowtie2 software (a software for aligning sequencing reads to reference sequences) is then used to align the quality-controlled data to the ribosomal RNA (rRNA) database of the National Center for Biotechnology Information (NCBI for short), and the reads that align to rRNA are removed.

Filtering standards for the quality control indices:

number of bases after adapter removal, Clean_Base (the clean data reads in the table are 150 bp in length), > 2500 Mb;

percentage of bases with quality greater than 20, Q20, > 90%;

percentage of bases with quality greater than 30, Q30, > 85%;

GC content > 40% and < 60%;

N content < 0.100%;

average read length > 120 bp and < 150 bp;

rRNA alignment rate < 40%;

number of reads after filtering (reads remaining after removing those that fail quality control and those from rRNA) > 4×10⁷.

The bowtie2 software was run with the following parameters: "--sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15".

Specifically, see Table 2: the rRNA percentage of the paracancerous tissue with sample number 199003855N is 85.37%, which is above the threshold, and only 31,165,304 reads (sequences generated by a high-throughput sequencing platform) remain after rRNA filtering, which is below the required number of filtered reads. The sample therefore does not meet the requirements of the subsequent analysis and requires resampling or rRNA depletion of the sample.
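The filtering standards above can be expressed as a small checker. The following is a minimal sketch only: the class and field names are hypothetical, and every metric value in the example except the rRNA rate (85.37%) and the filtered read count (31,165,304) reported for sample 199003855N is an invented placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadQC:
    """Per-sample read-level QC metrics (field names are hypothetical)."""
    clean_base_mb: float   # bases after adapter removal, in Mb
    q20_pct: float         # % of bases with quality > 20
    q30_pct: float         # % of bases with quality > 30
    gc_pct: float          # GC content, %
    n_pct: float           # N content, %
    mean_read_len: float   # average read length, bp
    rrna_rate_pct: float   # rRNA alignment rate, %
    filtered_reads: int    # reads left after QC and rRNA removal


def failed_read_qc(m: ReadQC) -> List[str]:
    """Return the names of the filtering standards that the sample misses."""
    checks = {
        "Clean_Base > 2500 Mb": m.clean_base_mb > 2500,
        "Q20 > 90%": m.q20_pct > 90,
        "Q30 > 85%": m.q30_pct > 85,
        "40% < GC < 60%": 40 < m.gc_pct < 60,
        "N content < 0.100%": m.n_pct < 0.100,
        "120 bp < mean read length < 150 bp": 120 < m.mean_read_len < 150,
        "rRNA alignment rate < 40%": m.rrna_rate_pct < 40,
        "filtered reads > 4e7": m.filtered_reads > 4e7,
    }
    return [name for name, ok in checks.items() if not ok]


# Only rrna_rate_pct and filtered_reads come from the text; the rest are placeholders.
sample_855n = ReadQC(
    clean_base_mb=6000, q20_pct=96.0, q30_pct=91.0, gc_pct=50.0,
    n_pct=0.01, mean_read_len=140.0,
    rrna_rate_pct=85.37, filtered_reads=31_165_304,
)
print(failed_read_qc(sample_855n))
```

Returning the list of missed standards, rather than a single pass/fail flag, makes it easier to decide between resampling and repeating rRNA depletion.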

Aligning the quality-controlled sample data to the reference genome comprises: aligning the obtained sample sequences containing the nucleic acid sequence information to a reference genome; and obtaining the sample sequences aligned to the reference genome. Specifically, HISAT2 (an RNA-Seq genome alignment tool) is used for alignment, with version 37 of the human genome sequence (Genome Reference Consortium Human Genome Build 37, GRCh37 for short) as the reference genome. A HISAT2 index must first be built for the reference genome. Default parameters are used for alignment, and the parameters of individual samples are adjusted based on the alignment quality control results of the next step. Preferably, this embodiment may use TopHat (a Bowtie-based RNA-Seq data analysis software) or STAR (Spliced Transcripts Alignment to a Reference, an RNA-Seq genome alignment tool) instead of HISAT2, or a combination of one or more of TopHat, STAR and HISAT2, to align the sample sequences to the reference genome.
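As an illustration of the index-then-align order described above, the sketch below only assembles the two command lines and does not run them. It assumes paired-end FASTQ input, which the text does not state; the file names, index prefix and thread count are hypothetical, and all other HISAT2 settings are left at their defaults.

```python
import shlex
from typing import List, Tuple


def hisat2_commands(ref_fasta: str, index_prefix: str,
                    fq1: str, fq2: str, out_sam: str,
                    threads: int = 4) -> Tuple[List[str], List[str]]:
    """Build the index-construction and alignment commands (paired-end assumed)."""
    # hisat2-build takes the reference FASTA and the prefix of the index to write.
    build = ["hisat2-build", ref_fasta, index_prefix]
    # Alignment with default scoring; -p only sets the number of threads.
    align = ["hisat2", "-p", str(threads), "-x", index_prefix,
             "-1", fq1, "-2", fq2, "-S", out_sam]
    return build, align


build_cmd, align_cmd = hisat2_commands(
    "GRCh37.fa", "GRCh37_hisat2",            # hypothetical paths
    "sample_R1.clean.fq.gz", "sample_R2.clean.fq.gz",
    "sample.sam",
)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed.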

The quality control of the alignment result comprises the following steps: carrying out quality evaluation on the alignment result file of the paraffin section tissue; and removing duplicate sequences. Specifically, RNA-SeQC (a software tool for quality control and expression evaluation of RNA-Seq data) is used for the analysis. An index must be built for the alignment result file with the command samtools index; an index must also be built for the reference genome GRCh37 with the command samtools faidx, and a dict index is created with CreateSequenceDictionary. The contig names in the BAM file, the reference genome and the genome GTF file must be consistent. Quality control of the alignment result makes it possible to evaluate the samples and remove unqualified ones, which improves the reliability of the analysis results.
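The contig-name requirement can be checked before running the quality evaluation. The sketch below is one possible way to do it, not the method of the patent: it works on text already read into memory (a SAM/BAM header such as the output of `samtools view -H`, the `.fai` index written by `samtools faidx`, and the GTF annotation), and the function names are hypothetical.

```python
from typing import Dict, List, Set


def sam_header_contigs(header: str) -> Set[str]:
    """Collect SN: names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def fai_contigs(fai: str) -> Set[str]:
    """The first column of a .fai index is the sequence name."""
    return {ln.split("\t")[0] for ln in fai.splitlines() if ln.strip()}


def gtf_contigs(gtf: str) -> Set[str]:
    """The first column of each non-comment GTF line is the sequence name."""
    return {ln.split("\t")[0] for ln in gtf.splitlines()
            if ln.strip() and not ln.startswith("#")}


def contig_mismatches(bam_header: str, fai: str, gtf: str) -> Dict[str, List[str]]:
    """Report contig names used by the BAM or GTF that the reference lacks."""
    ref = fai_contigs(fai)
    return {
        "bam_not_in_reference": sorted(sam_header_contigs(bam_header) - ref),
        "gtf_not_in_reference": sorted(gtf_contigs(gtf) - ref),
    }


# Toy inputs: the GTF uses a "chr" prefix that the reference does not.
header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n"
fai = "1\t249250621\t52\t60\t61\n"
gtf = "#!genome-build GRCh37\nchr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";\n"
print(contig_mismatches(header, fai, gtf))
```

A mismatch such as `chr1` versus `1` is a common cause of empty or failed downstream results, so an explicit report is more useful than a silent failure.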

The quality of the alignment result files of the paraffin section tissues is evaluated against the following thresholds: duplication rate (Duplication Rate) < 60%; alignment rate (Mapping Rate) > 85%; unique alignment rate (Mapped Unique Rate) > 50%; exon alignment rate > 50%; intron alignment rate < 40%; intergenic region alignment rate (Intergenic Rate) < 10%; expression efficiency > 45%; detected transcripts > 130000; detected genes > 2000; sequence coverage uniformity bias < 0.500%.

Table 2 shows the alignment and quality control results of the 6 samples in this example. The duplication rate of the paracancerous tissue with sample number 199003859N is 62.52%, which is above the threshold, so stricter parameters are needed in the subsequent duplicate removal step to bring the rate within the threshold range. The Mapping Rate of the paracancerous tissue with sample number 199003855N is below 85%, so looser alignment conditions are needed to raise the alignment rate. Any sample that misses a threshold cannot directly enter the subsequent expression analysis; it must either be resubmitted or, provided the required data volume is retained, have its quality control or alignment parameters adjusted until the data meet the standard.
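The alignment-level thresholds can be applied in the same table-driven way. This is a sketch under assumptions: the dictionary keys are hypothetical names rather than RNA-SeQC output column names, the coverage bias is compared in the same units as the stated threshold, and metrics missing from the input are simply skipped. The only value taken from the text is the 62.52% duplication rate of sample 199003859N; the mapping rate in the example is a placeholder.

```python
import operator
from typing import Dict, List

# (comparison, limit) per metric; key names are hypothetical.
ALIGNMENT_THRESHOLDS = {
    "duplication_rate_pct": (operator.lt, 60.0),
    "mapping_rate_pct": (operator.gt, 85.0),
    "mapped_unique_rate_pct": (operator.gt, 50.0),
    "exonic_rate_pct": (operator.gt, 50.0),
    "intronic_rate_pct": (operator.lt, 40.0),
    "intergenic_rate_pct": (operator.lt, 10.0),
    "expression_efficiency_pct": (operator.gt, 45.0),
    "transcripts_detected": (operator.gt, 130000),
    "genes_detected": (operator.gt, 2000),
    "coverage_bias": (operator.lt, 0.500),
}


def failed_alignment_qc(metrics: Dict[str, float]) -> List[str]:
    """Return the metrics that are present but miss their threshold."""
    return [
        name
        for name, (passes, limit) in ALIGNMENT_THRESHOLDS.items()
        if name in metrics and not passes(metrics[name], limit)
    ]


# Duplication rate from the text; mapping rate is an invented placeholder.
sample_859n = {"duplication_rate_pct": 62.52, "mapping_rate_pct": 90.0}
print(failed_alignment_qc(sample_859n))
```

Keeping the thresholds in one table means that loosening or tightening a single index, as the paragraph above suggests for individual samples, is a one-line change.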

TABLE 2 comparison of samples and quality control results

Removing duplicate sequences: PCR duplicates are removed with Picard (a software suite for manipulating high-throughput sequencing data and formats), because PCR amplification generates duplicate sequences that interfere with the true enrichment signal. Picard removes PCR duplicates only when the parameter REMOVE_DUPLICATES=true is added; otherwise the duplicates are only marked and not removed.
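The difference between marking and removing duplicates comes down to one argument. The sketch below builds a Picard MarkDuplicates command line without running it; the jar location and file names are hypothetical, and it assumes the classic `KEY=value` argument style of Picard.

```python
from typing import List


def picard_markduplicates_cmd(in_bam: str, out_bam: str, metrics_file: str,
                              remove: bool = True,
                              picard_jar: str = "picard.jar") -> List[str]:
    """Build a MarkDuplicates call; remove=False only flags duplicates."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={in_bam}",            # coordinate-sorted alignment file
        f"O={out_bam}",           # output with duplicates removed or marked
        f"M={metrics_file}",      # duplication metrics report
        f"REMOVE_DUPLICATES={'true' if remove else 'false'}",
    ]


cmd = picard_markduplicates_cmd("sample.sorted.bam", "sample.dedup.bam",
                                "sample.dup_metrics.txt")
print(" ".join(cmd))
```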

Transcriptome assembly and transcript quantification: StringTie (a transcriptome assembly and expression quantification software) is used. The duplicate-removed output file can serve as the input file only after the alignment result file has been sorted, and a reference genome annotation file is also required. The parameter "-m 200" is used; -m sets the minimum length allowed for a predicted transcript. When StringTie is run with the merge option, known transcripts and newly assembled transcripts are merged into a non-redundant set of transcripts. Preferably, this embodiment may instead select one or more of RSEM (RNA-Seq by Expectation-Maximization, an RNA-Seq data quantification software), eXpress (an RNA-Seq data quantification software), HTSeq (an RNA-Seq data analysis software), Cufflinks (an RNA-Seq transcriptome data assembly software), Sailfish (an RNA-Seq data rapid quantification software), Salmon (an RNA-Seq data quantification software), quasi-mapping (an alignment-free RNA-Seq data quantification method), or Kallisto (an RNA-Seq data rapid quantification software) for the quantitative analysis of gene expression.
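The two StringTie invocations described above (per-sample assembly with `-m 200`, then a merge across samples) can be sketched as command builders. The file names are hypothetical, and the commands are only assembled, not executed.

```python
from typing import List


def stringtie_assemble_cmd(sorted_bam: str, annotation_gtf: str,
                           out_gtf: str, min_len: int = 200) -> List[str]:
    """Per-sample assembly guided by a reference annotation (-G)."""
    return ["stringtie", sorted_bam,
            "-G", annotation_gtf,
            "-m", str(min_len),    # minimum length of a predicted transcript
            "-o", out_gtf]


def stringtie_merge_cmd(gtf_list_file: str, annotation_gtf: str,
                        merged_gtf: str) -> List[str]:
    """Merge per-sample assemblies into one non-redundant transcript set."""
    # gtf_list_file is a text file listing one per-sample GTF path per line.
    return ["stringtie", "--merge",
            "-G", annotation_gtf,
            "-o", merged_gtf,
            gtf_list_file]


assemble = stringtie_assemble_cmd("sample.dedup.sorted.bam",
                                  "GRCh37.annotation.gtf", "sample.gtf")
merge = stringtie_merge_cmd("gtf_list.txt", "GRCh37.annotation.gtf", "merged.gtf")
print(" ".join(assemble))
print(" ".join(merge))
```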

Differential expression analysis: the transcript quantification results of the previous step are used as the input file for this step, and differential expression analysis is performed with DESeq2 (an RNA-Seq differential expression analysis software based on read counts). In the sample specification, cancer tissue samples are separated by commas; paracancerous tissue samples are likewise separated by commas; and the cancer tissue group and the paracancerous tissue group are separated by a space. Preferably, this embodiment may instead select one or more of DESeq (an RNA-Seq differential expression analysis software based on read counts), limma (an RNA-Seq differential expression analysis software based on read counts), edgeR (an RNA-Seq differential expression analysis software based on read counts), Cuffdiff (an assembly-based RNA-Seq differential expression analysis software), Ballgown (an assembly-based RNA-Seq differential expression analysis software), or sleuth (an alignment-free RNA-Seq differential expression quantitative analysis software) for gene differential expression analysis.
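The sample specification format (commas within a group, a space between groups) can be parsed into a design table before it is handed to the differential expression tool. The sketch below assumes that the cancer group is listed first, which the text does not state; the function name and the condition labels are hypothetical.

```python
from typing import List, Tuple


def parse_groups(spec: str) -> List[Tuple[str, str]]:
    """Turn 'T1,T2 N1,N2' into (sample, condition) rows."""
    groups = spec.split()
    if len(groups) != 2:
        raise ValueError("expected two space-separated groups")
    # Assumption: first group is cancer tissue, second is paracancerous tissue.
    labels = ("cancer", "paracancerous")
    rows = []
    for label, group in zip(labels, groups):
        for sample in group.split(","):
            if not sample:
                raise ValueError("empty sample name in group: " + group)
            rows.append((sample, label))
    return rows


design = parse_groups(
    "199003859T,199003855T,199003848T 199003859N,199003848N,199003855N"
)
for sample, condition in design:
    print(sample, condition, sep="\t")
```

The resulting rows correspond to the sample-to-condition table that count-based tools such as DESeq2 expect alongside the count matrix.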

Fusion gene analysis: fusion genes are predicted with FusionCatcher (a fusion gene analysis software). The -d parameter specifies the directory containing the reference genome data of the species, the -i parameter specifies the directory containing the raw sequencing fastq files of the sample, and the -o parameter specifies the directory to which the results are written. For human, the developers provide a database built on Ensembl release 90. Preferably, this embodiment may instead select one or more of JAFFA (a gene fusion analysis software based on comparing the transcriptome with a reference transcriptome), STAR-Fusion (a software for identifying fusion genes based on STAR alignment), TopHat-Fusion (a software for identifying fusion genes using RNA-Seq data), or SOAPfuse (an open software for detecting fusion transcripts genome-wide in human RNA-Seq data) for fusion gene analysis.
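The three parameters named above map directly onto a command line. The sketch below only assembles the call; the executable name `fusioncatcher` and all directory paths are assumptions for illustration and may differ between installations.

```python
from typing import List


def fusioncatcher_cmd(data_dir: str, fastq_dir: str, out_dir: str) -> List[str]:
    """Build a FusionCatcher call from the three directories in the text."""
    return [
        "fusioncatcher",      # assumed executable name
        "-d", data_dir,       # reference data built for the species
        "-i", fastq_dir,      # raw fastq files of the sample
        "-o", out_dir,        # where results are written
    ]


fc_cmd = fusioncatcher_cmd("fusioncatcher_data/human",   # hypothetical paths
                           "fastq/199003859T",
                           "fusion/199003859T")
print(" ".join(fc_cmd))
```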

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A paraffin section tissue RNA analysis method is characterized by comprising the following steps:

2. The method for RNA analysis of paraffin section tissue according to claim 1, wherein the method further comprises performing fusion gene analysis;

3. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein the preparation of the paraffin section sample nucleic acid library comprises the steps of:

extracting nucleic acid in the paraffin section sample;

synthesizing a single-stranded cDNA based on the nucleic acid;

synthesizing a double-stranded cDNA based on the single-stranded cDNA;

repairing the double-stranded cDNA ends;

4. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein the quality control of the sample data obtained by sequencing comprises:

5. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein aligning the quality-controlled sample data to the reference genome comprises:

obtaining the sample sequences aligned to the reference genome.

6. The method for RNA analysis of paraffin section tissue according to claim 5, wherein, when the sample sequences containing nucleic acid sequence information are aligned to the reference genome, one or more of TopHat, STAR or HISAT2 is selected to align the sample sequences to the reference genome.

7. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein the quality control of the alignment result comprises:

removing duplicate sequences.

8. The method for RNA analysis of paraffin section tissue according to claim 7, wherein the quality evaluation of the alignment result file of the paraffin section tissue comprises:

9. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein the quantitative analysis of gene expression is performed with one or more software tools selected from RSEM, eXpress, HTSeq, Cufflinks, StringTie, Sailfish, Salmon, quasi-mapping and Kallisto.

10. The method for RNA analysis of paraffin section tissue according to claim 1 or 2, wherein the gene differential expression analysis is performed with one or more software tools selected from DESeq, limma, edgeR, Cuffdiff, Ballgown, DESeq2, or sleuth.