Method and device for processing sequencing data
Technical field
The present invention relates to the field of sequencing data processing, and in particular to a method and a device for processing sequencing data.
Background art
Chromosomal abnormalities may be numerical or structural. Numerical abnormalities include trisomy (one extra chromosome), monosomy (one missing chromosome) and polyploidy (one or more extra complete sets of chromosomes). Structural abnormalities include structural rearrangements caused by chromosome breakage and the like, such as translocations, inversions, deletions and insertions.
Numerical chromosomal abnormalities, such as aneuploidy and polyploidy, are associated with many diseases, including birth defects and cancer. China has nearly 20,000,000 newborns each year, of whom about 4% to 6% have birth defects. Fetal chromosomal abnormality is one of the most common types of birth defect seen clinically: about 1 in every 160 newborns is a patient with a chromosomal abnormality. Trisomy syndromes have the highest incidence among chromosomal disorders. When a cell carries three copies of a certain chromosome instead of the normal two, so that the total chromosome number is 47, a trisomy syndrome may result. The most common trisomy syndromes are trisomy 21 syndrome (T21), Edwards syndrome (T18) and Patau syndrome (T13). To reduce the proportion of babies born with birth defects, rapid and accurate detection of chromosomal aneuploidy is necessary.
Non-invasive methods such as ultrasound scanning or biochemical marker screening have been used to assess the risk of chromosomal abnormality, but their accuracy is relatively low, only 60-80%, and they are affected by physiological factors such as the age at pregnancy. Conventional prenatal diagnosis requires invasive methods such as amniocentesis or chorionic villus sampling, which carry a risk of miscarriage and have a long detection cycle. In 1997, circulating cell-free fetal DNA was discovered in maternal plasma (Lo YM, Corbetta N, Chamberlain PF, Rai V, Sargent IL, Redman CW, Wainscoat JS. Presence of fetal DNA in maternal plasma and serum. Lancet. 1997 Aug 16;350(9076):485-7). In 1999, the concentration of circulating fetal DNA in the plasma of women carrying a fetus with trisomy 21 was found to be significantly higher than that in the plasma of women carrying a euploid fetus (Lo, Y.M.D. et al., Clin Chem 45:1747-1751 (1999); Zhong, X.Y. et al., Prenat Diagn 20:795-798 (2000)). These findings opened new possibilities for non-invasive prenatal diagnosis. On this basis, the field of non-invasive prenatal testing has made considerable progress, for example: enriching fetal DNA with methylation-sensitive enzymes to reduce maternal background interference (PCT/US2004/033175, 2004.10.08); screening for trisomy 21 by comparing the Ct values of gene-specific fragments by PCR (CN200610003103.9, 2006.02.10); and inferring fetal chromosomal aneuploidy from allelic amplification based on RNA-SNP detection (CN200680007354.2, 2006.03.17). However, enrichment of fetal DNA is time-consuming and laborious, and amplification techniques require sequence specificity or gene heterozygosity, which makes it difficult for them to become general-purpose techniques.
In 2008, Rossa W.K. Chiu et al. proposed that sequencing can obtain a large amount of information about the nucleic acid molecules in peripheral blood (Rossa W.K. Chiu, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. PNAS, 2008, 105:20458-20463), and found that, in a sample in which a clinically relevant chromosome is abnormal, the ratio parameter between the amount of nucleic acid molecules from the abnormal clinically relevant chromosome and the amount of nucleic acid molecules from the background chromosomes differs from the parameter of one or more normal control values established from normal samples. Methods based on high-throughput sequencing can therefore be used to detect chromosomal abnormalities without relying on the amplification of specific sequences. However, existing genome sequencing detection methods need to compare the sample to be tested with multiple samples or standard normal samples, which is time-consuming and requires a large amount of sample (for example, the Chinese patent application No. CN200880108377.1), and they place strict requirements on the consistency of experimental conditions across sample batches, which limits their convenient and high-throughput application.
Therefore, existing methods for processing sequencing data still need to be improved in order to increase the accuracy of data processing.
Summary of the invention
The main object of the present invention is to provide a method and a device for processing sequencing data, so as to improve the accuracy of sequencing data processing.
To achieve the above object, according to one aspect of the present invention, a method for processing sequencing data is provided. The processing method comprises: obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes derived from a maternal peripheral blood sample; dividing a reference genome into multiple specific regions such that the number of non-repetitive sequences in each specific region, NRSc, is equal; assigning the nucleotide sequence information of all chromosomes derived from the maternal peripheral blood sample to the multiple specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean of the NRSs' values of all specific regions on the target chromosome and on the comparison chromosome, denoted the first mean and the second mean respectively; and performing a difference test on the first mean and the second mean, and determining from the result of the difference test whether the chromosome is aneuploid.
Further, the step of correcting the NRSs value of the sample in each specific region using the GC content comprises: correcting the NRSs value of the sample in each specific region using the correction formula NRSs' = NRSs × α, wherein the correction factor α is calculated from the median of the NRSs values of all specific regions and from NRSs'', and NRSs'' is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample.
Further, before the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample, the processing method further comprises a step of removing specific regions with abnormal NRSs values from all specific regions of the sample; preferably, the specific regions with abnormal NRSs values are removed by linear fitting or local polynomial regression fitting.
Further, the NRSc value is any integer from 10000 to 50000.
Further, the target chromosome is selected from any one or a combination of several of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome; the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12; preferably, the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
To achieve the above object, according to another aspect of the present invention, a device for processing sequencing data is provided. The processing device comprises: a sequencing module for obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes derived from a maternal peripheral blood sample; a specific region division module for dividing a reference genome into multiple specific regions according to the principle of equal NRSc values; a distribution module for assigning, on the basis of sequence alignment against the reference genome, the nucleotide sequence information of all chromosomes derived from the maternal peripheral blood sample to the multiple specific regions of the reference genome; a first statistical module for counting the NRSs value of the sample in each specific region; a correction module for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; a second statistical module for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the comparison chromosome, denoted the first mean and the second mean respectively; a test module for performing a difference test on the first mean and the second mean; and a determination module for determining from the result of the difference test whether the chromosome is aneuploid.
Further, the correction module comprises: a first calculation unit for calculating the median of the NRSs values of all specific regions; a fitting unit for performing polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain a fitted curve; an acquisition unit for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second calculation unit for calculating the correction factor α from the median and the fitted value NRSs''; and a correction unit for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
Further, the fitting unit further comprises a filter subunit which, before the fitting unit performs the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample; preferably, the filter subunit is a linear fitting subunit or a local polynomial regression fitting subunit.
Further, the NRSc value is any integer from 10000 to 50000.
Further, the target chromosome is selected from any one or a combination of several of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome; the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12; preferably, the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
By applying the technical solution of the present invention, specific regions are divided, on the basis of the sequencing data, according to the principle of an equal number of non-repetitive sequences. This avoids the data fluctuation caused by uneven numbers of non-repetitive sequences across specific regions and thereby improves the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameters of other, non-clinically-relevant chromosome regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection, does not require standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, rapid and accurate detection means: its accuracy for autosome detection is greater than 99%, and its false positive rate is less than 1%.
Description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows the distribution of the non-repetitive sequences from the sequencing reads of sample S001 (a negative sample) across the specific regions of the genome, according to preferred embodiment 1 of the present invention;
Fig. 2 shows the distribution of Fig. 1 after outliers have been filtered out, i.e. the distribution of the non-repetitive sequences from the sequencing reads of sample S001 across the specific regions of the genome;
Fig. 3 shows the spline curve fitted to the outlier-filtered distribution of Fig. 2, i.e. to the non-repetitive sequences from the sequencing reads of sample S001 across the specific regions of the genome;
Fig. 4a and Fig. 4b show the number of non-repetitive sequences in the specific regions of each autosome of sample S001 in embodiment 1 before and after correction, respectively; Fig. 4a shows the values before correction and Fig. 4b the values after correction;
Fig. 5a and Fig. 5b show the number of non-repetitive sequences in the specific regions of each autosome of sample S002 in another preferred embodiment before and after correction, respectively; Fig. 5a shows the values before correction and Fig. 5b the values after correction;
Fig. 6a and Fig. 6b show the number of non-repetitive sequences in the specific regions of each autosome of sample S007 in yet another preferred embodiment before and after correction, respectively; Fig. 6a shows the values before correction and Fig. 6b the values after correction;
Fig. 7a and Fig. 7b show the number of non-repetitive sequences in the specific regions of each autosome of sample S006 in still another preferred embodiment before and after correction, respectively; Fig. 7a shows the values before correction and Fig. 7b the values after correction;
Fig. 8a, Fig. 8b and Fig. 8c show the Z value distributions of chromosome 13, chromosome 18 and chromosome 21, respectively, in the 384 online data samples of embodiment 2 of this application; Fig. 8a shows chromosome 13, Fig. 8b shows chromosome 18, and Fig. 8c shows chromosome 21.
Detailed description of the invention
It should be noted that, provided there is no conflict, the embodiments of this application and the features of those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Explanation of terms:
Sequencing data: the nucleotide sequence information obtained from the sample to be tested by high-throughput sequencing.
Kmer: the nucleotide sequences of length k obtained by cutting a sequence with a window that advances one base at a time. For example, for the sequence ATCGTTGCTTAATGACGTCAGTCGAAT, a 13-mer analysis yields the k-mers ATCGTTGCTTAAT, TCGTTGCTTAATG, CGTTGCTTAATGA, GTTGCTTAATGAC, ....
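The k-mer definition above can be illustrated with a minimal Python sketch. The function name kmers is chosen here for illustration only and is not part of the application; the example sequence and k = 13 are the ones given in the definition.

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`, sliding one base at a time."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length L yields L - k + 1 k-mers (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


example = "ATCGTTGCTTAATGACGTCAGTCGAAT"
first_four = kmers(example, 13)[:4]
```

The first four 13-mers of the example sequence are the four listed in the definition.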
Non-repetitive sequence (non-repeated sequence, NRS for short): the sequences obtained by sequencing the sample to be tested are compared with the normal human genome, and the k-mers that are unique at the whole-genome level are the non-repetitive sequences. In this application, when specific regions are divided so as to contain equal numbers of non-repetitive sequences, the division is made on the reference genome sequence. The number of non-repetitive sequences in each specific region obtained by the division is therefore denoted NRSc, and the actual number of non-repetitive sequences from the sequencing reads of the sample to be tested in each such specific region is denoted NRSs.
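As a rough illustration of how NRSs might be tallied once the specific regions are known, the sketch below counts sample positions per region. It assumes, purely for illustration, that each specific region is represented by its start coordinate on one chromosome and that each non-repetitive sequence of the sample has already been reduced to one aligned coordinate; the name count_nrs_per_region and this data layout are hypothetical and are not taken from the application.

```python
from bisect import bisect_right


def count_nrs_per_region(region_starts, sample_positions):
    """Count the sample's non-repetitive sequences (NRSs) in each region.

    region_starts: sorted start coordinates; region i covers
        [region_starts[i], region_starts[i + 1]), and the last region is
        open-ended (an assumption of this sketch).
    sample_positions: aligned coordinates of the sample's non-repetitive sequences.
    """
    counts = [0] * len(region_starts)
    for pos in sample_positions:
        idx = bisect_right(region_starts, pos) - 1
        if idx >= 0:  # positions before the first region are ignored
            counts[idx] += 1
    return counts
```

Binary search keeps the assignment at O(log R) per sequence for R regions, which matters when millions of reads are assigned.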
Specific region (specified region, SR for short): a specific region on a chromosome of the genome, obtained by the specific region division method described in the present invention.
Chromosome: may refer either to a whole chromosome or to a part of a chromosome. The mathematical derivation for a single chromosome fragment is the same as that for all chromosome fragments, and those skilled in the art know how to adapt it. A comparison chromosome is a chromosome from a healthy individual or a chromosome presumed to be normal, including one presumed normal on statistical grounds; it may be a single chromosome or a chromosome group (two or more chromosomes), that is, chromosomes other than 13, 18, 21, X and Y, or any combination thereof.
"Aneuploidy" and "polyploidy" are conditions in which the number of chromosomes in a cell differs from the usual haploid number n or diploid number 2n. An aneuploid cell may be trisomic, i.e. carry three copies of one chromosome, or monosomic, i.e. carry a single copy of one chromosome. Chromosomal aneuploidy changes the representation of the affected chromosome. Using a next-generation sequencing (NGS) platform combined with bioinformatic analysis, the representation of each chromosome can be tallied from the sequencing alignment results to determine whether the sample to be tested is aneuploid for that chromosome.
The sample is a cell, tissue or body fluid selected from: maternal whole blood (peripheral blood), plasma, serum, urine, saliva and reproductive tract lavage fluid; fetal cells or fetal cell residues, and pre-implantation embryo biopsy material; amniotic fluid, chorionic villus samples, and the like. The sample may come from any animal, preferably a mammal, more preferably a human.
Sequencing of the DNA sequencing library may produce paired-end short sequences, single-end long sequences or single-end short sequences. A paired-end short sequence consists of a sequence shorter than 50 bp immediately adjacent to the 5' end linker primer and a sequence shorter than 50 bp immediately adjacent to the 3' end linker primer. Preferably, a paired-end short sequence consists of a sequence of no more than 36 bp immediately adjacent to the 5' end linker primer and a sequence of no more than 36 bp immediately adjacent to the 3' end linker primer.
A single-end short sequence is a sequence shorter than 50 bp immediately adjacent to the 5' end linker primer or a sequence shorter than 50 bp immediately adjacent to the 3' end linker primer. Preferably, a single-end short sequence is a sequence of no more than 36 bp immediately adjacent to the 5' end linker primer or a sequence of no more than 36 bp immediately adjacent to the 3' end linker primer. A single-end long sequence is a sequence longer than 99 bp immediately adjacent to the 5' end linker primer or a sequence longer than 99 bp immediately adjacent to the 3' end linker primer. Paired-end sequencing means sequencing the sequences at both ends of a fragment separately. Single-end sequencing means sequencing the sequence at one end of a fragment.
Existing methods for detecting chromosomal aneuploidy still have shortcomings in accuracy and convenience. To improve this situation, a typical embodiment of this application provides a method for processing sequencing data. The processing method comprises: obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes derived from a maternal peripheral blood sample; dividing a reference genome into multiple specific regions such that the number of non-repetitive sequences in each specific region (denoted NRSc) is equal; assigning the nucleotide sequence information of all chromosomes derived from the maternal peripheral blood sample to the multiple specific regions of the reference genome, and counting the NRSs value of the sample in each specific region; correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; based on the NRSs' values, calculating the mean of the NRSs' values of all specific regions on the target chromosome and on the comparison chromosome, denoted the first mean and the second mean respectively; and performing a difference test on the first mean and the second mean, and determining from the result of the difference test whether the chromosome is aneuploid.
In the above processing method of this application, specific regions are divided, on the basis of the sequencing data, according to the principle of an equal number of non-repetitive sequences. This avoids the data fluctuation caused by uneven numbers of non-repetitive sequences across specific regions and thereby improves the correlation of nucleic acid data parameters between chromosomes. The parameter of the clinically relevant chromosome in the biological sample is compared with the parameters of other, non-clinically-relevant chromosome regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection, does not require standard normal samples, removes the dependence on experimental conditions and speeds up analysis. It is a simple, rapid and accurate detection means: its accuracy for autosome detection is greater than 99%, and its false positive rate is less than 1%.
Specifically, the method for above-mentioned test of difference can be existing various test of difference, such as, Z test (Z-test), u-test or t inspection etc..The preferred Z test of the application.
In the above processing method, the step of correcting the NRSs value of the sample in each specific region using the GC content may use an existing GC correction method, which also improves detection accuracy. To make detection more accurate still, in a preferred embodiment of this application the correction method comprises: correcting the NRSs value of the sample in each specific region using the correction formula NRSs' = NRSs × α, wherein the correction factor α is calculated from the median of the NRSs values of all specific regions and from NRSs'', and NRSs'' is the fitted value obtained by polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample. The corrected NRSs' values follow a normal distribution more closely, which makes the result of the subsequent difference test more accurate.
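The correction NRSs' = NRSs × α can be sketched as follows. The expression for α did not survive in this text, so the sketch assumes the common GC-normalisation form α = median / NRSs'' (the median of all NRSs values divided by the fitted value of the region); that form is an assumption of this example and is not confirmed by the application. The function name gc_correct is likewise hypothetical, and the fitted values are taken as an input rather than computed.

```python
from statistics import median


def gc_correct(nrs_values, fitted_values):
    """Apply NRSs' = NRSs * alpha to every specific region.

    ASSUMPTION: alpha = median(NRSs) / NRSs'' for each region, where NRSs''
    is the value predicted for that region by the GC-content fit.
    """
    if len(nrs_values) != len(fitted_values):
        raise ValueError("one fitted value is needed per region")
    centre = median(nrs_values)
    corrected = []
    for observed, fitted in zip(nrs_values, fitted_values):
        if fitted <= 0:
            raise ValueError("fitted values must be positive")
        alpha = centre / fitted  # assumed form of the correction factor
        corrected.append(observed * alpha)
    return corrected
```

Under this assumed form, a region whose count is fully explained by its GC content (observed equals fitted) is mapped onto the median, so only the deviation not explained by GC content remains after correction.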
Fitting takes a set of known discrete points {f1, f2, ..., fn} (here the X and Y coordinates are the GC content and the NRSs value) and adjusts a number of undetermined coefficients (λ1, λ2, ..., λn) of a fitting function f so that the difference between the function and the known point set is minimal in the least-squares sense. The known points $(x_i, Y_i)$, with $x_1 < x_2 < \dots < x_n$ and $i \in \mathbb{Z}$, are a series of observations described by a relation of the form $Y_i = \mu(x_i)$; a fitting function $\hat{\mu}$ is constructed so that its deviation from the observations $Y_i$ at the points $x_i$ is minimal in the least-squares sense. If the fitting function is nonlinear, the fit is called a nonlinear fit, also referred to as a spline fit; accordingly, if the fitting function is a polynomial, the fit may be called a polynomial spline fit. Polynomial spline fitting is preferred in the present invention, the spline curve being a smooth cubic curve.
A cubic spline curve through n data points has n-1 intervals, and the equation on each interval is $f_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3$, so 4(n-1) unknown coefficients must be determined. Continuity at the nodes, equal first derivatives and equal second derivatives give 4n-6 equations, and 2 boundary conditions are then added by hand. The spline fit is performed with the function smooth.spline of the R software system (http://www.stat.wisc.edu~xie/smooth_spline tutorial.html).
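To make the coefficient count above concrete, the following Python sketch solves for the 4(n-1) coefficients of an interpolating cubic spline, taking the two extra boundary conditions to be the "natural" ones (second derivative zero at both ends). This is an illustrative stand-in, not the procedure of the application: the application uses R's smooth.spline, which fits a smoothing spline rather than an interpolating one. The function names are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return (a, b, c, d) per interval for f_i(x) = a + b*t + c*t^2 + d*t^3, t = x - x_i.

    xs must be strictly increasing. The two added boundary conditions are
    f''(x_first) = f''(x_last) = 0 (natural spline; an assumption of this sketch).
    """
    n = len(xs) - 1  # number of intervals
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")
    # Right-hand side of the tridiagonal system for the c coefficients.
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        rhs[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward elimination (Thomas algorithm).
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (rhs[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution; c[0] and c[n] stay 0 (natural boundary conditions).
    c = [0.0] * (n + 1)
    b = [0.0] * n
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(ys[j], b[j], c[j], d[j]) for j in range(n)]


def evaluate_spline(coeffs, xs, x):
    """Evaluate the piecewise cubic at x, clamping to the first or last interval."""
    j = max(0, min(len(coeffs) - 1, bisect_right(xs, x) - 1))
    a, b, c, d = coeffs[j]
    t = x - xs[j]
    return a + b * t + c * t * t + d * t ** 3
```

For n points the function returns n-1 coefficient tuples, i.e. the 4(n-1) unknowns counted in the text. A smoothing spline adds a roughness penalty on top of this piecewise-cubic form so that the curve need not pass through every point.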
Before the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample, the above processing method further comprises a step of removing specific regions with abnormal NRSs values from all specific regions of the sample. Outliers may be removed by GC linear fitting or by manual screening, for example by deleting windows whose GC value is 0, whose number of non-repetitive sequences is 0, or whose number of non-repetitive sequences is excessively high. In this application, the specific regions with abnormal NRSs values are preferably removed by local polynomial regression fitting, which helps exclude those specific regions in which the number of non-repetitive sequences is too high or too low because of chromosome structural peculiarities. Linear fitting may also be used. Such fitting methods are routine outlier-removal methods in statistics and bioinformatics, and their details are not repeated here.
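The manual screening option mentioned above (dropping windows with GC value 0, count 0 or an excessive count) could look like the sketch below. It is not the preferred local polynomial regression filter. The text does not say what counts as "excessively high", so the cutoff used here, a multiple max_ratio of the median count, is a hypothetical choice, as are the function name and the (gc_content, nrs_count) data layout.

```python
from statistics import median


def drop_abnormal_regions(regions, max_ratio=3.0):
    """Filter a list of (gc_content, nrs_count) pairs.

    Removes regions with GC content 0 or count 0, then regions whose count
    exceeds max_ratio times the median of the remaining counts.
    ASSUMPTION: the max_ratio cutoff is illustrative, not from the source.
    """
    nonzero = [(gc, nrs) for gc, nrs in regions if gc > 0 and nrs > 0]
    if not nonzero:
        return []
    typical = median(nrs for _, nrs in nonzero)
    return [(gc, nrs) for gc, nrs in nonzero if nrs <= max_ratio * typical]
```

Using the median rather than the mean as the reference keeps the cutoff itself from being dragged upward by the very outliers it is meant to remove.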
In the above processing method, specific regions are divided according to the principle of equal NRSc values; the specific NRSc value may be determined according to factors such as the genome size and sequence complexity of the sample to be tested. Preferably, the NRSc value is any integer from 10000 to 50000.
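A minimal sketch of dividing one reference chromosome into specific regions of equal NRSc is given below. It assumes the positions of the unique k-mers on the reference chromosome are already known and sorted; the function name, the half-open interval convention and the choice to discard a trailing group with fewer than NRSc k-mers are assumptions of this example rather than details stated in the application.

```python
def divide_by_equal_nrs(unique_positions, nrs_c):
    """Split a chromosome into regions that each hold exactly nrs_c unique k-mers.

    unique_positions: sorted start coordinates of the unique k-mers on one
        reference chromosome.
    Returns half-open (start, end) intervals. A trailing group with fewer
    than nrs_c k-mers is dropped (ASSUMPTION of this sketch).
    """
    if nrs_c <= 0:
        raise ValueError("nrs_c must be a positive integer")
    regions = []
    for i in range(0, len(unique_positions) - nrs_c + 1, nrs_c):
        chunk = unique_positions[i:i + nrs_c]
        regions.append((chunk[0], chunk[-1] + 1))
    return regions
```

For example, with unique k-mer positions [3, 4, 10, 50, 51, 52, 90] and NRSc = 3, the sketch produces the regions (3, 11) and (50, 53). Regions defined this way vary in genomic length but not in the number of unique k-mers, which is the property the division principle relies on.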
In the above processing method, the target chromosome and the comparison chromosome may be chosen reasonably according to the tissue or cell source of the sample to be tested, the species, or the actual detection requirements. When the sample to be tested is human, the target chromosome is preferably selected from any one or a combination of several of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome; the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12; more preferably, the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
Another typical embodiment of this application provides a device for processing sequencing data. The processing device comprises: a sequencing module for obtaining, by high-throughput sequencing, the nucleotide sequence information of all chromosomes derived from a maternal peripheral blood sample; a specific region division module for dividing a reference genome into multiple specific regions according to the principle of equal NRSc values; a distribution module for assigning, on the basis of sequence alignment against the reference genome, the nucleotide sequence information of all chromosomes derived from the maternal peripheral blood sample to the multiple specific regions of the reference genome; a first statistical module for counting the NRSs value of the sample in each specific region; a correction module for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted NRSs'; a second statistical module for calculating, based on the NRSs' values, the mean of the NRSs' values of all specific regions on the target chromosome and on the comparison chromosome, denoted the first mean and the second mean respectively; a test module for performing a difference test on the first mean and the second mean; and a determination module for determining from the result of the difference test whether the chromosome is aneuploid.
The above detection device takes the sequencing data obtained by the sequencing module as its basis and uses the improved specific region division module to divide specific regions according to the principle of an equal number of non-repetitive sequences, which improves the correlation of nucleic acid data parameters between chromosomes. The distribution module, first statistical module, correction module, second statistical module and test module are then executed in turn, so that the parameter of the clinically relevant chromosome in the biological sample is compared with the parameters of other, non-clinically-relevant chromosome regions, and the determination module finally determines from the difference test result of the test module whether chromosomal aneuploidy is present in the sample to be tested. The device achieves detection from the sequencing data of a single sample, does not require standard normal samples, removes the dependence on experimental conditions, speeds up analysis and generally improves the assessment of chromosomal abnormalities. It is a simple, rapid and accurate device for detecting chromosomal aneuploidy: its accuracy for autosome detection is greater than 99%, and its false positive rate is less than 1%.
Specifically, the test module may be any existing difference test module, for example a Z test (Z-test) module, a u test module or a t test module. The Z test module is preferred in this application.
Using an existing GC correction module as the correction module also improves detection accuracy. In a preferred embodiment of this application, the correction module comprises: a first calculation unit for calculating the median of the NRSs values of all specific regions; a fitting unit for performing polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain a fitted curve; an acquisition unit for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second calculation unit for calculating the correction factor α from the median and the fitted value NRSs''; and a correction unit for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
In the above preferred embodiment, the fitting unit based on polynomial spline fitting has the advantage of high fitting accuracy and therefore yields more accurate fitted values. Correspondingly, the correction factor calculated by the second calculation unit is more accurate, so the correction unit can correct the NRSs value of the sample to be tested in each specific region more accurately, i.e. obtain NRSs' values of higher accuracy.
In the above processing device, the fitting unit further comprises a filter subunit which, before the fitting unit performs the polynomial spline fitting of the GC content against the NRSs value of each specific region of the sample to obtain the fitted curve, removes specific regions with abnormal NRSs values from all specific regions of the sample. This further improves the accuracy of the fitting unit during polynomial spline fitting. Preferably, the filter subunit uses a conventional linear fitting subunit or local polynomial regression fitting subunit to filter outliers.
Preferably, in the above processing device, the NRSc value is any integer from 10000 to 50000.
In the above processing device, the target chromosome and the comparison chromosome may be chosen reasonably according to the tissue or cell source of the sample to be tested, the species, or the actual detection requirements. When the sample to be tested is human, the target chromosome is preferably selected from any one or a combination of several of the following: chromosome 13, chromosome 18, chromosome 21, the X chromosome and the Y chromosome; the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12; more preferably, the comparison chromosome is selected from any one or a combination of several of the following: chromosomes 1, 2, 3, 6, 7, 11, 12 and 16.
The above method and device of the present application can be combined with other known methods, devices or compositions, preferably ones that can improve chromosomal abnormality detection, for example mathematical model analysis of maternal biochemical markers.
The method provided herein has the advantages of high throughput, low cost, simplicity, accuracy and high sensitivity. Existing methods need to compare the sample to be tested with multiple normal or standard samples, which is time-consuming and requires a large amount of sample. The present application achieves single-sample detection without relying on standard normal samples, avoids dependence on experimental conditions, speeds up analysis and improves detection accuracy.
The scheme provided by the present application combines DNA sequencing with bioinformatic analysis and judges whether a chromosome is abnormal by a test of difference such as the Z test. If the Z value lies outside ±4.5 (Z ≥ 4.5 or Z ≤ -4.5), chromosomal aneuploidy can be determined to be present. The chromosomal abnormality is preferably trisomy 21, trisomy 13, or an abnormality of chromosome 18, the X chromosome or the Y chromosome.
The method of the present application is particularly suitable for detecting abnormalities of chromosome number, preferably chromosomal aneuploidy, and more preferably autosomal aneuploidy.
The beneficial effects of the present application are further illustrated below with reference to specific embodiments.
Embodiment 1: Processing method for the sequencing data of a sample to be tested
(1) High-throughput sequencing of the cell-free DNA fragments in the maternal blood of the sample to be tested
(1) Collect whole blood from the pregnant woman and obtain plasma by pretreatment;
After informed consent was obtained, 5-10 ml of blood was drawn by venipuncture from a woman at 22 weeks of pregnancy (sample S001 in Table 2 below) into an ethylenediaminetetraacetic acid (EDTA) tube. The blood sample was centrifuged at high speed to obtain a plasma sample free of blood cells; the plasma volume of each sample was about 700 µl.
(2) Extract the plasma DNA;
The DNA in the plasma was extracted with the HiPure Circulating DNA Kits DNA extraction kit produced by Magen (product number D3180-02).
(3) Prepare the DNA extracted from plasma into a library that can be sequenced on a next-generation high-throughput sequencing platform
The plasma DNA was end-repaired and A-tailed using T4 DNA polymerase, T4 PNK and Klenow enzyme, and adapters were ligated using T4 DNA ligase and sequencing adapters. PCR was then performed with indexed library primers, and the product was purified and size-selected with magnetic beads to give a sequencing library ready for loading.
(4) Perform DNA sequencing on the prepared library
The sequencing library was amplified on an Illumina cBot instrument so that the single-end DNA sequencing library formed DNA clusters, and a large number of sequences with a read length of 36 bp were obtained.
(2) Determine the sequence information of the DNA fragments in plasma
1. Divide the normal human reference genome into specific regions and compile statistics
(1) Screen non-repetitive sequences
The human reference genome (hg19, GRCh37, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) was cut into a large set of k-mers of length 35 bp with an offset of 1 bp. From this set the unique k-mers, i.e. the sequences that are non-repetitive across the whole genome, were selected, and their positional coordinates were recorded.
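As an illustration only, the screening step above may be sketched in Python as follows. The function name unique_kmers and the in-memory dictionary of chromosome sequences are hypothetical, and the sketch assumes that only the forward strand is considered and that k-mers containing N are skipped; neither detail is stated in this description. A whole human genome would require a disk-based or indexed k-mer counter rather than a single Counter.

```python
from collections import Counter


def unique_kmers(chromosomes, k=35):
    """Return {kmer: (chrom, start)} for k-mers seen exactly once genome-wide.

    chromosomes: dict mapping chromosome name -> sequence string (assumed input).
    Positions are 0-based start coordinates on the forward strand.
    """
    counts = Counter()
    first_pos = {}
    for chrom, seq in chromosomes.items():
        seq = seq.upper()
        # Slide a window of length k along the sequence with an offset of 1 bp.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" in kmer:  # assumption: skip windows that touch unknown bases
                continue
            counts[kmer] += 1
            first_pos.setdefault(kmer, (chrom, start))
    # Keep only the k-mers that occur once across all chromosomes.
    return {kmer: pos for kmer, pos in first_pos.items() if counts[kmer] == 1}
```

With k = 35 the returned coordinates correspond to the recorded positions of the non-repetitive sequences; a small k is convenient for checking the logic on toy sequences.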
(2) Divide specific regions
Starting from the first non-repetitive sequence on chromosome 1, record the start position and accumulate non-repetitive sequences until 20000 have been counted, then record the end position. This span is defined as the first specific region on chromosome 1. The specific regions do not overlap one another.
Repeat the above steps for chromosome 1 through the Y chromosome to obtain the positional information and GC content of the specific regions on all chromosomes. (The division of the normal human reference genome into specific regions needs to be performed only once; each subsequent sample to be tested is processed according to the specific regions already divided on the reference genome.)
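The division step above can be sketched as follows, as a minimal example under stated assumptions. The names divide_regions and gc_content are hypothetical. The sketch assumes that a region is bounded by the start coordinates of its first and last non-repetitive sequences, and that a trailing group with fewer than NRSc sequences is discarded; this description does not say how the end coordinate is defined or how such a remainder is handled.

```python
def divide_regions(unique_starts, nrsc=20000):
    """Split one chromosome into regions holding exactly nrsc unique k-mers each.

    unique_starts: sorted start coordinates of the non-repetitive sequences
    on a single chromosome. Returns a list of (start, end) tuples.
    """
    regions = []
    # Step through the sorted coordinates in blocks of nrsc sequences.
    for i in range(0, len(unique_starts) - nrsc + 1, nrsc):
        block = unique_starts[i:i + nrsc]
        # Bounding by start coordinates keeps consecutive regions non-overlapping.
        regions.append((block[0], block[-1]))
    return regions


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Because every region holds the same number of non-repetitive sequences, region length varies along the chromosome while the expected read count per region stays comparable.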
(3) Specific region statistics
Count the number of specific regions on each chromosome and the GC content distribution of all non-repetitive sequences within each region.
2. Sample DNA sequence alignment
Using the sequence alignment software BWA (Burrows-Wheeler Aligner), the sequenced DNA reads were aligned to the normal human reference genome (hg19, GRCh37) with zero tolerance (exact matches only, no base mismatches allowed). This determines the detailed position of every sequenced read on the genome, including its chromosome of origin, its coordinate on the chromosome and its distribution among the genome-specific regions. For sample S001 in Table 2, the distribution of non-repetitive sequences from the sequenced reads in each specific region of the genome is shown in Fig. 1.
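Once the reads have been aligned, assigning them to specific regions amounts to an interval lookup. The following is a minimal sketch, not the implementation used in this embodiment: count_nrss is a hypothetical name, the input is assumed to be a list of (chromosome, position) pairs already extracted from the alignment output, and a read is counted when its mapped start position falls inside a region's bounds.

```python
from bisect import bisect_right


def count_nrss(regions, read_positions):
    """Count aligned reads per specific region.

    regions: {chrom: sorted list of non-overlapping (start, end) tuples}.
    read_positions: iterable of (chrom, pos) for exactly matching reads.
    Returns {chrom: [count for each region, in order]}.
    """
    starts = {c: [s for s, _ in regs] for c, regs in regions.items()}
    counts = {c: [0] * len(regs) for c, regs in regions.items()}
    for chrom, pos in read_positions:
        if chrom not in regions:
            continue
        # Index of the last region whose start is <= pos.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos <= regions[chrom][i][1]:
            counts[chrom][i] += 1
    return counts
```

The binary search keeps the lookup logarithmic in the number of regions per chromosome, which matters when millions of reads are assigned.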
(3) Determine the expression level of the chromosome to be tested
1. Filter outliers
The GC content of each genome-specific region of the sample to be tested and the number of non-repetitive sequences in that region (NRSs) were fitted by local polynomial regression using the loess function (a linear fit may also be used). Regions whose NRSs lay more than 3 standard deviations above or below the fitted value (p < 0.005) were defined as outliers. The distribution after outlier filtering is shown in Fig. 2.
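The outlier filter above can be illustrated with the linear-fit alternative that the text allows. This sketch deliberately replaces the loess fit with an ordinary least-squares line so that it needs only the Python standard library. The name filter_outliers is hypothetical, and taking the standard deviation over the residuals of the fit is an assumption about how the 3-standard-deviation band is measured.

```python
from statistics import mean, pstdev


def filter_outliers(gc, nrss, n_sd=3.0):
    """Return the indices of regions kept after outlier filtering.

    Fits nrss = a + b * gc by least squares, then drops regions whose
    residual lies more than n_sd standard deviations from the fitted line.
    """
    gx, gy = mean(gc), mean(nrss)
    sxx = sum((x - gx) ** 2 for x in gc)
    sxy = sum((x - gx) * (y - gy) for x, y in zip(gc, nrss))
    slope = sxy / sxx if sxx else 0.0
    intercept = gy - slope * gx
    # Residual of each region relative to the fitted value at its GC content.
    resid = [y - (intercept + slope * x) for x, y in zip(gc, nrss)]
    sd = pstdev(resid)
    return [i for i, r in enumerate(resid) if abs(r) <= n_sd * sd]
```

The returned indices can be used to subset the GC and NRSs lists before the spline fit of the next step.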
2. Weighted correction
After all specific regions of the genome of the sample to be tested were grouped by GC content, spline curve fitting was performed to obtain the fitted NRSs value for each GC content, denoted NRSs''; its distribution is shown in Fig. 3.
Specifically, the one-dimensional median of NRSs is taken as the baseline, and the fitted value NRSs'' is compared with the baseline value to obtain the correction factor α. The corrected value is then calculated as follows:
NRSs'=NRSs × α (2)
The above calculation is carried out for each specific region of the genome of the sample to be tested, where the baseline is the one-dimensional median of the NRS counts over all specific regions of the genome, NRSs'' is the fitted value, and NRSs' is the corrected number of non-repetitive sequences.
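The correction can be sketched as follows, with two assumptions that should be read as such. First, the expression for α is not legible in this text; the sketch assumes α = median / NRSs'', the ratio that flattens GC-dependent variation when multiplied into NRSs. Second, the spline fit is replaced here by a simpler stand-in, the median NRSs within each GC bin, so that the example needs only the standard library. The names gc_correct and bin_width are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(gc, nrss, bin_width=0.01):
    """Return GC-corrected counts NRSs' = NRSs * alpha for every region.

    alpha = baseline / fitted (assumed form), where baseline is the median
    of all NRSs and fitted is the per-GC-bin median standing in for NRSs''.
    """
    baseline = median(nrss)
    bins = defaultdict(list)
    for g, n in zip(gc, nrss):
        bins[round(g / bin_width)].append(n)
    # Stand-in for the fitted curve: one fitted value per GC bin.
    fitted = {b: median(values) for b, values in bins.items()}
    corrected = []
    for g, n in zip(gc, nrss):
        f = fitted[round(g / bin_width)]
        alpha = baseline / f if f else 1.0
        corrected.append(n * alpha)
    return corrected
```

After correction, regions that differ only in GC content are brought to a common level, while variation within a GC bin is preserved.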
As can be seen from the before-correction and after-correction plots in Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b, and Fig. 7a and Fig. 7b, the uncorrected data fluctuate more, and testing inter-chromosome differences directly on them is more likely to produce false-negative or false-positive results. After correction, the distribution of non-repetitive sequence counts in the specific regions of each chromosome becomes stable, differences in the data are more pronounced and outliers are easier to identify. This shows that the method of the present application can eliminate differences in GC composition and avoid the GC bias problem, so it can be used to detect chromosomal aneuploidy and reduces the occurrence of false-negative results. In Fig. 7a and Fig. 7b, the NRS count for chr21 is clearly higher than that of the other autosomes, and the corresponding detection result is that this sample has a high risk of aneuploidy of chromosome 21.
(4) Z test to judge whether chromosomal expression differs significantly
Using the GC-corrected values NRSs', the mean NRSs' over all specific regions of the target chromosome (chr21, chr18, chr13, X or Y) is compared for difference with the mean NRSs' over all specific regions of the control chromosomes (chr1, chr2 ... chr12) to obtain the test value Z (Z-score), and whether the target chromosome is aneuploid is judged from the Z value. When Z-score ≥ 4.5, the result is high risk of trisomy; when Z-score ≤ -4.5, the result is high risk of monosomy; when -4.5 < Z-score < 4.5, the result is low risk of aneuploidy.
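The decision rule above can be sketched as follows. This description does not give the exact Z statistic, so the form used here is an assumption: the difference between the target mean and the control mean, divided by the control standard deviation scaled by the square root of the number of target regions. The names z_score and classify are hypothetical; only the ±4.5 cutoff comes from the text.

```python
from math import sqrt
from statistics import mean, stdev


def z_score(target, control):
    """Z value comparing mean NRSs' of target regions with control regions.

    Assumed form: (mean_target - mean_control) / (sd_control / sqrt(n_target)).
    """
    standard_error = stdev(control) / sqrt(len(target))
    return (mean(target) - mean(control)) / standard_error


def classify(z, cutoff=4.5):
    """Map a Z value to the risk call described in the text."""
    if z >= cutoff:
        return "high risk: trisomy"
    if z <= -cutoff:
        return "high risk: monosomy"
    return "low risk"
```

In use, target would hold the corrected counts of all specific regions on, for example, chr21, and control the corrected counts of all specific regions on the chosen control chromosomes.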
Alternatively, the control chromosomes are selected according to the distribution of housekeeping genes and include chr1, chr2, chr3, chr6, chr7, chr11, chr12 and chr16.
Embodiment 2: Validity evaluation
(1) Evaluation using online data samples
The steps of the processing method shown in Embodiment 1 can be implemented on a computing device in the form of modules or units. To evaluate the validity of the method of Embodiment 1, the tests below use a processing device formed of modules or units capable of performing the above steps. The processing device includes:
A sequencing module, for obtaining by high-throughput sequencing the nucleic acid sequence information of all chromosomes derived from a maternal peripheral blood sample;
Optionally, this module includes the module that performs the sequencing function in the Illumina cBot instrument together with a matching Illumina sequencer such as the Genome Analyzer, HiSeq2000/2500, Hiseq3000/4000 or NextseqCN500, or in a sequencer from Life Technologies such as the SOLiD.
A specific-region division module, which calls the specific-region division program to divide the reference genome into multiple specific regions on the principle of equal NRSc values. The division unit may be any integer number of non-repetitive sequences from 10000 to 50000 (preferably 20000). This overcomes the defect of specific regions divided by fixed length, such as 20 kb or 50 kb, in which the number of non-repetitive sequences differs greatly between regions and data homogeneity is poor.
An allocation module, which aligns the output of the sequencing module to the reference genome and allocates the nucleic acid sequence information of all chromosomes derived from the maternal peripheral blood sample to the specific regions generated by the specific-region division module;
Optionally, a module capable of performing sequence alignment, such as a BWA module, a BOWTIE module or a NOVOALIGN module, is used to allocate the sequencing data of the sample to be tested;
A first statistics module, for counting the NRSs value of the sample in each specific region; optionally, the statistics module is a SAMTOOLS module;
A correction module, for correcting the NRSs value of the sample in each specific region using the GC content, the corrected value being denoted the NRSs' value;
Preferably, the correction module includes: a first calculation unit, for calculating the one-dimensional median of the NRSs values of all specific regions; a fitting unit, for performing a polynomial spline fit of the GC content of each specific region of the sample against the NRSs value to obtain a fitted curve; an acquisition unit, for obtaining the fitted value NRSs'' of each specific region from the fitted curve; a second calculation unit, for calculating the correction factor α from the median and the fitted value NRSs''; and a correction unit, for correcting the NRSs value of the sample in each specific region according to the correction formula NRSs' = NRSs × α.
A second statistics module, for computing, based on the NRSs' values, the mean NRSs' over all specific regions of the target chromosome and of the control chromosomes, denoted the first mean and the second mean respectively;
A test module, for performing a test of difference between the first mean and the second mean; optionally, a Z test module is used for the difference analysis;
A determination module, for determining from the result of the test of difference whether the chromosome is aneuploid;
Preferably, when the target chromosome is an autosome and -4.5 ≤ Z value ≤ 4.5, the determination module determines that the target chromosome is not aneuploid; otherwise, it determines that aneuploidy is present.
Data from different laboratories and different NGS platforms were used as samples to further illustrate the validity and generality of the processing device of the present application. These are high-throughput sequencing data of maternal blood from non-invasive prenatal genetic testing clinical research projects, uploaded by other institutions and downloaded from the NCBI SRA database (http://www.ncbi.nlm.nih.gov/sra/), comprising 384 samples.
The detection results for chromosomes 21, 18 and 13 in these 384 samples are shown in Table 1 below:
Table 1. Positive detection results for the 384 NCBI online data samples.
Note: in Table 1, "chr" denotes chromosome; "gc" denotes GC content; "ZV" denotes the Z value; "TEST" denotes the chromosomal aneuploidy detection result obtained by this method.
As can be seen from Table 1 and Fig. 8a, Fig. 8b and Fig. 8c, 1 T13-positive sample was detected (SRR358477), and the chromosome 13 Z values of the remaining samples were all stably distributed within the interval (-4.5, 4.5). 5 T18-positive samples were detected (SRR357943, SRR357972, SRR358089, SRR358257, SRR358325), and the chromosome 18 Z values of the remaining samples were all stably distributed within the interval (-4.5, 4.5). 7 T21-positive samples were detected (SRR357843, SRR358020, SRR358126, SRR358144, SRR358322, SRR358352, SRR358353), and the chromosome 21 Z values of the remaining samples were all stably distributed within the interval (-4.5, 4.5).
(2) Evaluation using blood samples
The above processing method was evaluated using the detection results of 68 samples (provided by Ministry of Public Health visiting center and Beijing people hospital). The evaluation results are shown in Table 2, which lists only the first 30 samples; the results were verified by karyotyping.
Table 2. Blood sample detection results
Note: The uncorrected mean is the mean NRS over all specific regions of the target chromosome before correction. The corrected mean is the mean NRS over all specific regions of that chromosome after the weighted correction for specific-region GC content. ZVchri (i = 13, 18, 21) is the Z value obtained for that chromosome by significance analysis of its difference from the control chromosomes. TEST is the chromosomal aneuploidy detection result obtained by this method; N (Negative) means that the result is negative and no obvious abnormality was detected, and T13/T18/T21 means that the result shows aneuploidy of the target chromosome. Karyotype is the clinical karyotyping result, i.e. the gold-standard result: 46,XN denotes the chromosome number and sex chromosomes of a sample with a normal karyotype, and 47,XN,+21 denotes a sample whose karyotype shows 47 chromosomes, with one more chromosome 21 than the normal karyotype, i.e. Down syndrome.
The data in Table 2 show that, according to the significance test, the ZVchr13 values of samples S0002 and S0013 are both greater than or equal to 4.5, so a high risk of chromosome 13 aneuploidy is determined; the ZVchr18 values of samples S0007 and S0012 are both greater than or equal to 4.5, so a high risk of chromosome 18 aneuploidy is determined; and the ZVchr21 values of samples S0003, S0006 and S0011 are all greater than or equal to 4.5, so a high risk of chromosome 21 aneuploidy is determined. For chromosomes 21, 18 and 13, the detection results of the present application are all consistent with the karyotype analysis, and the samples judged low risk by this method, i.e. those with ZV values between -4.5 and 4.5, also have normal karyotypes. This shows that the method has high accuracy in detecting chromosomal aneuploidy.
Embodiment 3: Stability and data volume study
(1) Sample stability
Using the above method, the four samples s002, s006, s007 and s008 (karyotype results: T13 positive, T21 positive, T18 positive and normal, respectively) were each sequenced 8 times, and the fluctuation of the relative chromosome expression (denoted CR) and of the Z test value (denoted ZV) was recorded to evaluate the stability of the detection method. The evaluation results are shown in Table 3.
Table 3. Summary of repeatability data for s002, s006, s007 and s008
In Table 3, Mean denotes the mean, SD the standard deviation and CV the coefficient of variation. Table 3 shows that, across the 8 repeated detections of the 4 samples, the CV of the CR values is below 0.01 in every case and the fluctuation (SD) of ZV is within ±1.1 in every case. The data fluctuate little, indicating that the method is stable.
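The summary statistics reported in Table 3 can be computed as in the following sketch. The name repeatability is hypothetical, the sample standard deviation (n - 1 denominator) is an assumption since the text does not say which estimator was used, and any numbers passed in are for illustration and are not values from Table 3.

```python
from statistics import mean, stdev


def repeatability(values):
    """Return (mean, SD, CV) for repeated measurements of one sample.

    CV is the coefficient of variation, SD divided by the absolute mean.
    """
    m = mean(values)
    sd = stdev(values)  # sample standard deviation, assumed estimator
    cv = sd / abs(m) if m else float("inf")
    return m, sd, cv
```

Applied to the 8 repeated CR values of a sample, a CV below 0.01 corresponds to the stability criterion stated above.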
(2) Data volume study
The fluctuation of the NRS counts in the genome-specific regions was studied with the sequencing data volume ranging from 0.25M to 15M raw reads. For the sequencing data of the four samples s002, s006, s007 and s008 (karyotype results: T13 positive, T21 positive, T18 positive and normal, respectively), data volumes from 2M to 15M were extracted at random and used for genome alignment and for computing ZV and CV (here CV is the coefficient of variation of the number of non-repetitive sequences over all specific regions of the genome for the sample). The results are shown in Table 4.
Table 4. CV values (coefficient of variation) and ZV values (Z values) for different sequencing data volumes
Table 4 shows that the method is suitable for chromosome detection over a wide range of data volumes; in particular, when the data volume is 1M or more, both the stability of the data and the results of the Z test are good.
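The random extraction of a fixed number of reads used in this study can be illustrated as follows. This is a minimal sketch: downsample is a hypothetical name, the reads are assumed to be held in a list in memory, and the fixed seed is added only to make the draw reproducible.

```python
import random


def downsample(reads, n, seed=0):
    """Return n reads drawn at random without replacement.

    reads: list of read records (any objects). The seed makes the
    subsample reproducible across runs.
    """
    if n > len(reads):
        raise ValueError("requested more reads than are available")
    rng = random.Random(seed)
    return rng.sample(reads, n)
```

Repeating the draw at several values of n and re-running the alignment, correction and Z test on each subsample gives the CV and ZV series summarised in Table 4.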
As can be seen from the above description, the above embodiments of the present invention achieve the following technical effects. Based on sequencing data, specific regions are divided on the principle of an equal number of non-repetitive sequences, which avoids the data fluctuation caused by uneven numbers of non-repetitive sequences in the specific regions and in turn improves the correlation of nucleic acid data parameters between chromosomes. The parameters of the clinically relevant chromosome in the biological sample are compared with those of other, clinically non-relevant chromosome regions to determine whether chromosomal aneuploidy is present in the sample to be tested. The method achieves single-sample detection without requiring standard normal samples, removes the dependence on experimental conditions, speeds up analysis and generally improves the assessment of chromosomal abnormalities. It provides a simple, fast and accurate means of detecting chromosomal aneuploidy, with an accuracy above 99% and a false-positive rate below 1% for autosome detection. Compared with the various existing methods, this method reduces the false-negative rate; compared with existing single-sample methods, it requires less sequencing data.
Obviously, those skilled in the art should understand that some of the modules, elements or steps of the present application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present application is thus not restricted to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.