Detailed Description of the Embodiments
The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown.
Fig. 1 is a flowchart of a method for detecting phenotype genes and performing bioinformatic analysis according to an embodiment of the invention.
As shown in Fig. 1, the method 100 for detecting phenotype genes and performing bioinformatic analysis comprises the following steps.
Step 102: the sequence of the target gene is locally aligned against the nucleotide sequences of the predicted genomic coding regions of each closely related species, and homologous sequences are obtained from the local alignment results. For example, the BLAST software can be used for the local alignment, where the reference sequences are the nucleotide sequences of the genomic coding regions of a given closely related species and the query sequence is the protein sequence of the target gene. The database is built with the parameters "-p F -o T", and the alignment is run with the parameters "-p tblastn -e 1e-5". For the alignment, the "-p" parameter selects the comparison type and accepts five values; "tblastn" is a single value and means that a protein sequence is compared against a nucleic acid database. For database building, the "-p" parameter gives the type of database and takes one of two values: "F" for a nucleic acid database and "T" for a protein database. The "-o" parameter determines whether sequence names are parsed and a sequence-name index is built; it also takes one of two values, where "F" means that no sequence-name index is built and "T" means that the index is built. The "-e" parameter is the expectation value (E-value): the number of times an alignment of this quality would be expected to arise by chance alone. The closer this value is to zero, the less likely it is that the alignment is a chance event; from the point of view of the search, a smaller E-value means a more significant alignment. "1e-5" is the preset E-value threshold.
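The database-building and alignment settings above can be sketched as command lines. This is a minimal sketch, assuming the legacy NCBI BLAST executables `formatdb` and `blastall`, which the text does not name but whose options match "-p F -o T" and "-p tblastn -e 1e-5". The file names and the tabular output option `-m 8` are illustrative assumptions, not part of the described method.

```python
def formatdb_command(fasta_path):
    """Argument list for building a nucleotide database with a name index.

    Assumes the legacy NCBI `formatdb` tool: -p F = nucleic acid database,
    -o T = parse sequence names and build the index.
    """
    return ["formatdb", "-i", fasta_path, "-p", "F", "-o", "T"]


def tblastn_command(query_path, db_path, out_path, evalue="1e-5"):
    """Argument list for a protein-versus-nucleotide-database search.

    Assumes the legacy NCBI `blastall` tool. "-m 8" (tabular output) is an
    added convenience for later filtering, not a setting from the text.
    """
    return ["blastall", "-p", "tblastn", "-d", db_path, "-i", query_path,
            "-e", evalue, "-m", "8", "-o", out_path]


# Hypothetical file names, for illustration only.
print(" ".join(formatdb_command("grape_cds.fa")))
print(" ".join(tblastn_command("target_protein.fa", "grape_cds.fa", "hits.tsv")))
```

The argument lists could be passed to `subprocess.run` on a machine where those tools are installed; the sketch only builds them so that it runs without BLAST present.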
Step 104: the homologous sequences are screened to obtain homologous genes. For example, the homologous sequences obtained from the local alignment in step 102 are screened. As those skilled in the art know, the homologous sequence data can be screened by parameters such as sequence similarity (Identity), reference sequence coverage ((Subject_end - Subject_start + 1) / Subject_length), query sequence coverage ((Query_end - Query_start + 1) / Query_length) and expectation value (E_value). Those skilled in the art also know that further criteria can be added on top of this screening, as illustrated in other embodiments described later.
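The screening criteria above can be sketched as a simple filter. This is a minimal sketch: the `Hit` record and its field names are hypothetical, and the default thresholds (0.3 for similarity and both coverages, 1e-5 for the E-value) are borrowed from the worked example later in this description rather than prescribed for every use.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One local alignment hit (hypothetical record layout)."""
    identity: float       # fraction of identical positions, 0..1
    query_start: int
    query_end: int
    query_length: int
    subject_start: int
    subject_end: int
    subject_length: int
    evalue: float


def coverage(start, end, length):
    """(end - start + 1) / length; abs() tolerates reversed coordinates."""
    return (abs(end - start) + 1) / length


def passes(hit, min_identity=0.3, min_coverage=0.3, max_evalue=1e-5):
    """True if the hit meets all four screening criteria."""
    subject_cov = coverage(hit.subject_start, hit.subject_end, hit.subject_length)
    query_cov = coverage(hit.query_start, hit.query_end, hit.query_length)
    return (hit.identity > min_identity
            and subject_cov > min_coverage
            and query_cov > min_coverage
            and hit.evalue < max_evalue)
```

Keeping the thresholds as keyword arguments makes it easy to tighten or relax them per species without touching the coverage formulas.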
Step 106: the DNA sequences are extracted, translated into protein sequences, globally aligned, and then converted back into DNA sequences. For example, the extracted coding DNA sequences are translated into protein sequences, globally aligned with the MUSCLE (MUltiple Sequence Comparison by Log-Expectation) software, and then converted back into DNA sequences. MUSCLE is open-source software for multiple sequence alignment at the protein level, released in 2004; owing to its speed and accuracy, it has become a tool that those skilled in the art commonly use for global alignment. The core idea of the MUSCLE algorithm is to first obtain an initial multiple sequence alignment by progressive alignment and then improve that alignment by iterative horizontal refinement.
In the present invention, the protein sequences are converted back into DNA sequences after global alignment for the following reasons. Three bases encode one amino acid, but one amino acid may correspond to several triplet codons; aligning at the amino acid level is therefore more accurate and never splits a triplet codon. The later comparative analysis of different genes is based on differences in the coding bases, and converting the aligned amino acid sequences back into nucleotide sequences shows these differences better.
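The conversion back to DNA can be illustrated by threading each gene's original codons onto its aligned protein sequence, so that every gap in the protein alignment becomes a three-base gap. This is a minimal sketch, assuming gaps are written as "-" and that the coding sequence holds exactly one codon per residue with no trailing stop codon; the function name is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an aligned protein sequence back onto its coding DNA.

    Each residue is replaced by its original codon and each gap
    by '---', so no triplet codon is ever split.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining)
                   for aa in aligned_protein)


# A two-residue protein aligned with one gap between the residues.
print(thread_codons("M-K", "ATGAAA"))
```

Using the original codons, rather than back-translating from the amino acids, preserves the synonymous differences on which the later Dn/Ds analysis depends.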
Step 108: a gene tree is built from the converted DNA sequences, and Dn/Ds is calculated for each branch. Specifically, the Modeltest software selects the best substitution model, and the MrBayes software builds the gene tree from the DNA sequences using Bayesian statistics; the Codeml program in the PAML software package then calculates Dn/Ds for each branch. Modeltest automatically selects the best-fitting model of nucleotide substitution. It compares 56 models and implements three model selection frameworks: hierarchical likelihood ratio tests (hLRTs), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC); the best model is given by these tests. The model used here is "GTR+gamma+I". PAML is a software package for phylogenetic analysis of DNA or protein sequences by maximum likelihood; it was developed by Ziheng Yang and is free for academic research. For example, when Codeml in PAML is used to calculate Dn/Ds for each branch, the parameters can be set as follows: noisy=3, verbose=1, runmode=0, seqtype=1, clock=0, model=1, NSsites=0, icode=0, CodonFreq=2, fix_kappa=0, kappa=4.54006, fix_omega=0, omega=1, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0.
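The parameter list above can be written out as a Codeml control file. This is a minimal sketch, assuming the usual "name = value" control-file layout with `seqfile`, `treefile` and `outfile` entries; the three file names are hypothetical placeholders, and only the parameter values come from the text.

```python
# Branch-wise Dn/Ds settings taken from the parameter list in the text.
BRANCH_DNDS_PARAMS = {
    "noisy": 3, "verbose": 1, "runmode": 0, "seqtype": 1, "clock": 0,
    "model": 1, "NSsites": 0, "icode": 0, "CodonFreq": 2,
    "fix_kappa": 0, "kappa": 4.54006, "fix_omega": 0, "omega": 1,
    "fix_alpha": 1, "alpha": 0, "Malpha": 0, "ncatG": 4,
    "getSE": 0, "RateAncestor": 0, "fix_blength": 1, "method": 0,
}


def control_file(seqfile, treefile, outfile, params):
    """Return control-file text: three file entries, then one line per parameter."""
    lines = [f"seqfile = {seqfile}",
             f"treefile = {treefile}",
             f"outfile = {outfile}"]
    lines += [f"{name} = {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
print(control_file("codon_alignment.phy", "gene_tree.nwk",
                   "branch_dnds.out", BRANCH_DNDS_PARAMS))
```

Holding the settings in a dictionary makes it straightforward to derive the null and alternative configurations of a later test by overriding only the entries that differ.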
Step 110: a positive selection test is performed to verify the positively selected sites. For example, those skilled in the art can perform the positive selection test in a manner disclosed in the prior art to verify the sites under positive selection. Specific ways of performing the positive selection test are illustrated in other embodiments described later.
In this embodiment of the method for detecting phenotype genes and performing bioinformatic analysis, the genes predicted in the genomes of closely related species serve as the reference, and the homologous sequences of the target gene or gene family are screened out by a predefined similarity. The phylogenetic tree is built by the maximum likelihood method, in which the maximum likelihood of the phylogenetic tree is estimated by comparing nucleic acid or amino acid sequences, so that a more accurate phylogenetic topology can be obtained. The method can extract more accurate information, reduces the extent to which pseudogenes mislead the bioinformatic analysis, and helps to clarify more biological questions from the biological information obtained.
Fig. 2 is a flowchart of another embodiment of the method for detecting phenotype genes and performing bioinformatic analysis provided by the invention.
As shown in Fig. 2, the method 200 for detecting phenotype genes and performing bioinformatic analysis comprises steps 202, 204, 205, 206, 208 and 210. Steps 202, 206, 208 and 210 can carry out technical content that is the same as or similar to that of steps 102, 106, 108 and 110 shown in Fig. 1, respectively; for brevity, that content is not repeated here.
As shown in Fig. 2, step 204 is executed after step 202: the homologous sequences are screened by similarity and coverage. For example, given the database-building parameters "-p F -o T" and the alignment parameters "-p tblastn -e 1e-5", a similarity threshold and coverage thresholds (for example a reference sequence coverage threshold and/or a query sequence coverage threshold) are set, and an expectation value threshold may also be set, to perform a preliminary screening of the homologous sequences obtained from the local alignment. Step 205 is executed next: the homologous genes are screened by their Gene Ontology (GO) and IPR annotation information. The homologous genes obtained from the preliminary screening are screened a second time using their annotation information from the gene function annotation system GO (Gene Ontology) and the gene structure annotation system IPR (InterPro record). GO is an ontology widely used in bioinformatics that comprises three branches: biological process, molecular function and cellular component; it is a system that annotates genes by function. IPR is a second system, which annotates genes by structure.
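The second screening can be sketched as a set comparison of annotation terms. This is a minimal sketch under a stated assumption: the text does not define the matching rule, so the rule used here (a candidate is kept if it shares at least one GO or IPR term with the target gene) is hypothetical, as are the function name and the placeholder identifiers.

```python
def filter_by_annotation(target_terms, candidates):
    """Keep candidates sharing at least one annotation term with the target.

    target_terms: iterable of GO/IPR identifiers for the target gene.
    candidates:   dict mapping gene name -> iterable of GO/IPR identifiers.
    The "share at least one term" rule is an assumption, not from the text.
    """
    wanted = set(target_terms)
    return {gene: sorted(set(terms) & wanted)
            for gene, terms in candidates.items()
            if set(terms) & wanted}


# Placeholder identifiers; not real annotations of any gene.
target = ["GO:0000001", "IPR000001"]
candidates = {
    "geneA": ["GO:0000001", "GO:0000002"],
    "geneB": ["GO:0000003"],
    "geneC": ["IPR000001"],
}
print(filter_by_annotation(target, candidates))
```

A stricter rule, such as requiring both a shared GO term and a shared IPR term, would only change the condition inside the comprehension.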
In this embodiment of the method for detecting phenotype genes and performing bioinformatic analysis, the genes predicted in the genomes of closely related species serve as the reference, and the homologous sequences of the target gene or gene family are screened out by a predefined similarity. Sequences with larger homologous regions are selected by screening, and functionally closer genes are then screened out using the annotation information of these genes; this technique screens out homologous genes within a more precise range. By comparing the phenotype gene with the annotated gene set, the invention detects homologous genes within a more precise range and further reduces the chance of selecting pseudogenes.
Fig. 3 is a flowchart of another embodiment of the method for detecting phenotype genes and performing bioinformatic analysis provided by the invention.
As shown in Fig. 3, the method 300 for detecting phenotype genes and performing bioinformatic analysis comprises steps 302, 304, 306, 308, 310 and 311. Steps 302, 304, 306 and 308 can carry out technical content that is the same as or similar to that of steps 102, 104, 106 and 108 shown in Fig. 1, respectively; for brevity, that content is not repeated here.
As shown in Fig. 3, step 310 is executed after step 308: the "branch-site A" model in the PAML software package is used to test the positively selected sites. Specifically, the "branch-site A" model is an existing model provided by the PAML software package for analyzing adaptive evolution. For example, in the branch-site model of positive selection, the sites of the branches that may be under positive selection are set as the foreground, and the remaining sites are set as the background. The alternative model allows positive selection at the foreground sites, whereas the null model does not. The null hypothesis parameters can therefore be set as follows: noisy=3, verbose=1, runmode=0, seqtype=1, CodonFreq=2, clock=0, model=2, NSsites=2, icode=0, fix_kappa=0, kappa=4.54006, fix_omega=1, omega=1, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0. The alternative hypothesis parameters are: noisy=3, verbose=1, runmode=0, seqtype=1, CodonFreq=2, clock=0, model=2, NSsites=2, icode=0, fix_kappa=0, kappa=4.54006, fix_omega=0, omega=1.5, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0.
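The null and alternative runs above are typically compared with a likelihood ratio test. This is a minimal sketch under a stated assumption: the text does not give the reference distribution, so a chi-square distribution with one degree of freedom is assumed here, on the grounds that the two parameter sets differ only in whether omega is fixed at 1. The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of the likelihood ratio test with 1 degree of freedom.

    The statistic is 2 * (lnL_alt - lnL_null). For a chi-square variable
    with 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0.0:
        return 1.0  # the alternative model does not fit better
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods from the two Codeml runs.
print(lrt_pvalue(lnl_null=-2045.31, lnl_alt=-2039.87))
```

The closed form with `math.erfc` avoids a dependency on a statistics library; it is valid only for the one-degree-of-freedom case assumed here.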
Step 311: the genes under positive selection are screened out according to the false discovery rate (FDR) and the empirical Bayes posterior probability. Specifically, the FDR threshold can be preset to less than 0.05, and the threshold for the empirical Bayes posterior probability of at least one site can be preset to greater than 0.95; the genes under positive selection are then screened out according to these two thresholds.
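The two thresholds above can be combined as follows. This is a minimal sketch under a stated assumption: the text does not name the FDR procedure, so the Benjamini-Hochberg adjustment is used here as one common choice. The function names and the input layout (one p-value and a list of per-site posterior probabilities per gene) are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def select_genes(results, max_fdr=0.05, min_site_prob=0.95):
    """Keep genes with q < max_fdr and at least one site above min_site_prob.

    results: dict mapping gene name -> (p_value, [site posterior probabilities]).
    """
    genes = list(results)
    qvalues = benjamini_hochberg([results[g][0] for g in genes])
    return [g for g, q in zip(genes, qvalues)
            if q < max_fdr and any(p > min_site_prob for p in results[g][1])]
```

A gene with a significant q-value but no site above the posterior probability threshold is rejected, which matches the requirement that both conditions hold.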
The ratio of nonsynonymous to synonymous substitutions is used as the substitution rate ratio, and the Dn and Ds values of each branch are estimated by likelihood methods; however, comparing Dn and Ds alone cannot establish that positive selection has actually acted. This embodiment of the method therefore also uses statistical methods to check the validity and reliability of the result: the detected genes are screened, and the genes under positive selection are then tested with the branch-site A model. The two hypotheses of the branch-site model are tested by the likelihood method, and the frequency of false positive selection calls is checked, in order to screen the positively selected genes. This ensures the validity and reliability of the phenotype gene detection and supports further bioinformatic analysis and the resolution of biological questions.
Fig. 4 is a flowchart of an embodiment of the method for detecting phenotype genes and performing bioinformatic analysis provided by the invention.
As shown in Fig. 4, the method 400 for detecting phenotype genes and performing bioinformatic analysis comprises:
Step 402: the sequence of the target gene is locally aligned against the nucleotide sequences of the predicted genomic coding regions of each closely related species, and homologous sequences are obtained from the local alignment results. For example, the BLAST software can be used for the local alignment, where the reference sequences are the nucleotide sequences of the genomic coding regions of a given closely related species and the query sequence is the protein sequence of the target gene. The database is built with the parameters "-p F -o T", and the alignment is run with the parameters "-p tblastn -e 1e-5".
Step 404: the homologous sequences are screened by similarity and coverage. For example, given the database-building parameters "-p F -o T" and the alignment parameters "-p tblastn -e 1e-5", a similarity threshold and coverage thresholds (for example a reference sequence coverage threshold and/or a query sequence coverage threshold) are set, and an expectation value threshold may also be set, to perform a preliminary screening of the homologous sequences obtained from the local alignment.
Step 405: the homologous genes are screened by their Gene Ontology (GO) and IPR annotation information. The homologous genes obtained from the preliminary screening are screened a second time using their GO and IPR annotation information.
Step 406: the DNA sequences are extracted, translated into protein sequences, globally aligned, and then converted back into DNA sequences. For example, the extracted coding DNA sequences are translated into protein sequences, globally aligned with the MUSCLE software, and then converted back into DNA sequences.
Step 408: a gene tree is built from the converted DNA sequences, and Dn/Ds is calculated for each branch. Specifically, the Modeltest software selects the best substitution model, and MrBayes builds the gene tree from the DNA sequences using Bayesian statistics; the Codeml program in the PAML software package then calculates Dn/Ds for each branch. For example, when Codeml in PAML is used to calculate Dn/Ds for each branch, the parameters can be set as follows: noisy=3, verbose=1, runmode=0, seqtype=1, clock=0, model=1, NSsites=0, icode=0, CodonFreq=2, fix_kappa=0, kappa=4.54006, fix_omega=0, omega=1, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0.
Step 410: the "branch-site A" model in the PAML software package is used to test the positively selected sites. For example, the null hypothesis parameters are set as follows: noisy=3, verbose=1, runmode=0, seqtype=1, CodonFreq=2, clock=0, model=2, NSsites=2, icode=0, fix_kappa=0, kappa=4.54006, fix_omega=1, omega=1, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0. The alternative hypothesis parameters are: noisy=3, verbose=1, runmode=0, seqtype=1, CodonFreq=2, clock=0, model=2, NSsites=2, icode=0, fix_kappa=0, kappa=4.54006, fix_omega=0, omega=1.5, fix_alpha=1, alpha=0, Malpha=0, ncatG=4, getSE=0, RateAncestor=0, fix_blength=1, method=0.
Step 411: the genes under positive selection are screened out according to the false discovery rate (FDR) and the empirical Bayes posterior probability. Specifically, the FDR threshold can be preset to less than 0.05, and the threshold for the empirical Bayes posterior probability of at least one site can be preset to greater than 0.95; the genes under positive selection are then screened out according to these two thresholds.
Fig. 5 is a flowchart of an embodiment of the method for detecting phenotype genes and performing bioinformatic analysis provided by the invention.
As shown in Fig. 5, the method 500 for detecting phenotype genes and performing bioinformatic analysis comprises:
Step 501: the protein sequence of the target gene is determined.
Step 502: the nucleotide sequences of the predicted genomic coding regions of each closely related species of the target gene are selected.
Step 504: local alignment is performed with the BLAST software, and homologous sequences are obtained from the local alignment results. For example, the BLAST software can be used to locally align the sequence of the target gene against the nucleotide sequences of the genomic coding regions of each closely related species, where the reference sequences are the nucleotide sequences of the genomic coding regions of a given closely related species and the query sequence is the protein sequence of the target gene. The database is built with the parameters "-p F -o T", and the alignment is run with the parameters "-p tblastn -e 1e-5".
Step 506: the homologous sequences are screened by the similarity of the homologous regions. For example, given the database-building parameters "-p F -o T" and the alignment parameters "-p tblastn -e 1e-5", a similarity threshold, coverage thresholds (for example a reference sequence coverage threshold and/or a query sequence coverage threshold) and the like are set to perform a preliminary screening of the homologous sequences obtained from the local alignment.
Step 508: the homologous genes are screened by their Gene Ontology (GO) and IPR annotation information. The homologous genes obtained from the preliminary screening are screened a second time using their GO and IPR annotation information.
Step 510: the DNA sequences are extracted, translated into protein sequences, globally aligned, and then converted back into DNA sequences. For example, the extracted coding DNA sequences are translated into protein sequences, globally aligned with the MUSCLE software, and then converted back into DNA sequences.
Step 512: the Modeltest software selects the best substitution model, and MrBayes builds the gene tree from the DNA sequences.
Step 514: evolutionary rates are compared using the branch lengths of each gene. The order of divergence is compared through the positions of child and parent nodes (that is, the order in which the branches appear). The length of each branch is calculated from the base mutation rate and reflects the divergence of each gene, mainly the time and the rate of divergence.
Step 515: the orthology and paralogy relationships of the genes are judged from the topology of the gene tree, and recent large-scale duplication events are inferred.
Step 516: Dn/Ds is calculated for each branch of the gene tree to judge the selection acting on each gene. Specifically, the Codeml program in the PAML software package calculates Dn/Ds for each branch, and the parameters chosen in the preceding embodiments can be used. The Dn/Ds values calculated by Codeml are used to analyze the selection pressure on the homologous genes of each species, where Dn/Ds is the ratio of the nonsynonymous mutation rate (Dn) to the synonymous mutation rate (Ds). This ratio indicates whether selection pressure acts on the protein-coding gene: if Dn/Ds > 1, positive selection is considered to be acting; if Dn/Ds = 1, selection is considered neutral; if Dn/Ds < 1, purifying selection is considered to be acting. The selection acting on each gene is judged in this way.
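The three-way rule above can be written as a small classifier. This is a minimal sketch: the function name is hypothetical, and the tolerance used to treat a ratio as equal to 1 is an added assumption, since estimated values are rarely exactly 1.

```python
def selection_type(dn, ds, tolerance=1e-6):
    """Classify selection from the nonsynonymous and synonymous rates.

    Returns "positive" for Dn/Ds > 1, "neutral" for Dn/Ds = 1 (within the
    assumed tolerance) and "purifying" for Dn/Ds < 1.
    """
    if ds <= 0:
        raise ValueError("Ds must be positive to form the ratio Dn/Ds")
    omega = dn / ds
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


# Hypothetical per-branch estimates.
for dn, ds in [(0.30, 0.10), (0.10, 0.10), (0.05, 0.20)]:
    print(dn, ds, selection_type(dn, ds))
```

As the text notes for other embodiments, a ratio above 1 on its own is only suggestive; the branch-site test and FDR screening are what support a positive selection call.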
Step 517: the "branch-site A" model in the PAML software package is used to test the positively selected sites, and a false discovery rate (FDR) test is applied to the result. For example, foreground hypothesis parameters and background hypothesis parameters are set, and the "branch-site A" model in the PAML software package is used to test the positively selected sites. The genes under positive selection are screened out according to the FDR and the empirical Bayes posterior probability. Specifically, the FDR threshold can be preset to less than 0.05, and the threshold for the empirical Bayes posterior probability of at least one site can be preset to greater than 0.95; the genes under positive selection are then screened out according to these two thresholds.
In this embodiment of the method for detecting phenotype genes and performing bioinformatic analysis, the genes predicted in the genomes of closely related species serve as the reference, and the homologous sequences of the target gene or gene family are screened out by a predefined similarity. Sequences with larger homologous regions are selected by screening, and functionally closer genes are then screened out using the annotation information of these genes; this technique screens out homologous genes within a more precise range. By comparing the phenotype gene with the annotated gene set, the invention detects homologous genes within a more precise range and further reduces the chance of selecting pseudogenes. In addition, the invention compares orthology and paralogy relationships through the topology of the gene tree and infers whether a large-scale duplication event has occurred recently. For the comparison of Dn/Ds, software with multi-parameter models is used, which can be adjusted to the characteristics of different species. The type of natural selection pressure on each gene can thus be shown relatively accurately, new genes under positive selection can be detected, and the genes that contribute to the enhancement of the phenotype can be inferred. The invention also uses statistical methods to check the validity and reliability of the result: the detected genes are screened, and the genes under positive selection are then tested with the branch-site A model. The two hypotheses of the branch-site model are tested by the likelihood method, and the frequency of false positive selection calls is checked, in order to screen the positively selected genes. This ensures the validity and reliability of the phenotype gene detection and supports further bioinformatic analysis and the resolution of biological questions.
Fig. 6 is a schematic structural diagram of a system for detecting phenotype genes and performing bioinformatic analysis according to an embodiment of the invention.
As shown in Fig. 6, the system 600 for detecting phenotype genes and performing bioinformatic analysis comprises a local alignment module 602, a homologous gene screening module 604, a DNA sequence conversion module 606, a gene tree construction module 608 and a positive selection test module 610, where:
The local alignment module 602 locally aligns the sequence of the target gene against the nucleotide sequences of the predicted genomic coding regions of each closely related species and obtains homologous sequences from the local alignment results. For example, the local alignment module 602 can use the BLAST software for the local alignment, where the reference sequences are the nucleotide sequences of the genomic coding regions of a given closely related species and the query sequence is the protein sequence of the target gene. The database is built with the parameters "-p F -o T", and the alignment is run with the parameters "-p tblastn -e 1e-5".
The homologous gene screening module 604 screens the homologous sequences to obtain homologous genes. For example, the homologous gene screening module 604 screens the homologous sequences obtained from the local alignment performed by the local alignment module 602. As those skilled in the art know, the homologous sequence data can be screened by parameters such as sequence similarity, reference sequence coverage, query sequence coverage and expectation value (E_value).
The DNA sequence conversion module 606 extracts the DNA sequences, translates them into protein sequences, performs global alignment, and converts the result back into DNA sequences. For example, the DNA sequence conversion module 606 translates the extracted coding DNA sequences into protein sequences, globally aligns them with the MUSCLE software, and then converts them back into DNA sequences.
The gene tree construction module 608 builds a gene tree from the converted DNA sequences and calculates Dn/Ds for each branch. Specifically, MrBayes builds the gene tree from the DNA sequences using Bayesian statistics, and the Codeml program in the PAML software package calculates Dn/Ds for each branch.
The positive selection test module 610 tests the positively selected sites. Further, the positive selection test module uses the "branch-site A" model in the PAML software package to test the positively selected sites, and screens out the genes under positive selection according to the false discovery rate (FDR) and the empirical Bayes posterior probability. For details of the procedure, refer to the description of the method embodiments; they are not repeated here.
In one embodiment of the system for detecting phenotype genes and performing bioinformatic analysis, the homologous gene screening module further screens the homologous sequences by similarity and coverage, and screens the homologous genes by their Gene Ontology (GO) and IPR annotation information. For details of the procedure, refer to the description of the method embodiments; they are not repeated here.
In one embodiment of the system for detecting phenotype genes and performing bioinformatic analysis, the FDR threshold is preset to less than 0.05, and the threshold for the empirical Bayes posterior probability of at least one site is preset to greater than 0.95; the genes under positive selection are then screened out according to these two thresholds.
In this embodiment of the system for detecting phenotype genes and performing bioinformatic analysis, the local alignment module uses the genes predicted in the genomes of closely related species as the reference, and the homologous gene screening module screens out the homologous sequences of the target gene or gene family by a predefined similarity. Sequences with larger homologous regions are selected by screening, and functionally closer genes are then screened out using the annotation information of these genes; this technique screens out homologous genes within a more precise range. By comparing the phenotype gene with the annotated gene set, the invention detects homologous genes within a more precise range and further reduces the chance of selecting pseudogenes. In addition, the invention uses statistical methods to check the validity and reliability of the result: the detected genes are screened, and the genes under positive selection are then tested by the positive selection test module. The two hypotheses of the branch-site model are tested by the likelihood method, and the frequency of false positive selection calls is checked, in order to screen the positively selected genes. This ensures the validity and reliability of the phenotype gene detection and supports further bioinformatic analysis and the resolution of biological questions.
Fig. 7 is a schematic diagram of the topology of the gene tree of the phenotype gene (AtHKT1) of little salt mustard selected for the present invention.
The protein sequence of the phenotype gene (AtHKT1) of little salt mustard is chosen as the target gene sequence, and its closely related species are grape, poplar, rice and Arabidopsis. The BLAST software is used to locally align the protein sequence of the phenotype gene (AtHKT1) of little salt mustard against the nucleotide sequences of the predicted genomic coding regions of grape, poplar, rice and Arabidopsis. The database is built with the parameters "-p F -o T", and the alignment is run with the parameters "-p tblastn -e 1e-5". A first screening is performed with homology similarity > 0.3, reference sequence coverage > 0.3, query sequence coverage > 0.3 and expectation value < 1e-5, and the result is screened again using the GO and IPR annotation information. Table 1 shows the homologous genes of the phenotype gene (AtHKT1) of little salt mustard and their annotation results. As shown in Table 1, a total of 18 new genes are screened out in this way, and most of them are related to metal transport.
The sequences of these genes are extracted, and a gene tree is built with the MrBayes software using Bayesian statistics, giving the topology shown in Fig. 7. Fig. 7 shows the new genes screened from grape, poplar, rice and Arabidopsis that are similar to the phenotype gene (AtHKT1) of little salt mustard. The two regions at the upper left of the figure are nucleotide sequences of predicted genomic coding regions of the closely related species grape, the large region at the lower right is that of the closely related species rice, and the region at the upper right is that of the closely related species Arabidopsis. From Fig. 7, those skilled in the art can clearly see that the homologs of the phenotype gene (AtHKT1) of little salt mustard have undergone large-scale duplication in grape, rice and Arabidopsis. In addition, because the paralogous relationships are close, gene divergence within each species occurred relatively recently, whereas functionally similar genes diverged between species earlier. Comparing the branch lengths shows that the divergence rate is higher in poplar and rice. (The genes ending in POPTR in Fig. 7 were screened from poplar, and their branches are relatively long. Genes marked with color in the figure have undergone massive duplication, so the poplar gene is the only one not marked with color.)
Each branch of the gene tree of the phenotype gene (AtHKT1) of little salt mustard shown in Fig. 7 is then analyzed, and Fig. 8 is drawn from the calculated "Dn/Ds" values. Fig. 8 is a schematic diagram of the "Dn/Ds" results for each branch of the gene tree of the phenotype gene (AtHKT1) of little salt mustard according to the present invention. Positively selected genes are tested with branch-site model A in the PAML software package, and genes with an FDR q_value < 0.05 are taken as positively selected, giving the positive selection test results shown in Table 2. Table 2 shows the positive selection test results for the homologous genes of the phenotype gene (AtHKT1) of little salt mustard; 7 genes are detected as being under positive selection.
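The statistical step above can be illustrated as follows. This is a sketch under stated assumptions, not the procedure of the PAML package itself: it assumes that the null and alternative log-likelihoods of the branch-site test have already been obtained for each gene, that the likelihood ratio statistic is compared with a chi-square distribution with 1 degree of freedom, and that the q_value is computed by the Benjamini-Hochberg procedure. The description states only "FDR" and "q_value < 0.05", so both the chi-square choice and the Benjamini-Hochberg procedure are assumptions, and all function names and numbers are hypothetical.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """Likelihood ratio test p-value, assuming a chi-square null with 1 d.f."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def positively_selected(lnl_by_gene, q_threshold=0.05):
    """Return the gene ids whose q-value is below the threshold."""
    genes = list(lnl_by_gene)
    pvals = [lrt_pvalue(*lnl_by_gene[g]) for g in genes]
    qvals = bh_qvalues(pvals)
    return [g for g, q in zip(genes, qvals) if q < q_threshold]


# Made-up (null, alternative) log-likelihoods; not results from Table 2.
example_lnl = {
    "gene_a": (-2500.0, -2490.0),
    "gene_b": (-3100.0, -3099.5),
    "gene_c": (-1800.0, -1796.0),
}
selected = positively_selected(example_lnl)
print(selected)
```

Correcting for multiple tests matters here because one test is run per gene or branch, so an uncorrected threshold of 0.05 would admit more false positives as the number of tested genes grows.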
With reference to the foregoing exemplary description of the present invention, those skilled in the art can clearly see that the present invention has the following advantages:
1. The embodiments of the method and system for detecting phenotype genes and analyzing biological information provided by the present invention refer to the genes predicted in the genomes of closely related species and screen out the homologous sequences of the target gene or gene family according to a predetermined similarity. A phylogenetic tree is built by the maximum likelihood method, in which the maximum likelihood of the phylogenetic tree is estimated by comparing nucleic acid or amino acid sequences, so that a more accurate phylogenetic topology can be obtained. The method for detecting phenotype genes and analyzing biological information provided by the present invention can mine more accurate information from the acquired biological information, reduces the misleading effect of pseudogenes on the analysis, and helps to elucidate more biological questions.
2. The embodiments of the method and system for detecting phenotype genes and analyzing biological information provided by the present invention refer to the genes predicted in the genomes of closely related species and screen out the homologous sequences of the target gene or gene family according to a predetermined similarity. Sequences with more homologous regions are selected by screening, and functionally closer genes are then screened out in combination with the annotation information of these genes; this technical means allows homologous genes to be screened within a more accurate range. By comparing the phenotype gene with the annotated gene set, the present invention can detect homologous genes within a more accurate range, which further reduces the possibility of selecting pseudogenes.
3. The embodiments of the method and system for detecting phenotype genes and analyzing biological information provided by the present invention use statistical methods to check the authenticity and reliability of the detected genes; that is, after the detected genes are screened, the genes under positive selection are tested with branch-site model A. The two hypotheses of the branch-site model are tested by the likelihood method, and the frequency of falsely detected positive selection is checked in order to screen out positively selected genes, thereby ensuring the authenticity and reliability of phenotype gene detection and providing a safeguard for further analysis of biological information and for solving biological questions.
4. The embodiments of the method for detecting phenotype genes and analyzing biological information provided by the present invention compare the orthology and paralogy relationships through the topology of the gene tree and infer whether a large-scale duplication event has occurred recently. For the comparison of Dn/Ds, software with a multi-parameter model is used, which can be adjusted according to the characteristics of the species. The type of natural selection pressure acting on each gene can thus be shown relatively accurately, new genes under positive selection can be detected by this method, and it can be inferred which genes contribute to the enhancement of the phenotype.
5. The embodiments of the method for detecting phenotype genes and analyzing biological information provided by the present invention are widely applicable and suitable for many kinds of organisms; detection and analysis are fast, and the accuracy is high. Specifically, the method can be used for phenotype-related information analysis of animal and plant populations. Starting from a known phenotype gene, multiple new phenotype-related genes can be found in the same species and in closely related species, which provides strong support for the mining and analysis of functional genes in many species.
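The notion of "type of natural selection pressure" mentioned in advantage 4 can be illustrated with the conventional reading of the Dn/Ds ratio: a ratio above 1 suggests positive selection, a ratio near 1 suggests neutral evolution, and a ratio below 1 suggests purifying selection. The sketch below is illustrative only and is not the branch-site test used in the embodiment; the function name and the tolerance `tol` around 1 are hypothetical choices.

```python
def selection_type(dn: float, ds: float, tol: float = 0.05) -> str:
    """Classify selection pressure from Dn and Ds (illustrative thresholds).

    tol is a hypothetical tolerance around Dn/Ds = 1; it is not taken
    from the description.
    """
    if ds <= 0:
        # The ratio is undefined when there are no synonymous substitutions.
        return "undefined"
    omega = dn / ds
    if omega > 1 + tol:
        return "positive"
    if omega < 1 - tol:
        return "purifying"
    return "neutral"


# Made-up values for illustration only.
labels = [selection_type(0.30, 0.10), selection_type(0.01, 0.20)]
print(labels)
```

A single ratio per gene or branch is a coarse summary; a statistical test such as the branch-site test described in the embodiment is needed before a gene is reported as positively selected.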
The description of the present invention is given for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be obvious to those of ordinary skill in the art. The functional modules described in the present invention and the manner of dividing them merely illustrate the idea of the invention; those skilled in the art can, according to the teaching of the present invention and the needs of practical application, freely change the division of the functional modules and their structure to realize the same functions. The embodiments were chosen and described in order to better explain the principles and practical application of the present invention, and to enable those of ordinary skill in the art to understand the invention and to design various embodiments, with various modifications, suited to particular uses.
Table 1. Homologous genes and annotation results of the phenotype gene (AtHKT1) of little salt mustard
Table 2. Positive selection test results for the homologous genes of the phenotype gene (AtHKT1) of little salt mustard