CN114639442B

CN114639442B - Method and system for predicting open reading frame based on single nucleotide polymorphism

Info

Publication number: CN114639442B
Application number: CN202210325529.5A
Authority: CN
Inventors: 宋波; 姜梦云; 宁卫东; 程时锋
Original assignee: Agricultural Genomics Institute at Shenzhen of CAAS
Current assignee: Agricultural Genomics Institute at Shenzhen of CAAS
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2024-01-30
Anticipated expiration: 2042-03-30
Also published as: CN114639442A

Abstract

The invention discloses a method and a system for predicting open reading frames based on single nucleotide polymorphisms. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data to test and screen open reading frames in the gene sequences under examination, counts the codon usage frequencies in those open reading frames, and combines the 3-base periodicity of nucleotide polymorphism with the codon usage statistics to evaluate, by statistical analysis, the prediction probability of small open reading frames, thereby achieving accurate prediction of small open reading frames in a genome.

Description

A method and system for predicting open reading frames based on single nucleotide polymorphisms

Technical field

The invention belongs to the field of biotechnology, and specifically relates to a method for predicting open reading frames based on single nucleotide polymorphisms and to a system for predicting open reading frames.

Background

An open reading frame (ORF) is a sequence within DNA that has the potential to encode a protein. Annotating ORFs in a genome is one of the most important steps required for downstream analysis and for use of a reference genome. Various algorithms have been developed to predict ORFs in genomes, but these sequence-based methods cannot predict small open reading frames (small ORFs, sORFs). Recent studies have shown that sORF-encoded polypeptides shorter than 100 amino acids play important roles in plant responses to abiotic and biotic stress, in human carcinogenesis, and in some biological processes related to cancer treatment. Prediction of sORFs has long been problematic because they are short and use non-canonical start codons (CUG, GUG, UUG).

In the prior art, ribosome profiling (Ribo-seq) analyzes ribosome-protected mRNA footprints (RPFs) and can be used to accurately predict translated sORFs in many species, including yeast, humans, animals and plants. However, most of these species are simple model organisms, usually with homozygous diploid genomes, and applications of Ribo-seq to complex genomes have rarely been reported. The footprint of a typical eukaryotic ribosome is 28 bases long, which is too short for precise mapping of the sequence, and this problem is more pronounced in complex polyploid genomes. Many plant genomes are highly repetitive, highly heterozygous, complex polyploid genomes, which greatly limits the application of Ribo-seq in these plants. Since many important crops, such as wheat (hexaploid) and cotton (tetraploid), are polyploid, new methods and tools are needed to identify small coding frames in complex polyploid genomes.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art and to provide a method and a system for predicting open reading frames based on single nucleotide polymorphisms. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data, introduces codon usage frequency, and evaluates the prediction probability of a coding frame by statistical analysis, thereby predicting coding frames in complex genomes.

The purpose of the present invention is achieved by the following technical solution: a method for predicting open reading frames based on single nucleotide polymorphisms, comprising the following steps:

S1. Obtain the transcript information to be predicted and extract candidate long open reading frames;

S2. Evaluate the pattern of single nucleotide polymorphism variation in the candidate long open reading frames to be predicted, and screen for true long open reading frames according to a preset first screening condition;

S3. Count the usage frequency of each codon in the true long open reading frames;

S4. Extract candidate open reading frames from the transcript information, evaluate the pattern of single nucleotide polymorphism variation and the codon usage frequency in the candidate open reading frames to be predicted, and take the candidate open reading frames that meet a preset second screening condition as the prediction result.

Further, the candidate long open reading frames and the candidate open reading frames are extracted on the following basis: each begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, and has a sequence length that is an integer multiple of 3.

Further, the length of a candidate long open reading frame is greater than 900 bp, and the length of a candidate open reading frame is greater than 100 bp.

Further, evaluating the pattern of single nucleotide polymorphism variation in the candidate long open reading frames to be predicted comprises:

S21. Obtain the population variation data of the sample to be predicted, and calculate the nucleotide diversity value of each site in the candidate long open reading frame to be predicted;

S22. Test separately whether the nucleotide diversity value of base 3n in the candidate long open reading frame is greater than that of base 3n-2 and that of base 3n-1, where 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame, to obtain P₁ and P₂, and calculate the combined P value.

Further, the first screening condition is that the P value is less than 0.0001.

Further, evaluating the pattern of single nucleotide polymorphism variation and the codon usage frequency in the candidate open reading frames to be predicted comprises:

S41. Obtain the population variation data of the sample to be predicted, and calculate the nucleotide diversity value of each site in the candidate open reading frame to be predicted;

S42. Test separately whether the nucleotide diversity value of base 3n in the candidate open reading frame is greater than that of base 3n-2 and that of base 3n-1, where 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain P₁' and P₂'; test separately whether the codon usage frequency of the triplets starting at base 3n-2 in the candidate open reading frame is higher than that of the triplets starting at base 3n-1 and that of the triplets starting at base 3n, to obtain P₃' and P₄'; and calculate the P' value obtained by combining the four values P₁', P₂', P₃' and P₄'.

Further, the second screening condition is to control the false discovery rate (FDR) of the P' values that meet a preset requirement, with FDR ≤ 0.0001.

Further, the preset requirement is that the P' value is less than 0.05.

Another purpose of the present invention is to provide a system for predicting open reading frames based on single nucleotide polymorphisms, comprising a processor and a storage medium, the storage medium storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed, perform the above method for predicting open reading frames.

The beneficial effects of the present invention are:

1) The invention exploits the 3-base periodicity of nucleotide polymorphism within gene coding sequences in population genomic variation data. The third base of a codon in a coding sequence is usually degenerate, so mutations there are more likely to escape natural selection, and the third codon position therefore shows higher polymorphism in natural populations. By searching the population polymorphism data for sequence segments with significant 3-base periodicity, the translation phase of an open reading frame is determined, its start and stop sites are then identified, and the prediction of the open reading frame is completed. By introducing codon usage frequency and evaluating the prediction probability of an open reading frame by statistical analysis, open reading frames in the genome are predicted accurately. The method of the invention is also applicable to the prediction and identification of small open reading frames in complex polyploid genomes, which helps to advance research on and development of complex polyploid genomes.

2) The invention also provides a system that applies the method of the invention to predict open reading frames. The steps of the method are implemented as a computer program running on a computer: after the user inputs the necessary information, such as the population variation data and the transcripts of the sample to be predicted, the program outputs the prediction result. This improves the efficiency with which the method can be used and promotes its application in research on complex polyploid genomes.

Description of the drawings

Figure 1 is a schematic diagram of the technical route of the present invention.

Figure 2 is a flow chart of the method of the present invention.

Figure 3 shows two examples of open reading frames predicted in Example 1 of the present invention.

Figure 4 shows the evaluation of prediction performance in Example 1 of the present invention, that is, the performance of the open reading frames identified from cotton SNPs by the method of the invention.

Figure 5 shows the prediction results for small open reading frames in Example 1 of the present invention.

Figure 6 shows the supporting evidence from protein mass spectrometry data in Example 1 of the present invention.

Figure 7 shows two examples of open reading frames predicted in Example 2 of the present invention.

Figure 8 shows the evaluation of prediction performance in Example 2 of the present invention, that is, the performance of the open reading frames identified from wheat SNPs by the method of the invention.

Figure 9 shows the prediction results for small open reading frames in Example 2 of the present invention.

Figure 10 shows the supporting evidence from protein mass spectrometry data in Example 2 of the present invention.

Detailed description

The technical solution of the present invention is described clearly and completely below with reference to the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

As shown in Figures 1 and 2, the present invention provides a method for predicting open reading frames based on single nucleotide polymorphisms, comprising the following steps:

S1. Obtain the transcript information to be predicted and extract candidate long open reading frames.

Transcript sequence information is obtained from the genome sequence to be predicted, and candidate long open reading frames longer than 900 bp are extracted from it. A candidate long open reading frame is extracted on the following basis: it begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, has a sequence length that is an integer multiple of 3, and is longer than 900 bp.
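The extraction rule above can be sketched as follows. This is only an illustrative sketch: the function name `extract_orfs` and its parameters are hypothetical, the ORF length is assumed to include the stop codon, and every in-frame AUG is reported (the text does not say whether nested start codons sharing one stop codon should be collapsed).

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def extract_orfs(transcript, min_len):
    """Return (start, end) pairs of AUG...stop ORFs longer than min_len nt.

    Coordinates are 0-based with an exclusive end. The length counts the
    stop codon (an assumption; the source does not specify this).
    """
    rna = transcript.upper().replace("T", "U")  # accept DNA or RNA input
    orfs = []
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame fixed by this AUG until the
        # first in-frame stop codon; the length is then a multiple of 3.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start > min_len:
                    orfs.append((start, end))
                break
    return orfs
```

Under these assumptions, `extract_orfs(seq, 900)` would give the candidate long open reading frames of step S1, and `extract_orfs(seq, 100)` the candidate open reading frames of step S4.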

S2. Evaluate the pattern of single nucleotide polymorphism variation in the candidate long open reading frames to be predicted, and screen for true long open reading frames according to the preset first screening condition.

Evaluating the pattern of single nucleotide polymorphism variation in a candidate long open reading frame mainly comprises the following steps:

S21. Obtain the population variation data of the genome to be predicted and calculate the nucleotide diversity value of each site in the candidate long open reading frame to be predicted. These per-site nucleotide diversity values form the basis of the screening condition.
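The text does not state how the per-site nucleotide diversity is computed. A common estimator, assumed here purely for illustration, is the unbiased per-site heterozygosity π = n/(n-1) · (1 - Σ pᵢ²), where n is the number of sampled alleles at the site and pᵢ are the allele frequencies. The function names and the input layout (a mapping from transcript position to allele counts, as might be derived from a VCF file) are hypothetical.

```python
def site_diversity(allele_counts):
    """Unbiased per-site nucleotide diversity from allele counts at one site."""
    n = sum(allele_counts)
    if n < 2:
        return 0.0  # diversity is undefined for fewer than two alleles
    homozygosity = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)


def orf_diversity(orf_positions, counts_by_position):
    """Diversity at every ORF position; sites with no variant record get 0."""
    return [
        site_diversity(counts_by_position[p]) if p in counts_by_position else 0.0
        for p in orf_positions
    ]
```

For example, a site where 5 of 10 sampled alleles carry the alternative base gives π = 10/9 × 0.5 ≈ 0.556, while a monomorphic site gives 0.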

S22. Test separately whether the nucleotide diversity value of base 3n in the candidate long open reading frame is greater than that of base 3n-2 and that of base 3n-1, that is, test whether the nucleotide diversity of the third nucleotide of each codon in the candidate long open reading frame is greater than that of the first and of the second nucleotide of the codon. Here 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame. The tests give the results P₁ and P₂, which are merged with the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the combined P value. The calculation is:

P = scipy.stats.combine_pvalues([P₁, P₂]), where P denotes the combined p-value returned by the function (the function also returns a test statistic).

The first screening condition is that the P value is less than 0.0001. When the P value meets the first screening condition, the candidate long open reading frame is judged to be a true long open reading frame.
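Step S22 can be sketched with the standard library alone. Two points are assumptions, not statements of the source: the text does not name the statistical test, so a one-sided exact sign test over codons stands in for it here; and the p-values are merged with Fisher's method, which is the default method of `scipy.stats.combine_pvalues`. All function names are hypothetical.

```python
import math


def sign_test_greater(xs, ys):
    """One-sided exact sign test: p-value for 'xs tend to exceed ys' (paired)."""
    wins = sum(x > y for x, y in zip(xs, ys))
    losses = sum(x < y for x, y in zip(xs, ys))
    m = wins + losses  # tied pairs carry no information and are dropped
    if m == 0:
        return 1.0
    return sum(math.comb(m, k) for k in range(wins, m + 1)) / 2 ** m


def fisher_combine(pvalues):
    """Fisher's method: -2*sum(ln p) follows chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    # half of the Fisher statistic; the floor avoids log(0) on underflow
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    # closed-form chi-square survival function for an even number (2k) of df
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def periodicity_pvalue(pi):
    """Combined P for 'codon position 3 is more diverse than positions 1 and 2'."""
    first, second, third = pi[0::3], pi[1::3], pi[2::3]
    p1 = sign_test_greater(third, first)
    p2 = sign_test_greater(third, second)
    return fisher_combine([p1, p2])
```

A candidate would then pass the first screening condition when `periodicity_pvalue(pi) < 0.0001`. When SciPy is used instead, the combined p-value is the second element of the result of `scipy.stats.combine_pvalues([p1, p2])`.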

S3. Count the usage frequency of each codon in the true long open reading frames: count the number of times each codon occurs in the true long open reading frames and divide it by the total number of codon occurrences. The resulting proportion is the usage frequency of that codon and is taken to represent its usage frequency throughout the genome to be predicted.
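The counting in step S3 amounts to a frequency table over in-frame codons. A minimal sketch follows; the function name `codon_usage` is hypothetical, and stop codons are counted together with the other codons, which the text does not specify.

```python
from collections import Counter


def codon_usage(orf_sequences):
    """Relative frequency of each codon over all supplied in-frame ORFs."""
    counts = Counter()
    for seq in orf_sequences:
        rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
        # split the ORF into consecutive non-overlapping triplets
        counts.update(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

For the two toy ORFs `AUGAAAUAA` and `AUGAAAAAAUAG` there are seven codons in total, so `AAA` receives a frequency of 3/7 and `AUG` a frequency of 2/7.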

S4. Extract candidate open reading frames from the transcript information. A candidate open reading frame is extracted on the following basis: it begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, has a sequence length that is an integer multiple of 3, and is longer than 100 bp. Each extracted candidate is then tested and screened to verify whether it shows the characteristics of an open reading frame, which yields the prediction result.

The testing and screening process evaluates the pattern of single nucleotide polymorphism variation and the codon usage frequency in the candidate open reading frames to be predicted, and mainly comprises the following steps:

S41. Obtain the population variation data of the genome to be predicted and calculate the nucleotide diversity value of each site in the candidate open reading frame to be predicted. These per-site nucleotide diversity values form the basis of the screening condition.

S42. Test separately whether the nucleotide diversity value of base 3n in the candidate open reading frame is greater than that of base 3n-2 and that of base 3n-1, that is, test whether the nucleotide diversity of the third nucleotide of each codon in the candidate open reading frame is greater than that of the first and of the second nucleotide of the codon, where 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame. The tests give the results P₁' and P₂'.

Test separately whether the codon usage frequency of the triplets starting at base 3n-2 in the candidate open reading frame is higher than that of the triplets starting at base 3n-1 and that of the triplets starting at base 3n. The tests give the results P₃' and P₄', which reflect how well the usage of in-frame triplets as codons in the candidate agrees with the statistics obtained in S3, and hence how reliable the candidate is as an open reading frame. P₁', P₂', P₃' and P₄' are merged with the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the combined P' value. The calculation is:

P' = scipy.stats.combine_pvalues([P₁', P₂', P₃', P₄']), where P' denotes the combined p-value returned by the function (the function also returns a test statistic).
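The codon usage comparison can be sketched as below, under stated assumptions: each overlapping triplet is scored with its S3 usage frequency (0 if absent from the table), the in-frame scores are compared with the two shifted frames by a one-sided exact sign test (the source does not name the test), and the four p-values are merged with Fisher's method, the default of `scipy.stats.combine_pvalues`. All names are hypothetical.

```python
import math


def frame_frequencies(orf, usage):
    """S3 usage frequency of triplets starting at bases 3n-2, 3n-1 and 3n."""
    rna = orf.upper().replace("T", "U")
    frames = ([], [], [])
    # stop early enough that the +2 shifted triplet still fits in the ORF
    for i in range(0, len(rna) - 4, 3):
        for offset in range(3):
            triplet = rna[i + offset:i + offset + 3]
            frames[offset].append(usage.get(triplet, 0.0))
    return frames


def sign_test_greater(xs, ys):
    """One-sided exact sign test: p-value for 'xs tend to exceed ys' (paired)."""
    wins = sum(x > y for x, y in zip(xs, ys))
    losses = sum(x < y for x, y in zip(xs, ys))
    m = wins + losses  # ties are dropped
    if m == 0:
        return 1.0
    return sum(math.comb(m, k) for k in range(wins, m + 1)) / 2 ** m


def combined_p_prime(p1, p2, in_frame, plus1, plus2):
    """Merge the two diversity p-values with the two codon usage p-values."""
    p3 = sign_test_greater(in_frame, plus1)
    p4 = sign_test_greater(in_frame, plus2)
    pvals = [p1, p2, p3, p4]
    half = -sum(math.log(max(p, 1e-300)) for p in pvals)  # Fisher statistic / 2
    # closed-form chi-square survival function with 2 * len(pvals) df
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(len(pvals))
    )
```

Here `p1` and `p2` stand for P₁' and P₂' from the diversity tests, and the last codon of the ORF is left out of the comparison because its shifted triplets would run past the end of the sequence.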

The second screening condition is to control the false discovery rate (FDR) of the P' values that meet a preset requirement. Here the preset requirement is a P' value below 0.05, and the FDR is controlled at FDR ≤ 0.0001. The candidate open reading frames that satisfy the second screening condition are the prediction result.
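The text does not name the FDR procedure. The sketch below assumes the Benjamini-Hochberg adjustment and follows one literal reading of the condition: candidates with P' < 0.05 are retained first, and the FDR is then controlled among the retained values. The function names and default thresholds as parameters are illustrative.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # go from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_orfs(p_primes, p_cutoff=0.05, fdr_cutoff=0.0001):
    """Indices of candidates with P' < p_cutoff and adjusted value <= fdr_cutoff."""
    kept = [i for i, p in enumerate(p_primes) if p < p_cutoff]
    q = bh_adjust([p_primes[i] for i in kept])
    return [i for i, qi in zip(kept, q) if qi <= fdr_cutoff]
```

An alternative reading, equally compatible with the wording, would adjust over all candidates and then require both thresholds; only the list passed to `bh_adjust` would change.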

It should be noted that the method of the present invention is not suitable for population genomic data with too little polymorphism. For example, when the population size is below 400, the method is not suited to predicting open reading frames.

The present invention also provides a system for predicting open reading frames based on single nucleotide polymorphisms, comprising a processor and a storage medium. The storage medium may be a magnetic disk, ROM or RAM, and stores machine-readable instructions executable by the processor. These instructions mainly take the form of a computer program that runs on a computer processor and, when executed, performs the above method so as to predict open reading frames.

Example 1: Analysis of cotton population data

The experimental data for this example were downloaded from figshare. The data were published by Li JiangYing et al. in Genome Biology in 2021 in the article "Cotton pan-genome retrieves the lost sequences and genes during domestication and selection", which generated whole-genome resequencing data for 1961 samples.

S1. Extract candidate long open reading frames from the transcript information. Each candidate long open reading frame begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, has a sequence length that is an integer multiple of 3, and is longer than 900 bp.

S2. Test the candidate long open reading frames on the basis of the nucleotide diversity value of each site, and screen them according to the first screening condition. This yielded 4065 true long open reading frames.

S3. Count the usage frequency of each codon in the true long open reading frames obtained in S2.

S4. Extract all candidate open reading frames from the transcript sequences, and test and screen them according to the second screening condition. As shown in Figures 3 and 4, a total of 86889 candidate open reading frames were predicted to be true open reading frames. The recall was 76% (the proportion of known open reading frames in the genome that were recovered, that is, the number of true positives divided by the total number of annotated ORFs, times 100%), the precision was as high as 94% (the proportion of predicted open reading frames that agree with known reading frames, that is, the number of true positives divided by the total number of predicted ORFs, times 100%), and the composite score was 84% [composite score = 2 × recall × precision / (recall + precision)].
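The composite score defined above is the harmonic mean of recall and precision. As a quick arithmetic check of the reported figures (the function name is illustrative):

```python
def composite_score(recall, precision):
    """Composite score = 2 * recall * precision / (recall + precision)."""
    return 2 * recall * precision / (recall + precision)


# Recall 76% and precision 94%, as reported for the cotton data set
cotton_score = composite_score(0.76, 0.94)
print(f"{cotton_score:.0%}")
```

With recall 0.76 and precision 0.94 the expression evaluates to about 0.8405, which rounds to the 84% stated in the text.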

As shown in Figure 5, the predictions also include 4704 small open reading frames: 1182 uORFs, 316 ouORFs, 2110 dORFs, 557 odORFs, 477 internal ORFs and 62 truncated ORFs. In Figure 6, the dotted line indicates the level of support that the protein mass spectrometry data give to known ORFs in the genome. Analysis of published protein mass spectrometry data shows that the predicted small open reading frames are all well supported.

Example 2: Analysis of wheat population data

The experimental data for this example were downloaded from NCBI (accession PRJNA476679) and CNCB (accession GVM000082). The first data set was published by Cheng Hong et al. in Genome Biology in 2019 in the article "Frequent intra- and inter-species introgression shape the landscape of genetic variation in bread wheat", which performed whole-genome resequencing of 93 wheat accessions. The second data set was published by Zhou Yao et al. in Nature Genetics in 2020 in the article "Triticum population sequencing provides insights into wheat adaptation", which performed whole-genome resequencing of 414 wheat varieties. In this example the two data sets were merged and then used to predict small open reading frames.

S1. Extract candidate long open reading frames from the transcript information. Each candidate long open reading frame begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, has a sequence length that is an integer multiple of 3, and is longer than 900 bp.

S2. Test the candidate long open reading frames on the basis of the nucleotide diversity value of each site, and screen them according to the first screening condition. This yielded 13683 true long open reading frames.

S4. Extract all candidate open reading frames from the transcript sequences, and test and screen them according to the second screening condition. As shown in Figures 7 and 8, a total of 117140 candidate open reading frames were predicted to be true open reading frames, accounting for 87% of the total, with a precision as high as 95% and a composite score of 91%. (The 87% figure plays the role of the recall defined in Example 1, since 2 × 0.87 × 0.95 / (0.87 + 0.95) ≈ 0.91.)

As shown in Figure 9, testing and screening successfully predicted 5025 small open reading frames: 232 uORFs, 21 ouORFs, 234 dORFs, 129 odORFs, 3532 internal ORFs, 675 extend ORFs and 202 truncated ORFs. In Figure 10, the dotted line indicates the level of support that the protein mass spectrometry data give to known ORFs in the genome. Analysis of published protein mass spectrometry data shows that the predicted small open reading frames are all well supported.

The above are only preferred embodiments of the present invention. It should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications and environments, and can be modified within the scope of the concept described herein through the above teachings or the skill or knowledge of the relevant art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention all fall within the scope of protection of the appended claims.

Claims

1. A method for predicting open reading frames based on single nucleotide polymorphisms, characterized in that the method comprises the following steps:

S1, obtaining transcript information to be predicted, and extracting candidate long open reading frames;

S2, evaluating the pattern of single nucleotide polymorphism variation in the candidate long open reading frames to be predicted, and screening for true long open reading frames according to a preset first screening condition;

S3, counting the usage frequency of each codon in the true long open reading frames;

S4, extracting candidate open reading frames from the transcript information, evaluating the pattern of single nucleotide polymorphism variation and the codon usage frequency in the candidate open reading frames to be predicted, and taking the candidate open reading frames that meet a preset second screening condition as the prediction result;

wherein evaluating the pattern of single nucleotide polymorphism variation in the candidate long open reading frames to be predicted comprises:

S21, obtaining population variation data of a sample to be predicted, and calculating the nucleotide diversity value of each site in the candidate long open reading frame to be predicted;

S22, testing separately whether the nucleotide diversity value of base 3n in the candidate long open reading frame is greater than that of base 3n-2 and that of base 3n-1, wherein 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame, to obtain P1 and P2, and calculating the combined P value;

and wherein evaluating the pattern of single nucleotide polymorphism variation and the codon usage frequency in the candidate open reading frames to be predicted comprises:

S41, obtaining population variation data of a sample to be predicted, and calculating the nucleotide diversity value of each site in the candidate open reading frame to be predicted;

S42, testing separately whether the nucleotide diversity value of base 3n in the candidate open reading frame is greater than that of base 3n-2 and that of base 3n-1, wherein 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain P1' and P2'; testing separately whether the codon usage frequency of the triplets starting at base 3n-2 in the candidate open reading frame is higher than that of the triplets starting at base 3n-1 and that of the triplets starting at base 3n, to obtain P3' and P4'; and calculating the P' value obtained by combining the four values P1', P2', P3' and P4'.

2. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein the candidate long open reading frames and the candidate open reading frames are extracted on the following basis: each begins with the start codon AUG, ends with the stop codon UAG, UAA or UGA, and has a sequence length that is an integer multiple of 3.

3. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein the length of the candidate long open reading frame is greater than 900 bp, and the length of the candidate open reading frame is greater than 100 bp.

4. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein the first screening condition is that the P value is less than 0.0001.

5. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein the second screening condition is to control the false discovery rate (FDR) of the P' values that meet a preset requirement, the FDR being controlled to be less than or equal to 0.0001.

6. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 5, wherein the preset requirement is that the P' value is less than 0.05.

7. A system for predicting open reading frames based on single nucleotide polymorphisms, characterized by comprising a processor and a storage medium storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed, perform the method for predicting open reading frames of any one of claims 1-6.