JP2005135154A

JP2005135154A - Method for predicting gene ontology term based on sequence similarity

Info

Publication number: JP2005135154A
Application number: JP2003370572A
Authority: JP
Inventors: Junichi Uechi; 潤一上地; Koichi Kimura; 宏一木村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-10-30
Filing date: 2003-10-30
Publication date: 2005-05-26

Abstract

PROBLEM TO BE SOLVED: To provide a method for predicting gene ontology terms based on sequence similarity in which both the recall value and the precision value are sufficiently higher than in conventional methods.

SOLUTION: Using gene sequences whose gene ontology terms are known, gene ontology terms are predicted while the prediction conditions are varied, and the prediction accuracy is calculated for each condition. The optimal prediction conditions are searched for and determined. Gene ontology terms are then predicted using the optimal prediction conditions.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to the information analysis of gene sequences, and more particularly to a gene ontology term prediction method that estimates the gene ontology terms corresponding to a gene's function by means of sequence similarity search.

Conventionally, one method for predicting gene ontology terms that represent the characteristics of a gene sequence extracts the functional motifs contained in the gene sequence and selects the gene ontology terms corresponding to those motifs (Non-patent Document 1 below). As a method based on sequence similarity, there is a method that uses a gene sequence database annotated with gene ontology and adopts, directly as the prediction result, the gene ontology terms of gene sequences whose sequence similarity value exceeds a certain threshold (Non-patent Document 2 below).

However, the conventional methods are not designed with higher prediction accuracy in mind, and the thresholds they use are not values chosen to increase prediction accuracy. Prediction accuracy has two components, the proportion of predictions that are correct (the Precision value) and the proportion of known terms that are recovered (the Recall value), and a method in which both the Recall value and the Precision value are sufficiently high is desirable.

Non-patent Document 1: Apweiler, R., et al., Nucleic Acids Res., 29: 37-40, 2001
Non-patent Document 2: The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team, Nature, 420: 563-573, 2002

The problem to be solved by the present invention is to provide a method for gene ontology term prediction based on sequence similarity in which both the Recall value and the Precision value are sufficiently higher than in the conventional methods.

To make both the Recall value and the Precision value of gene ontology term prediction sufficiently high, the present invention performs gene ontology term prediction by a method comprising the following processing steps.

That is, the gene ontology term prediction method according to the present invention comprises: a step of determining conditions under which the prediction accuracy of gene ontology prediction is sufficiently high, using sequence similarity values between a first plurality of gene sequences to which gene ontology terms are assigned and a second plurality of gene sequences to which gene ontology terms are assigned; and a step of calculating sequence similarity values between a third gene sequence and a fourth plurality of gene sequences to which gene ontology terms are assigned, and assigning gene ontology terms to the third gene sequence according to those sequence similarity values and the conditions determined in the preceding step.

To prepare the first and second pluralities of gene sequences, it is preferable to start from a gene sequence database in which the characteristics of each gene are already known and gene ontology terms representing those characteristics are attached to each entry, and to divide this database in two using random numbers.

The gene ontology term prediction method according to the present invention is also characterized by comprising the following five steps.

First step: calculating the sequence similarity value between each of a first plurality of gene sequences to which gene ontology terms are assigned and each of a second plurality of gene sequences to which gene ontology terms are assigned.

Second step: for each of the first plurality of gene sequences, using the sequence similarity values calculated in the first step, selecting from the second plurality of gene sequences those whose sequence similarity value with the first gene sequence exceeds a first threshold, and predicting, as characteristics of the gene of the first gene sequence, those gene ontology terms whose appearance frequency among the gene ontology terms assigned to the selected gene sequences exceeds a second threshold.

Third step: repeating the second step while setting the first threshold and the second threshold to different values.

Fourth step: for each pair of first and second thresholds used in the second and third steps, obtaining the prediction accuracy of the prediction results produced with that pair; determining the first threshold and second threshold that correspond to a sufficiently high prediction accuracy; and taking that first threshold as the optimal sequence similarity threshold and that second threshold as the optimal term appearance frequency threshold.

Fifth step: obtaining the sequence similarity value between a third gene sequence and each of a fourth plurality of gene sequences to which gene ontology terms are assigned as gene characteristics; selecting the gene sequences whose sequence similarity value with the third gene sequence exceeds the optimal sequence similarity threshold; selecting, from the gene ontology terms assigned to the selected gene sequences, those whose appearance frequency exceeds the optimal term appearance frequency threshold; and predicting those gene ontology terms as characteristics of the gene of the third gene sequence.

The purpose of the first to fourth steps is to search, prior to the actual prediction, for a sequence similarity threshold and a term appearance frequency threshold that make the prediction accuracy sufficiently high. To allow the prediction accuracy to be calculated, the first plurality of sequences consists of sequences whose actual gene characteristics are known and to which gene ontology terms are assigned as those characteristics.

The gene ontology term prediction method according to the present invention is further characterized in that the first plurality of gene sequences, the second plurality of gene sequences, the third gene sequence, and the fourth plurality of gene sequences are nucleic acid base sequences or protein amino acid sequences.

The gene sequences used in the present invention are preferably all nucleic acid base sequences or all protein amino acid sequences.

The gene ontology prediction method according to the present invention is further characterized in that a variable obtained from an alignment between gene sequences, or a function of such a variable, is used as the sequence similarity value.

In the present invention, the sequence similarity value is preferably the E-value used by the sequence homology search tool BLAST, a value equivalent to the E-value used by a sequence homology search tool other than BLAST, the alignment identity, the alignment length, or the ratio of the alignment length to the full length of both aligned sequences.

The gene ontology term prediction method according to the present invention is further characterized in that appearance frequencies are calculated, and predictions made, not only for the gene ontology terms assigned in advance to the second plurality of gene sequences and the fourth plurality of gene sequences but also for each gene ontology term that is a superordinate concept of those terms.

The purpose of also predicting the gene ontology terms that are superordinate concepts of the terms assigned in advance to a gene sequence is to make superordinate terms predictable as well and thereby raise the Recall value of the prediction.
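The expansion to superordinate terms can be illustrated with a minimal sketch. It assumes the term hierarchy is available as a mapping from each term to its direct parents; the names `ancestors`, `expand_with_ancestors`, and `PARENTS`, and the toy hierarchy itself, are hypothetical and not taken from the patent.

```python
def ancestors(term, parents):
    """Collect every superordinate term reachable from `term`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def expand_with_ancestors(terms, parents):
    """Return the assigned terms plus all of their superordinate terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= ancestors(term, parents)
    return expanded


# Hypothetical hierarchy: child term -> list of direct parent terms.
PARENTS = {"B": ["A"], "C": ["B"], "D": ["A"]}
expanded = expand_with_ancestors({"C"}, PARENTS)
```

A sequence annotated only with the specific term C then also contributes to the counts of B and A, which is what lets the broader terms be predicted.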

The gene ontology term prediction method according to the present invention is further characterized in that, from among the gene sequences whose sequence similarity value with the first gene sequence exceeds the first threshold, the n gene sequences with relatively high sequence similarity values to the first gene sequence are selected and, where m is the total number of occurrences of a given gene ontology term among the gene ontology terms assigned to those n gene sequences, m/n is taken as the appearance frequency of that gene ontology term.
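As a rough illustration of the m/n appearance frequency, the sketch below takes hits that have already passed the first threshold, keeps the n most similar, and counts each term once per sequence. The function name, the data, and the convention that a larger score means a closer match are assumptions made for the example.

```python
from collections import Counter


def appearance_frequencies(hits, n):
    """hits: (similarity, terms) pairs that already passed the first threshold.

    A larger similarity is assumed to mean a closer match. If fewer than n
    hits exist, the frequency is taken over the hits that are available.
    """
    top = sorted(hits, key=lambda hit: hit[0], reverse=True)[:n]
    counts = Counter(term for _, terms in top for term in set(terms))
    return {term: m / len(top) for term, m in counts.items()}


# Hypothetical hits for one query sequence.
hits = [(90, {"A", "B"}), (80, {"A"}), (70, {"A", "C"}), (60, {"C"})]
freqs = appearance_frequencies(hits, n=3)
```

With n = 3, term A occurs on all three retained sequences (m/n = 3/3), while B and C each occur on one (m/n = 1/3).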

It is known that if the gene ontology terms selected by the first threshold (the threshold on the sequence similarity value) are used directly as the prediction result, many incorrect gene ontology terms are included and the Precision value falls. The appearance frequency is therefore calculated for each gene ontology term selected by the first threshold, and terms with a high appearance frequency are chosen, so as to pick out the gene ontology terms that are assigned to sequences with higher sequence similarity values and that also appear with higher frequency.

The gene ontology term prediction method according to the present invention is further characterized in that, when a first gene ontology term predicted for a gene sequence and a second gene ontology term assigned in advance to that gene sequence are identical, or when the first gene ontology term is a subordinate or superordinate concept of the second gene ontology term, the second gene ontology term is regarded as a correctly predicted gene ontology term.
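The correctness rule (identical, or related as superordinate and subordinate concept) might be checked along the following lines. This is a sketch under the assumption that the hierarchy is a child-to-parents mapping; `is_ancestor`, `is_match`, and the toy hierarchy are hypothetical names introduced only for illustration.

```python
def is_ancestor(upper, lower, parents):
    """True if `upper` is a superordinate concept of `lower`."""
    stack = [lower]
    seen = set()
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent == upper:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def is_match(predicted, known, parents):
    """Identical terms, or terms in an ancestor/descendant relation, match."""
    return (
        predicted == known
        or is_ancestor(predicted, known, parents)
        or is_ancestor(known, predicted, parents)
    )


# Hypothetical hierarchy: child term -> list of direct parent terms.
PARENTS = {"B": ["A"], "C": ["B"], "D": ["A"]}
```

Here B and D are siblings under A, so predicting D for a sequence annotated with B does not count as a match, whereas predicting A or C does.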

The gene ontology term prediction method according to the present invention is further characterized in that, where A is the total number of gene ontology terms assigned in advance to the first plurality of gene sequences, B is the total number of gene ontology terms predicted for the first plurality of gene sequences, C is the number of correctly predicted gene ontology terms among the predicted terms, R is the real number satisfying equation (1), and P is the real number satisfying equation (2), a function of R and P is used as the prediction accuracy.
R = C / A (1)
P = C / B (2)
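A small worked example of equations (1) and (2) is sketched below. For brevity it counts only exact matches as correct, which is a simplification of the matching rule in the text; the function name and the annotation data are hypothetical.

```python
def recall_and_precision(known, predicted):
    """known / predicted: {sequence name: set of GO terms}.

    Simplifying assumption: only exact matches count as correct.
    """
    a = sum(len(terms) for terms in known.values())        # A: known terms
    b = sum(len(terms) for terms in predicted.values())    # B: predicted terms
    c = sum(len(terms & known.get(name, set()))            # C: correct predictions
            for name, terms in predicted.items())
    recall = c / a if a else 0.0
    precision = c / b if b else 0.0
    return recall, precision


# Hypothetical annotations: A = 3, B = 4, C = 2.
known = {"s1": {"A", "B"}, "s2": {"C"}}
predicted = {"s1": {"A", "D", "E"}, "s2": {"C"}}
recall, precision = recall_and_precision(known, predicted)
```

With these counts R = 2/3 and P = 2/4, showing how a run can recover most known terms while still making a substantial share of wrong predictions.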

The real number R corresponds to the Recall value of the prediction and the real number P to its Precision value. The prediction accuracy is a function that takes a higher value the higher both R and P are. Searching for prediction conditions that give a higher prediction accuracy therefore amounts to searching for conditions under which both the Recall value and the Precision value are higher.

According to the present invention, gene ontology terms can be predicted with sufficiently high accuracy in gene ontology term prediction based on sequence similarity search.

Embodiments of the invention are described in detail below with reference to the drawings.

FIG. 1 shows the flow of processing in one embodiment of the present invention, a method for predicting the characteristics of a given gene sequence as gene ontology terms in which the optimal prediction conditions are obtained in advance so that prediction can be performed with high accuracy. In this embodiment, protein amino acid sequences are used as the gene sequences, and SWISS-PROT and TrEMBL, published on the Internet by the SIB (Swiss Institute of Bioinformatics), are used as the protein databases.

In FIG. 1, 101 denotes the gene sequences for which gene ontology terms (hereinafter abbreviated as GO terms) are to be predicted. 102 is gene sequence data in which the characteristics of each gene are known and the GO terms corresponding to those characteristics are assigned to the entries. 103 is a file that associates each entry of the gene sequence database 102 with its GO terms; the file published on the Internet by the SIB is used. 104 is a file in which the GO terms are structured as a tree according to the hierarchical relationships between concepts; the file published on the Internet by the Gene Ontology Consortium is used. The GO terms developed by the Gene Ontology Consortium fall into three categories, molecular function, biological process, and cellular component; here, a file consisting of molecular function terms is used as 104.

Next, file 103 is consulted to find the entries in 102 that have GO terms, and database 105, consisting of gene sequences that have GO terms, is created. To handle only GO terms assigned on the basis of experimental evidence, it is preferable to consult file 103 and exclude GO terms whose evidence code is IEA (Inferred from Electronic Annotation, that is, a prediction made by automated computer annotation).

Next, database 106 is divided into a test data set 107 and a learning data set 108. When five-fold cross-validation is performed, the test data set 107 and the learning data set 108 are created by dividing database 106 into five parts. The division is preferably made at random using random numbers. In step 109, the sequence similarity value between each gene sequence in the test data set 107 and each gene sequence in the learning data set 108 is calculated. The calculation uses BLAST, a program published on the Internet by the US NCBI (National Center for Biotechnology Information), or a similar program.

In 110, GO terms are predicted for each prediction condition while the prediction conditions are varied, and the results are output to file 111, which records the correspondence between gene sequences, known GO terms, and predicted GO terms. Step 110 selects, from the gene ontology terms assigned to the gene sequences of the learning data set 108, those that satisfy a given prediction condition. Not only the GO terms assigned in advance to 108 but also every GO term that is a superordinate concept of such a term is predicted if it satisfies the prediction condition. The GO terms that are superordinate concepts of a given GO term are selected by referring to the GO term tree structure file 104. For each prediction condition, file 111 obtained in step 110 records the correspondence between (1) each gene sequence, (2) the known GO terms assigned in advance to that gene sequence, and (3) the predicted GO terms.

File 111 is used in 112 to calculate the prediction accuracy and determine the optimal prediction condition. In this step the prediction accuracy is calculated for each prediction condition used in 110, and the prediction condition that maximizes the prediction accuracy (the optimal prediction condition) is determined.

The steps up to this point serve to obtain the optimal prediction condition. From here on, the optimal prediction condition is used to predict GO terms for the prediction target gene sequences 101. In step 113, the sequence similarity value between each of the gene sequences 101 subject to GO term prediction and each gene sequence in database 106 is calculated. In step 114, GO terms are assigned using the resulting sequence similarity values and the optimal prediction condition obtained in step 112, and the assigned GO terms are output to 115 as the prediction result. As in step 110, step 114 also uses the GO term tree structure file 104, so that superordinate terms that satisfy the optimal prediction condition are likewise output to 115 as prediction results.
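The random five-way division into test and learning sets could look like the following sketch. The function name `k_fold_splits`, the fixed seed, and the use of integer placeholders for database entries are assumptions for illustration only.

```python
import random


def k_fold_splits(entries, k=5, seed=0):
    """Yield (test_set, learning_set) pairs from a random k-way division."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)      # random division, reproducible
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        learning_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield test_set, learning_set


# Hypothetical database of 20 annotated entries, represented by integers.
splits = list(k_fold_splits(range(20)))
```

Each entry serves as a test sequence exactly once, and the remaining four fifths of the database act as the learning data set for that round.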

FIG. 2 shows the data structure of the data obtained from the result of the sequence similarity value calculation process 1 in 109, which compares the sequences of the test data set with those of the learning data set. 201 is the data corresponding to one gene sequence of the test data set, and the data as a whole consist of repetitions of this structure. 201 contains at least a name identifying the gene sequence of the test data set, a repeating structure of the GO terms 202 assigned in advance to that sequence, and a repeating structure of information 203 on the gene sequences of the learning data set that showed similarity to the test sequence. 202 is GO term information and contains the GO term and a GO term ID identifying it. 203 contains at least the sequence name of the learning data set gene sequence and a repeating structure of 204 for the GO terms assigned to that sequence. 204 is GO term information and contains at least the GO term or an ID identifying it.
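One way to picture the nested record of FIG. 2 is with plain data classes, as in the sketch below. The class and field names are hypothetical, the term IDs are placeholders, and the `similarity` field is an assumption: the text lists only names and GO terms as the minimum content, but the later threshold filtering needs a similarity value per hit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GOTermInfo:                 # corresponds to 202 and 204
    term_id: str
    name: str = ""


@dataclass
class LearningHit:                # corresponds to 203
    sequence_name: str
    similarity: float             # assumed field, needed for thresholding
    terms: List[GOTermInfo] = field(default_factory=list)


@dataclass
class QueryRecord:                # corresponds to 201
    sequence_name: str
    known_terms: List[GOTermInfo] = field(default_factory=list)
    hits: List[LearningHit] = field(default_factory=list)


# Hypothetical record for one test sequence with one similar learning sequence.
record = QueryRecord(
    sequence_name="test_seq_1",
    known_terms=[GOTermInfo("GO:X1", "term X1")],
    hits=[LearningHit("learn_seq_7", 1e-30, [GOTermInfo("GO:X1")])],
)
```

The whole result of step 109 would then be a list of such records, one per test sequence.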

FIG. 3 is a flowchart explaining the GO term assignment process in 110. The loop with the termination test 301 applies the following processing to every combination of sequence similarity threshold and term appearance frequency threshold. In 302, a combination of sequence similarity threshold and term appearance frequency threshold is set. The loop with the termination test 303 applies the following processing to every gene sequence of the test data set. The information shown in 201 is read for the test data set gene sequence being processed in 303. This information contains multiple records, shown in 203, for the learning data set gene sequences that showed sequence similarity. In 305, records among the data 203 whose sequence similarity value does not exceed the sequence similarity threshold are deleted. When the BLAST E-value is used as the sequence similarity value, however, a smaller value indicates higher similarity, so records are deleted when the E-value is equal to or greater than the threshold. In 306, the appearance frequency is calculated for each of the GO terms 204 held by the data 203. In 307, GO terms whose appearance frequency is equal to or greater than the appearance frequency threshold are selected. In 308, the selected GO terms are taken as the predicted GO terms, and the test data set sequence name, the information on the predicted GO terms, and the information on the GO terms 202 are associated with one another and output to file 111.
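The nested loops of FIG. 3 can be sketched as a sweep over threshold pairs. This is an illustrative simplification, not the patent's implementation: the similarity score is assumed to be one where larger means more similar (for E-values the comparison would be reversed), frequencies are taken over all retained hits, and every name and value is hypothetical.

```python
from collections import Counter
from itertools import product


def predict_terms(hits, sim_threshold, freq_threshold):
    """hits: (similarity, terms) pairs for one test sequence."""
    kept = [terms for sim, terms in hits if sim > sim_threshold]       # 305
    if not kept:
        return set()
    counts = Counter(term for terms in kept for term in set(terms))    # 306
    return {t for t, m in counts.items() if m / len(kept) >= freq_threshold}  # 307


def sweep(test_hits, sim_thresholds, freq_thresholds):
    """Predict for every threshold pair (301, 302) and every test sequence (303)."""
    results = {}
    for sim_t, freq_t in product(sim_thresholds, freq_thresholds):
        results[(sim_t, freq_t)] = {
            name: predict_terms(hits, sim_t, freq_t)                   # 308
            for name, hits in test_hits.items()
        }
    return results


# Hypothetical similarity results for a single test sequence.
test_hits = {"q1": [(95, {"A", "B"}), (85, {"A"}), (40, {"C"})]}
results = sweep(test_hits, sim_thresholds=[50, 90], freq_thresholds=[0.5, 1.0])
```

The dictionary keyed by threshold pair plays the role of the per-condition output, so each condition can later be scored independently.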

FIG. 4 shows the data structure of file 111. 401 is the data corresponding to one pair of sequence similarity threshold and term appearance frequency threshold set in 302, and the data as a whole consist of repetitions of this structure. 401 contains at least the sequence similarity threshold and the term appearance frequency threshold set in 302 and a repeating structure of data 402 covering every sequence of the test data set. 402 contains at least the sequence name from the test data set, the data of the repeating structure 403 for the GO terms predicted in 110 for the sequence of 402, and the data of the repeating structure 404 for the GO terms assigned in advance to the sequence of 402. If no GO term could be predicted for a test data set sequence, 402 does not contain the data 403. 403 and 404 contain the GO term, the GO term ID, or both.

Next, the individual operations within step 110 are described using a concrete example. These operations handle one set of data 201, which has already been processed by step 305. For each learning data set gene sequence contained in this data 201, the GO terms held by the gene and the sequence similarity value are extracted to create the table shown in FIG. 5. For each GO term in this table, all terms that are superordinate concepts of it (parent terms) are selected and added to the table of FIG. 5, giving the table shown in FIG. 6. Parent terms are selected by referring to the GO term tree structure file 104. File 104 describes the hierarchical relationships between GO terms as a tree structure like that shown in FIG. 7. In FIG. 7 each arrow points toward the subordinate term, so term B (702) is a subordinate term (child term) of term A (701).

Next, referring to a table like that of FIG. 6, GO terms that occur frequently in gene sequences with relatively high sequence similarity (small E-values) are selected. To do so, attention is directed to the group of the top N sequences in descending order of sequence similarity (ascending order of E-value), and GO terms whose appearance frequency within that group exceeds the appearance frequency threshold are selected. For example, in the group of the top three sequences of 701, GO term B appears with a frequency of 67% (two of three sequences). The appearance frequency of each term is calculated in this way for every value of N from 1 to n (where n is the number of gene sequences selected by the sequence similarity threshold), giving a table like that of FIG. 8. In step 307, GO terms whose appearance frequency in this table is below the term appearance frequency threshold are deleted, and the remaining GO terms are selected as the predicted GO terms. With a term appearance frequency threshold of 70%, GO terms A, B, D, and E are selected as predicted GO terms according to the table of FIG. 8. Taking the appearance frequency into account in this way means that even a term such as term E, which is not assigned to the sequence with the highest sequence similarity (lowest E-value), can be selected as a predicted GO term if its overall appearance frequency is high. The predicted GO terms chosen in this way are converted into data with the structure shown in 402 and output; appending this output through the loops 303 and 301 produces file 111, which has the overall data structure of FIG. 4.
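The top-N frequency table can be sketched as follows. The data are invented and do not reproduce FIGS. 5 to 8, and the selection rule is one reading of the description: a term is assumed to be kept when its frequency reaches the threshold for at least one N. Parent terms are assumed to have been added to each sequence's term set beforehand.

```python
def top_n_table(hits):
    """hits: (e_value, terms) pairs that passed the similarity threshold.

    Returns {term: [frequency in top 1, frequency in top 2, ..., top n]}.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])   # smaller E-value ranks first
    all_terms = set().union(*(terms for _, terms in ranked))
    table = {term: [] for term in all_terms}
    for n in range(1, len(ranked) + 1):
        group = ranked[:n]
        for term in all_terms:
            m = sum(1 for _, terms in group if term in terms)
            table[term].append(m / n)
    return table


def select_terms(table, freq_threshold):
    """Assumed rule: keep a term if any top-N frequency reaches the threshold."""
    return {term for term, freqs in table.items() if max(freqs) >= freq_threshold}


# Hypothetical hits (not the data of the figures).
hits = [
    (1e-50, {"A", "B"}),
    (1e-40, {"A", "B", "E"}),
    (1e-30, {"A", "E"}),
    (1e-20, {"A", "E", "C"}),
]
table = top_n_table(hits)
selected = select_terms(table, 0.7)
```

In this toy data, B has a frequency of 2/3 in the top three sequences, and E is absent from the best hit yet reaches 3/4 in the top four, so at a 70% threshold A, B, and E are selected while C is dropped.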

Next, the processing in 112, which calculates the prediction accuracy and determines the optimal prediction conditions, is described in detail. All information used in this step is obtained from the data in file 111, which has the data structure shown in FIG. 4. A Recall value and a Precision value are calculated for each combination of sequence similarity threshold and term appearance frequency threshold contained in this file. The Recall value is the ratio of the number of correctly predicted GO terms to the total number of GO terms held by all sequences in the test data set; that is, the fraction of the GO terms to be predicted that were recovered without omission. The Precision value is the ratio of the number of correctly predicted GO terms to the total number of predicted GO terms, and represents the rate of correct predictions. Whether a GO term has been correctly predicted is judged as follows. If a predicted GO term A and a GO term B pre-assigned to the test data set sequence are the same GO term, GO term A is regarded as correctly predicted. Even if GO term A and GO term B are not identical, GO term B is regarded as correctly predicted when one is a superordinate or subordinate concept of the other. Next, the F-measure value is calculated from the Recall value and the Precision value by
(F-measure value) = 2 × (Recall value) × (Precision value) / ((Recall value) + (Precision value))
and this value is used as the prediction accuracy. The F-measure value is an evaluation measure that becomes larger as the Recall value and the Precision value both become higher. If another evaluation measure exists that likewise becomes larger as the Recall value and the Precision value both become higher, that measure may be used as the prediction accuracy instead of the F-measure value.
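The evaluation described above can be sketched in Python as follows. This is only an illustrative reading of the text, not the applicant's implementation: the names `ancestors`, `is_correct`, `evaluate`, and the `parents` mapping (each GO term to its direct superordinate terms) are hypothetical, and the count of correct predictions is taken on the predicted side, following the definition of C in the claims.

```python
def ancestors(term, parents):
    """Return every superordinate term of `term` (hypothetical helper)."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def is_correct(predicted_term, known_terms, parents):
    """A prediction counts as correct if it equals a known term or is
    a superordinate or subordinate concept of one."""
    up = ancestors(predicted_term, parents)
    return any(
        predicted_term == k or k in up or predicted_term in ancestors(k, parents)
        for k in known_terms
    )


def f_measure(recall, precision):
    """Harmonic mean of recall and precision; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def evaluate(predicted, known, parents):
    """predicted, known: dict mapping sequence id -> set of GO terms."""
    a = sum(len(terms) for terms in known.values())      # pre-assigned terms
    b = sum(len(terms) for terms in predicted.values())  # predicted terms
    c = sum(                                             # correct predictions
        is_correct(t, known.get(seq, set()), parents)
        for seq, terms in predicted.items()
        for t in terms
    )
    recall = c / a if a else 0.0
    precision = c / b if b else 0.0
    return recall, precision, f_measure(recall, precision)
```

Because a term related to a known term through the hierarchy also counts as correct, the result depends on how the GO tree structure file is loaded into `parents`; that loading step is outside this sketch.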

The Recall, Precision, and F-measure values described above are calculated for each combination of sequence similarity threshold and term appearance frequency threshold, yielding a table such as the one shown in FIG. 9. From this table, the sequence similarity value and the term appearance frequency threshold that maximize the F-measure value are found. These are taken as the optimal sequence similarity value and the optimal term appearance frequency threshold, and the combination of the two thresholds is taken as the optimal prediction condition. When the database of 105 is divided into five parts and five cross-validation tests are performed, five sequence similarity values and five term appearance frequency thresholds that maximize the F-measure value are obtained, so it is preferable to use the averages of these five values as the optimal sequence similarity value and the optimal term appearance frequency threshold.
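A minimal Python sketch of this selection step is given below. It assumes that each cross-validation test yields a table, represented here as a dictionary keyed by a hypothetical `(similarity_threshold, frequency_threshold)` pair and holding the F-measure value; the function names are illustrative and do not appear in the source.

```python
from statistics import mean


def best_condition(table):
    """Return the (similarity, frequency) threshold pair with the largest F-measure."""
    return max(table, key=table.get)


def optimal_condition(fold_tables):
    """Average the per-fold best thresholds, as the text recommends for
    five-fold cross-validation."""
    bests = [best_condition(table) for table in fold_tables]
    return mean(b[0] for b in bests), mean(b[1] for b in bests)
```

The text specifies only that the five values are averaged; the arithmetic mean used here is an assumption, and a different average might be considered when the similarity thresholds span many orders of magnitude.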

Next, the processing in 113 to 115, which predicts GO terms for the gene sequences 101 using the optimal sequence similarity value and the optimal term appearance frequency threshold, is described. This procedure is basically the same as the procedure in which the processing in 109 to 110 outputs file 111. However, the GO term prediction target gene sequences 101 are used in place of the learning data set 108 used in the processing of 109, and the database of 105 is used in place of the test data set 107. Furthermore, because prediction with the optimal sequence similarity value and the optimal term appearance frequency threshold is performed only once, iterative processing such as 201 is not performed. Through this processing, GO terms are predicted for each sequence of 101, and the results are output to file 115.
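The prediction step for a single target sequence might look like the following Python sketch. It rests on several assumptions that the text does not fix: similarity values are treated as E-value-like scores where smaller means more similar (so "exceeding" the threshold is implemented as `<=`), the frequency threshold is a fraction between 0 and 1, and the names `predict_terms`, `hits`, and `annotations` are hypothetical.

```python
from collections import Counter


def predict_terms(hits, annotations, sim_threshold, freq_threshold):
    """Predict GO terms for one target sequence.

    hits: list of (database_sequence_id, e_value) pairs for the target.
    annotations: dict mapping database sequence id -> set of GO terms.
    """
    # Keep only database sequences that are similar enough to the target.
    selected = [seq for seq, e_value in hits if e_value <= sim_threshold]
    if not selected:
        return set()
    # Count how many of the selected sequences carry each term.
    counts = Counter(t for seq in selected for t in annotations.get(seq, ()))
    n = len(selected)
    # Keep terms whose appearance frequency exceeds the threshold.
    return {term for term, m in counts.items() if m / n > freq_threshold}
```

Adding superordinate terms before counting, or restricting the count to the top-ranked hits, would be applied to `selected` and `annotations` before the frequency step; both are omitted here for brevity.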

Using the method described above, the optimal sequence similarity threshold and the optimal term appearance frequency threshold were determined. As the gene sequence data 102 with known gene ontology terms, protein amino acid sequences from SWISS-PROT, a public protein database, were used. The step of 105 yielded 7830 sequences. These sequences were divided into five groups of 1560 sequences each, and five cross-validation tests were performed. The sequence similarity threshold was varied stepwise from 1 to 1E-80, and the term appearance frequency threshold was varied stepwise from 0% to 100%. The Recall value, Precision value, and F-measure value were calculated for each of these sequence similarity thresholds and term appearance frequency thresholds. Because five cross-validation tests were performed, the averages over the five tests were used for the Recall value and the Precision value. As a result, the F-measure value reached its maximum of 0.63 at a sequence similarity threshold of 0.01 and a term appearance frequency threshold of 60%. The Recall value and the Precision value under these conditions were 0.55 and 0.75, respectively. For comparison, prediction was also performed using a sequence similarity threshold of 1E-10, which is often used in gene function prediction, without selection by term appearance frequency. This gave a Recall value of 0.60, a Precision value of 0.39, and an F-measure value of 0.47. In terms of the F-measure value, the prediction accuracy of the present method therefore exceeded that of the comparison method, demonstrating the effectiveness of the present method.
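As a check on the reported figures, the two F-measure values can be recomputed from the stated Recall and Precision values with the F-measure formula given earlier. The symbols $R$, $P$, and $F$ are shorthand introduced here for the Recall, Precision, and F-measure values.

```latex
F_{\text{proposed}} = \frac{2 \times 0.55 \times 0.75}{0.55 + 0.75}
                    = \frac{0.825}{1.30} \approx 0.63,
\qquad
F_{\text{comparison}} = \frac{2 \times 0.60 \times 0.39}{0.60 + 0.39}
                      = \frac{0.468}{0.99} \approx 0.47
```

Both results agree with the values stated in the text. The comparison method has slightly higher Recall, so the gain in F-measure comes from the large improvement in Precision.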

The procedure of the present invention is described below with reference to FIG. 10.
<Procedure 1>
GO-term-known gene sequence data 1011 and gene sequence-GO term correspondence data 1012 are input, and program 1005, which builds a database of GO-term-known gene sequences, outputs the GO-term-known gene sequence database 1014.

<Procedure 2>
Program 1006, which builds a database of GO-term-known gene sequences, takes the GO-term-known gene sequence database 1014 as input and outputs the learning data set 1015 and the test data set 1016.

<Procedure 3>
Program 1006, which obtains sequence similarity values between the gene sequences of the learning data set and the gene sequences of the test data set, takes the learning data set 1015 and the test data set 1016 as input and outputs the sequence similarity value data 1017 between the gene sequences of the learning data set and the gene sequences of the test data set.

<Procedure 4>
Program 1007, which predicts GO terms for the gene sequences of the learning data set under a plurality of prediction conditions, takes as input the sequence similarity value data 1017 between the gene sequences of the learning data set and the gene sequences of the test data set, together with the GO term tree structure file 1013, and outputs data 1018, which associates the known GO terms of the gene sequences of the learning data set with the GO terms predicted under each prediction condition.

<Procedure 5>
Program 1008, which calculates the prediction accuracy under each prediction condition and determines the optimal prediction condition, takes as input data 1018, which associates the known GO terms of the gene sequences of the learning data set with the GO terms predicted under each prediction condition, and determines the optimal prediction condition.

<Procedure 6>
Program 1009, which obtains sequence similarity values between the GO term prediction target gene sequences and the GO-term-known gene sequences, takes as input the GO term prediction target gene sequences 1019 and the GO-term-known gene sequence database 1014, and outputs the sequence similarity value data 1020 between the GO term prediction target gene sequences and the GO-term-known gene sequences.

<Procedure 7>
Program 1010, which predicts GO terms for the GO term prediction target gene sequences based on the optimal prediction condition, uses the sequence similarity value data 1020 between the GO term prediction target gene sequences and the GO-term-known gene sequences, together with the GO term tree structure file 1013, and outputs the GO term prediction result 1021.

<Procedure 8>
Using the display 1002 and the pointing device 1003, the GO term prediction result 1021 is output on the display 1002.

The procedure is as described above. The data that must be stored in the auxiliary storage device in advance are at least the GO-term-known gene sequence data 1011, the gene sequence-GO term correspondence data 1012, the GO term tree structure file 1013, and the GO term prediction target gene sequences 1019. The remaining data in the auxiliary storage device can be generated in the course of the computation.
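As an illustration of Procedure 2 combined with the five-fold cross-validation described earlier, the following Python sketch splits a list of sequence identifiers into learning and test data sets. The source does not state how sequences are assigned to the five groups; the interleaved assignment used here and the name `five_fold_splits` are assumptions made for the example.

```python
def five_fold_splits(seq_ids, k=5):
    """Yield (learning_set, test_set) pairs, one per cross-validation test.

    Each sequence appears in exactly one test set; the other k - 1 groups
    form the learning set for that test.
    """
    folds = [seq_ids[i::k] for i in range(k)]  # interleaved assignment (assumed)
    for i in range(k):
        test = folds[i]
        learning = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield learning, test
```

Each yielded pair would then be passed through Procedures 3 to 5, with the per-fold results combined when the optimal prediction condition is chosen.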

FIG. 1: Flowchart explaining the overall flow of the gene ontology term prediction method of the present invention.
FIG. 2: Data structure of the data obtained from the results of the sequence similarity value calculation.
FIG. 3: Flowchart explaining the flow of the gene ontology term assignment processing.
FIG. 4: Data structure of the data used to calculate the prediction accuracy.
FIG. 5: Table showing data on the sequences selected by the sequence similarity threshold.
FIG. 6: Table showing data in which superordinate gene ontology terms have been added to the gene ontology terms of the sequences selected by the sequence similarity threshold.
FIG. 7: Explanatory diagram showing the tree structure of the gene ontology.
FIG. 8: Table showing data obtained by the search for the optimal prediction condition.
FIG. 9: Table of the Recall, Precision, and F-measure values calculated for each sequence similarity threshold and term appearance frequency threshold.
FIG. 10: System configuration diagram explaining the procedure of the present invention.

Explanation of symbols

101: gene sequence data to be predicted, 102: gene sequence data with known gene ontology terms, 103: file associating the names of gene sequences with known gene ontology terms with their gene ontology terms, 104: gene ontology term tree structure file, 105: processing that creates a database of gene sequences with known gene ontology terms for the sequence similarity value calculation, 106: database of gene sequences with known gene ontology terms used for the sequence similarity value calculation, 107: test data set of gene sequences used to search for the optimal prediction condition, 108: learning data set of gene sequences used to search for the optimal prediction condition, 109: sequence similarity value calculation processing, 110: gene ontology term assignment processing, 111: file obtained by the gene ontology assignment processing, 112: processing that calculates the prediction accuracy and determines the optimal prediction condition, 113: processing that calculates sequence similarity values between the gene ontology prediction target sequences and the sequences of the database of gene sequences with known gene ontology terms, 114: processing that assigns gene ontology terms to the gene ontology term prediction target gene sequences, 115: gene ontology term prediction result data file.
202: data on the gene ontology terms pre-assigned to the sequences of the test data set, 203: data on the learning data set sequences found to have sequence similarity with the test data set, 204: data on the gene ontology terms pre-assigned to the learning data set sequences found to have sequence similarity with the test data set.
403: data on the gene ontology terms predicted for the sequences of the test data set, 404: data on the gene ontology terms pre-assigned to the sequences of the test data set.
601: group of the top three sequences.
701-702: gene ontology terms contained in the tree structure of the gene ontology.

Claims

1. A gene ontology term prediction method comprising:
a step of determining, using sequence similarity values between a first plurality of gene sequences to which gene ontology terms are assigned and a second plurality of gene sequences to which gene ontology terms are assigned, a condition under which the prediction accuracy of gene ontology term prediction becomes sufficiently high; and
a step of calculating sequence similarity values between a third gene sequence and a fourth plurality of gene sequences to which gene ontology terms are assigned, and assigning a gene ontology term to the third gene sequence according to the sequence similarity values and the condition determined in the preceding step.

2. A gene ontology term prediction method comprising:
a first step of calculating a sequence similarity value between each of a first plurality of gene sequences to which gene ontology terms are assigned and each of a second plurality of gene sequences to which gene ontology terms are assigned;
a second step of, using the sequence similarity values calculated in the first step, selecting, for each first gene sequence of the first plurality, the gene sequences of the second plurality whose sequence similarity value with the first gene sequence exceeds a first threshold, and predicting, as a gene characteristic of the first gene sequence, each gene ontology term whose appearance frequency among the gene ontology terms assigned to the selected gene sequences exceeds a second threshold;
a third step of repeating the second step while setting the first threshold and the second threshold to different values;
a fourth step of obtaining a prediction accuracy for the prediction result of each combination of the first threshold and the second threshold used in the second and third steps, and, for the combination corresponding to a sufficiently high prediction accuracy, taking the first threshold as an optimal sequence similarity threshold and the second threshold as an optimal term appearance frequency threshold; and
a step of obtaining a sequence similarity value between a third gene sequence and each of a fourth plurality of gene sequences to which gene ontology terms are assigned as gene characteristics, selecting the gene sequences whose sequence similarity value with the third gene sequence exceeds the optimal sequence similarity threshold, selecting, from among the gene ontology terms assigned to the selected gene sequences, each gene ontology term whose appearance frequency exceeds the optimal term appearance frequency threshold, and predicting the selected gene ontology term as a gene characteristic of the third gene sequence.

3. The gene ontology term prediction method according to claim 2, wherein the first plurality of gene sequences, the second plurality of gene sequences, the third gene sequence, and the fourth plurality of gene sequences are nucleobase sequences or protein amino acid sequences.

4. The gene ontology term prediction method according to any one of claims 2 to 3, wherein a variable obtained from an alignment between gene sequences, or a function of that variable, is used as the sequence similarity value.

5. The gene ontology term prediction method according to any one of claims 2 to 4, wherein the appearance frequency is calculated not only for the gene ontology terms pre-assigned to the second plurality of gene sequences and the fourth plurality of gene sequences but also for each gene ontology term that is a superordinate concept of those gene ontology terms, and these terms are treated as prediction targets.

6. The gene ontology term prediction method according to any one of claims 2 to 5, wherein, from among the plurality of gene sequences whose sequence similarity value with the first gene sequence exceeds the first threshold, n gene sequences having relatively high sequence similarity values with the first gene sequence are selected, and, where m is the total number of occurrences of a given gene ontology term among the gene ontology terms assigned to the n gene sequences, m / n is used as the appearance frequency of that gene ontology term.

7. The gene ontology term prediction method according to any one of claims 2 to 6, wherein, when a first gene ontology term predicted for a gene sequence is identical to a second gene ontology term pre-assigned to that gene sequence, or when the first gene ontology term is a subordinate concept or a superordinate concept of the second gene ontology term, the second gene ontology term is regarded as a correctly predicted gene ontology term.

8. The gene ontology term prediction method according to any one of claims 2 to 7, wherein, where A is the total number of gene ontology terms pre-assigned to the first plurality of gene sequences, B is the total number of gene ontology terms predicted for the first plurality of gene sequences, C is the total number of correctly predicted gene ontology terms among the predicted gene ontology terms, R is the real number satisfying equation (1) below, and P is the real number satisfying equation (2) below, a function of the real number R and the real number P is used as the prediction accuracy.
R = C / A (1)
P = C / B (2)