JP2016502162A

JP2016502162A - Database-driven primary analysis of raw sequencing data

Info

Publication number: JP2016502162A
Application number: JP2015536149A
Authority: JP
Inventors: Laurent Gautier; Ole Lund
Original assignee: Danmarks Tekniske Universitet
Current assignee: Danmarks Tekniske Universitet
Priority date: 2012-10-15
Filing date: 2013-10-11
Publication date: 2016-01-21
Also published as: US20150294065A1; WO2014060305A1; CN104919466A; EP2915084A1

Abstract

The present invention relates to a method for identifying the source of a sample containing biological sequences from raw sequencing reads. The method can be used to identify the source of unknown DNA and can be applied in diagnostics, bioterrorism defense, food safety and quality, and hygiene. In another aspect, the invention relates to a database of reference sequences that can be used in the method of the invention. The method relies on a pre-indexed collection of reference sequences, a system for scoring an incoming query set of biological sequences, such as reads from a sequencing instrument, and a system for submitting parts of the query set.

Description

The present invention relates to a method for identifying the likely source of a biological sequence. In a further aspect, the invention relates to a database adapted for use for this purpose.

DNA sequencing is an experimental process that identifies the sequence of bases (A, T, C or G). At present, no technology can sequence an entire DNA molecule beyond a few thousand bases, and most technologies sequence between 100 and 200 bases. A bacterial genome can easily contain several million bases. Over the past few years the cost of sequencing has dropped sharply, making large-scale sequencing of DNA from samples increasingly common for purposes such as human health, food quality control, or the study of microbial communities. It can be envisaged that whole human genome sequencing will be used more frequently in therapy in order to individualize treatment as far as possible, and that routine sequencing will be performed to monitor the presence or absence of specific organisms. Rapid identification of the likely DNA of origin, whether as an end goal in itself, as a stepping stone toward more complex data analysis, or as a quality control step for sequencing data before undertaking a more costly analysis, is quickly becoming a necessity.

Primary analysis consists of making sense of the relatively short sequences obtained from sequencing (called short reads), either by aligning the sequences against a reference genome (which requires that the sequence of a reference species is known) or by attempting to reconstruct the jigsaw puzzle without a model (so-called de novo assembly of the sequencing tags; identifying the content of an unknown sample would then require an additional step). Alignment against a reference is considered a computationally easier task than de novo assembly.

Before non-specific or whole-genome sequencing became affordable, specific regions were first painstakingly sequenced and assembled, and putative regions of interest were then identified. The simplest approach is the search for open reading frames (ORFs), found as the interval delimited by a start codon for the translation of RNA into protein (ATG/AUG) and one of the stop codons that terminate translation (TAG/UAG, TAA/UAA, TGA/UGA). The ORFs were then aligned against a list of all known genes. Methods for alignment include algorithms and programs such as the Smith-Waterman algorithm, the BLAST algorithm and program, SSAHA and BLAT. Their goal is to find the optimal alignment in a database of indexed sequences and, by ranking the scores of all alignments, to find the best match and hence the most likely function of the query sequence. The growing number of similar matches with different biological functions led to an extension of this principle: building "groups of best-matching genes", or clusters of orthologous genes (COGs), for the purpose of functional annotation. As complete genomes gradually became available, the MUMmer algorithm was designed to align pairs of complete genomes and to visualize how overall genome structure compares between genetically related species.

Because of the number of sequences now available in databases, aligning a new sequence against the huge pool of known sequences can take a relatively long time. BLAST was a breakthrough in that it accelerated earlier algorithms while still finding nearly optimal results. However, in an age when web-based search engines return results almost instantly, a search against all known sequences remains relatively slow.

Ning et al. 2001 (Genome Research 11:1725-1729) describe SSAHA (sequence search and alignment by hashing algorithm), an algorithm for performing fast alignment on databases containing several gigabases of DNA. SSAHA is an aligner; for each complete query sequence, its task is therefore to report where and how well the query matches each entry in a collection of reference sequences. The SSAHA method aims to find as many matches as possible over the full length of the query sequence. Sequences in the database are preprocessed by cutting them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. A query sequence is searched in the database by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then sorting the results. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism detection and very large-scale sequence assembly. In SSAHA, the presence and the position of each k-tuple are stored in the same lookup structure, which is loaded into the memory of the computer system.
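The hashing idea described above can be illustrated with a minimal sketch. This is not the SSAHA implementation: the function names, the toy sequences, and the simple "most common shift" vote that stands in for the sorting step are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_hash_table(sequences, k):
    """Map each consecutive (non-overlapping) k-tuple to its (sequence index, offset) occurrences."""
    table = defaultdict(list)
    for index, sequence in enumerate(sequences):
        for offset in range(0, len(sequence) - k + 1, k):
            table[sequence[offset:offset + k]].append((index, offset))
    return table


def best_hit(table, query, k):
    """Return the (sequence index, shift) supported by the most k-tuple hits, or None."""
    votes = Counter()
    # Every base offset of the query is tried, because the query may not
    # start on a k-tuple boundary of the database sequence.
    for query_offset in range(len(query) - k + 1):
        for index, offset in table.get(query[query_offset:query_offset + k], ()):
            votes[(index, offset - query_offset)] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy database; the query is a fragment of the first sequence.
database = ["ACGTTGCAAGGCTTAACCGG", "TTTTGGGGCCCCAAAATTTT"]
table = build_hash_table(database, k=4)
hit = best_hit(table, "TGCAAGGCTTAA", k=4)
```

Here a single structure holds both the presence and the positions of each k-tuple, which is the property the paragraph above attributes to SSAHA.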

Known mapping or alignment algorithms and programs include methods such as Erland, Corona, BFAST, Bowtie, BWA and NovoAlign. Their goal is to find the position of a read in a known reference. By extension, reads for which no match can be found can be flagged as not originating from that sequence. These programs and algorithms also suffer from long search times, both because they evaluate every sequence in the query set, that is, every sequencing read, and because for each of them they attempt to find the optimal placement, which is often simply called alignment when working with short reads. Interestingly, all of the programs mentioned above use heuristics that trade exactness for speed, and therefore differ in the results they find.

US2006286566 discloses a method that uses k-mers to detect mutations. The method comprises detecting an apparent mutation in a target nucleic acid sequence by comparing a portion of the target nucleic acid sequence with a second sequence segment to detect a match to that portion.

US2012000411 discloses systems and methods capable of characterizing the population of organisms in a sample, based on matching short strings of sequence information to identify genomes from a reference genome database. That patent application does not disclose a method in which the presence of a short string is looked up in one collection of short strings from the reference sequences and its position is looked up in a separate collection of positions in the reference sequences.

US Patent Application Publication No. 2006/286566
US Patent Application Publication No. 2012/000411

Ning et al. 2001, Genome Research 11:1725-1729

The present invention provides a novel method for identifying the source of raw sequences, such as DNA reads (or short reads) obtained from a sequencing instrument, or protein sequences obtained from N- or C-terminal sequencing or from mass spectrometry. The method relies on a pre-indexed collection of reference sequences, a system for scoring an incoming query set of biological sequences, such as reads from a sequencing instrument, and a system for submitting parts of the query set. This can be done using a client-server approach, in which the server entity holds the collection of references and performs the scoring, while the client submits a subset of the query sequences.

The approach provided by the invention allows rapid determination of the different sources of DNA found in a sample, and does not rely on knowledge of the complete sequence of a given gene in either the source sequence or the reference sequence.

Although a short read does not represent the complete reference from which it originates, it retains a characteristic signal of that reference. Short reads can be further decomposed into subsequences (called k-mers or k-tuples), and these k-mers are looked up in a collection of indexed k-mers in order to identify the source of the raw sequencing data.

In a first aspect, the invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences or short reads from the source;
b) fragmenting the sequences from the subset into k-mers;
c) querying the k-mers from the subset against a database comprising k-mers of reference sequences;
d) determining which references contain the k-mers; and
e) returning a description of the likely source references.
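Steps a) to e) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the in-memory dictionary standing in for the database, the plain hit count used as a score, and all names and toy sequences are hypothetical.

```python
import random
from collections import Counter


def kmers(sequence, k):
    """All overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def index_references(references, k):
    """Toy stand-in for the database: k-mer -> names of the references containing it."""
    index = {}
    for name, sequence in references.items():
        for kmer in kmers(sequence, k):
            index.setdefault(kmer, set()).add(name)
    return index


def identify_source(reads, index, k, sample_size, seed=0):
    rng = random.Random(seed)
    subset = rng.sample(reads, min(sample_size, len(reads)))  # step a
    hits = Counter()
    for read in subset:
        for kmer in kmers(read, k):                            # step b
            for name in index.get(kmer, ()):                   # steps c and d
                hits[name] += 1
    return hits.most_common()                                  # step e


# Hypothetical references; the reads are fragments of ref_A.
references = {"ref_A": "ACGTACGGTCAGTTCAGGAC", "ref_B": "TTGACCATGGCATTAGCCAA"}
ref_a = references["ref_A"]
reads = [ref_a[0:10], ref_a[5:15], ref_a[10:20]]
index = index_references(references, k=5)
ranking = identify_source(reads, index, k=5, sample_size=2)
```

Only two of the three reads are examined, yet the ranking already points to `ref_A`, which is the point of working on a subset.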

The method has several advantages over traditional alignment and mapping algorithms, which focus on aligning the complete query set and therefore require transferring the entire set of sequences from an input device (such as a client) to a database and scoring unit (such as a server) capable of performing the alignment. In the present invention, only a subset of the sequences is fragmented and queried, which minimizes the need for data transfer. The transferred subset can be, for example and without limitation, a random subset of fixed size, a filtered subset, the result of adaptive sampling, the result of an iterative synchronous or asynchronous dialog between the input and scoring entities, or any combination of these.
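One way to obtain a random subset of fixed size without holding all reads in memory is reservoir sampling. The text does not name a sampling algorithm, so the choice of reservoir sampling, the function name and the toy read stream below are assumptions for illustration only.

```python
import random


def reservoir_sample(reads, size, seed=None):
    """Draw a uniform random subset of `size` reads from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for seen, read in enumerate(reads, start=1):
        if len(sample) < size:
            sample.append(read)          # fill the reservoir first
        else:
            slot = rng.randrange(seen)   # uniform in [0, seen)
            if slot < size:
                sample[slot] = read      # replace with probability size / seen
    return sample


# Hypothetical stream of reads, e.g. as they arrive from an instrument.
stream = (f"read_{i}" for i in range(1000))
subset = reservoir_sample(stream, size=100, seed=1)
```

Because the stream is consumed once, a client with little memory could sample while the reads are still being produced.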

Compared with methods based on assembling the sequencing reads or constructing the genome and then searching, or with methods that map every read across a collection of references, the present method does not attempt a full alignment and works on a subset of the data. It therefore requires considerably less computing power, and results can be obtained within seconds. The method of the invention can thus be run using a client-server approach in which the client is, for example, a tablet or handheld device (such as a mobile phone) with low computing power. Because results for one subset of the data are obtained relatively quickly, the time required to search additional subsets of the data is considerably reduced. In this way, the identity of the different sources of DNA in a sample can be determined in a considerably shorter time than with conventional methods based on alignment of whole sequences.

In its broadest aspect, the invention relates to querying the database for presence only. In a preferred embodiment, however, the database is also queried for the positions of the k-mers in the reference sequences, which allows the contiguity of the source k-mers to be calculated and makes the assessment more accurate. Organisms are often genetically related to one another, and the invention can also find close relatives in the collection of reference sequences.
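How positions can sharpen the assessment may be seen in a small sketch. The text does not fix a contiguity measure; the measure used here, the largest number of k-mer hits that fall within a window of the reference, is an assumption, as are the function name and the example positions.

```python
def max_hits_in_window(positions, window):
    """Largest number of k-mer start positions that fall within any span of `window` bases."""
    ordered = sorted(positions)
    best = 0
    left = 0
    for right, position in enumerate(ordered):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while position - ordered[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


# Five hits clustered within one read-sized region versus five hits
# scattered across a very large reference (hypothetical positions).
clustered = [100, 105, 110, 115, 120]
scattered = [100, 5_000, 90_000, 400_000, 2_000_000]
clustered_score = max_hits_in_window(clustered, window=100)
scattered_score = max_hits_in_window(scattered, window=100)
```

Both references contain the same number of query k-mers, so presence alone cannot separate them, whereas the position-based score can.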

Compiling the data in two separate databases or collections decouples the lookup of the presence of a k-mer in the references from the lookup of its positions, and makes optimizations possible, such as caching as many presence lookups as possible in memory, where lookups can be faster than in persistent storage. Position lookups within a given reference can then be performed once a k-mer is found to be present, or in an additional refinement step if sufficient time is available. Thus, a preferred embodiment of the invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences from the source;
b) fragmenting the sequences from the subset into k-mers;
c) querying the k-mers from the subset against a first collection comprising k-mers of reference sequences;
d) querying the k-mers from the subset against a second collection comprising the positions of the k-mers in the reference sequences;
e) determining which references contain the k-mers; and
f) returning a description of the likely source references,
wherein the collection comprising the k-mers of the reference sequences is separate from the collection comprising the positions of the k-mers in the reference sequences.
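The decoupling of presence lookups from position lookups can be sketched as two separate stores with a cache in front of the first. The dictionaries below stand in for whatever storage is actually used, and `functools.lru_cache` stands in for the in-memory caching mentioned above; both, along with every name and value, are illustrative assumptions.

```python
from functools import lru_cache

# Hypothetical first collection: k-mer -> references containing it.
PRESENCE = {"ACGT": {"ref_A", "ref_B"}, "GGCA": {"ref_A"}}
# Hypothetical second collection: (k-mer, reference) -> positions in that reference.
POSITIONS = {("ACGT", "ref_A"): [0, 40], ("ACGT", "ref_B"): [12], ("GGCA", "ref_A"): [4]}


@lru_cache(maxsize=100_000)
def references_containing(kmer):
    """Presence lookup, cached in memory because it is needed for every k-mer."""
    return frozenset(PRESENCE.get(kmer, ()))


def positions_of(kmer, reference):
    """Position lookup, only reached for k-mers already known to be present."""
    return POSITIONS.get((kmer, reference), [])


def lookup(kmer, want_positions=False):
    found = references_containing(kmer)
    if not want_positions:
        return {reference: None for reference in found}
    return {reference: positions_of(kmer, reference) for reference in found}
```

A k-mer that is absent from the first collection never touches the second, so the more expensive position store is queried only when it can contribute to the result.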

Thus, a preferred embodiment of the invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences or short reads from the source;
b) fragmenting the sequences from the subset into k-mers;
c) querying the k-mers from the subset against a first collection comprising k-mers of reference sequences;
d) querying the k-mers from the subset against a second collection comprising the positions of the k-mers in the reference sequences;
e) determining which references contain the k-mers; and
f) returning a description of the likely source references,
wherein the collection comprising the k-mers of the reference sequences is separate from the collection comprising the positions of the k-mers in the reference sequences.

A notable feature of the invention is that, once the likely references have been identified, information about them is returned to the user. The returned information can be, for example, the likely species and its origin or source, and/or information about the complete genome sequence of the likely species. This allows the user to align the remaining raw reads from the unknown sample against the reference sequence, using state-of-the-art alignment or genome assembly algorithms, in order to identify minor variations such as mutations and insertions.

In a further aspect, the invention relates to a database comprising k-mers of reference sequences, the database comprising:
a) a first collection of k-mers from the reference sequences; and
b) a second collection of the positions of each k-mer in the reference sequences.
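A database of this shape can be built in a single pass over the references. The sketch below assumes non-overlapping k-mers, as in the example described for FIG. 5, and uses plain dictionaries as the two collections; the function name and the toy references are hypothetical.

```python
from collections import defaultdict


def build_database(references, k):
    """Return (presence, positions): two separate collections built from the references."""
    presence = defaultdict(set)     # k-mer -> names of references containing it
    positions = defaultdict(list)   # (k-mer, reference name) -> start positions
    for name, sequence in references.items():
        for start in range(0, len(sequence) - k + 1, k):  # non-overlapping k-mers
            kmer = sequence[start:start + k]
            presence[kmer].add(name)
            positions[(kmer, name)].append(start)
    return dict(presence), dict(positions)


presence, positions = build_database(
    {"ref_A": "ACGTACGTGGCA", "ref_B": "TTTTACGT"}, k=4
)
```

Keeping the two dictionaries apart means the presence collection can stay small enough to cache, while the position collection can live in slower storage.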

Compiling the data in two separate databases or collections decouples the lookup of the presence of a k-mer in the references from the lookup of its positions, and makes optimizations possible, such as caching as many presence lookups as possible in memory, where lookups can be faster than in persistent storage. Position lookups within a given reference can then be performed once a k-mer is found to be present, or in an additional refinement step if sufficient time is available.

In a third aspect, the invention relates to a data processing system for identifying the likely source of a source sequence, preferably comprising an input device, a central processing unit, a memory and an output device, wherein the data processing system has stored therein data representing sequences of instructions which, when executed, cause the method of the invention to be performed, and wherein the memory further comprises a database according to the invention.

FIG. 3 illustrates the main points of one embodiment of the system of the invention. The key point is that sampling is performed on the "client", so that a minimal amount of information is transferred. The use of the descriptors of the most likely references is not illustrated in the figure.

The devices (input, output, memory, CPU) can be portable, stationary, cloud-based and/or online-based.

Preferably, the database is stored on a server and the input and output devices are one or more clients, with the clients and the server connected via a data communication connection. Sharing a server allows the collection of references to be centralized and, when the server runs in separate processes or even on separate machines, allows the server's computing capacity to be distributed across clients. In such an embodiment, the client can comprise a sequence of instructions that enables it to sample a subset of the source sequences, fragment them into k-mers, and transfer them to the server.
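The division of labor between client and server can be sketched without any network code. The JSON message layout, the field names `k`, `kmers` and `candidates`, and the hit-count ranking are assumptions chosen for illustration; the text does not specify a wire format.

```python
import json
import random


def client_request(reads, k, sample_size, seed=0):
    """Client side: sample reads, fragment them into k-mers, serialize the request."""
    rng = random.Random(seed)
    subset = rng.sample(reads, min(sample_size, len(reads)))
    kmers = sorted({read[i:i + k] for read in subset for i in range(len(read) - k + 1)})
    return json.dumps({"k": k, "kmers": kmers})


def server_response(request_body, presence):
    """Server side: look up each k-mer and return the references ranked by hit count."""
    request = json.loads(request_body)
    counts = {}
    for kmer in request["kmers"]:
        for reference in presence.get(kmer, ()):
            counts[reference] = counts.get(reference, 0) + 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return json.dumps({"candidates": ranked})


# Hypothetical server-side presence collection and client-side reads.
presence = {"ACGT": ["ref_A"], "CGTA": ["ref_A"], "TTTT": ["ref_B"]}
body = client_request(["ACGTA", "ACGTA"], k=4, sample_size=1)
response = json.loads(server_response(body, presence))
```

Only the distinct k-mers of the sampled reads cross the connection, which keeps the request small enough for a low-powered client.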

The client can further comprise sequences of instructions that enable it to dialog with the server in order to adapt or interrupt the sampling procedure, or to assemble the source sequences into one or more larger sequences on the basis of the sequences transferred from the server to the client.

In one implementation, the system is connected to a sequencing device via a data connection.

In a further aspect, the invention relates to a computer software product containing sequences of instructions which, when executed, cause the method of the invention to be performed, and to an integrated circuit product containing sequences of instructions which, when executed, cause the method of the invention to be performed.

FIG. 1 shows the construction of the "presence" and "position" databases.

FIG. 2 shows the scoring of a set of query DNA fragments, typically raw reads from sequencing.

FIG. 3 shows an overview of the architecture of the system of the invention.

FIG. 4 shows the mean rank (x-axis) and the standard deviation of the rank (y-axis) of the 747 bacterial genomes in the database, each used as a query, for varying read sizes (rows) and random substitution rates (columns).

FIG. 5 shows an overview of a specific example of the indexing and scoring procedure, which is also used in Examples 1 and 2. (A) When indexing the collection of reference sequences, non-overlapping k-mers are indexed into two separate key-value stores, one of which associates each k-mer with the references in which it was found ("presence"), while the other associates each k-mer with the positions in the references at which it was found ("position"). (B) When processing the sequencing reads in the query set, overlapping k-mers are looked up in the "presence" store. Using overlapping k-mers makes it possible to resolve relatively quickly any misalignment between the start of the read and the start of the reference sequence (dotted line). In the figure, only a subset of the k-mers is in phase with the indexing step, and only those can therefore be found in "presence". (C) For a given read, a threshold is applied so that only references likely to match enough of the read are retained. The situation in which a very large reference contains disjoint, scattered k-mers, as with bacterial reads against a mammalian genome, is resolved in a final step in which the "position" store is queried, for example by using the highest density of k-mers within a minimal region of the reference.

FIG. 6 shows bacterial reads. For each bacterial genome in the set of 747 genomes, the applicants simulated several read lengths (50 nucleotides (nt), 75 nt, 100 nt, 150 nt, 200 nt, 250 nt) and several substitution error rates (0%, 1%, 5%, 10%). 100 random reads were used in each query, and the distribution of the rank of the correct reference in the returned list was recorded; a rank of 1 means that the correct reference was at the top of the list. The returned list of hits was limited to a maximum length of 25, and a reference that did not appear in the list at all was counted as "missing". The percentage of correct test bacterial genomes is represented by a bar nested on the right-hand side of each panel. The figure shows that, as expected, performance degrades as the error rate increases, but also that reads of length 50 appear to perform relatively poorly. Increasing the read length beyond 100 nucleotides yields only a slight improvement over 100-nucleotide reads and has a limited compensating effect on the error rate.

FIG. 7 shows bacterial reads (number of reads). For each bacterial genome in the set of 747 genomes, the applicants simulated several read lengths (50 nt, 75 nt, 100 nt, 150 nt, 200 nt, 250 nt) and several substitution error rates (0%, 1%, 5%, 10%). 100, 200 or 300 random reads were used in each query, and the distribution of the rank of the correct reference in the returned list was recorded; a rank of 1 means that the correct reference was at the top of the list. The curves show the results for 100, 200 and 300 reads. Increasing the number of reads in the random sample from 100 to 300 yields only a relatively small gain in performance; the error rate and the read length had a much stronger effect.

FIG. 8 shows bacterial reads, variability of performance: the mean rank of the true reference (rank, x-axis) and the standard deviation of the rank (S rank, y-axis) when the identification procedure was repeated five times for each of the 747 test bacterial genomes. The closer the mean rank is to 1, the closer the result is to perfect performance; the smaller the standard deviation of the rank, the lower the sensitivity to sampling effects. Because many of the tested bacterial genomes produce equal or nearby coordinates in the scatter plot, hexagonal binning is used to improve clarity, with the bins colored accordingly. The vertical bar on the right of each scatter plot shows the number of test genomes that were not within the top 25 matches, colored on the same scale as the hexagonal bins. Different read sizes (rows) and error rates (random substitutions, columns) were tried, giving a matrix of scatter plots.

FIG. 9 shows bacterial reads, same species: the percentage of matches to the correct species, that is, to a reference in the applicants' collection that belongs to the same bacterial species rather than to exactly the same reference as in FIG. 7, together with the percentage of cases in which the correct species was not within the top 25 matches. Performance for the shortest reads (50 nt) is relatively low and is further reduced by noise (bar plots in the first row), but becomes very good from 100 nt onward and remains robust to noise.

The present invention balances speed and accuracy when identifying the likely source of biological sequence information derived from protein, DNA or RNA present in a sample.

The sequence information to be used in the method of the invention can be, for example, raw reads from a nucleic acid sequencing instrument, from C- or N-terminal sequencing of a protein, or from mass-spectrometry protein sequencing. Thus, in the context of the present invention, the term "sample sequence" refers to such raw reads, also called short reads.

In one particular embodiment, illustrated in FIG. 2, the invention comprises the following:
• Creation of a database from reference DNA (see FIG. 1). The database has two parts: 1) a database of the k-mers of all reference DNA, indexed by reference, and 2) a database of the associations between the k-mers from database 1 and their positions in the reference sequences. Thus, the reference k-mer IDs and the positions are stored in two different databases.

FIG. 1 illustrates one embodiment of database construction. The input data for creating the database are DNA sequences from public or proprietary databases. These are split into k-mers, which are preferably non-overlapping to save space. The k-mers can additionally be bit-packed at 2 bits per base, meaning that each base occupies only 2 bits of memory. To speed up storage, the k-mers are preferably sorted before insertion into the database. In addition, the name of the reference sequence from which each k-mer derives and its position in that sequence can be stored in a separate database.

• Search of a selection of reads from the source, decomposed into k-mer query sequences, against the reference database.

• A primary score is calculated from the number of k-mers from the query sequences that can be found in a given reference sequence in the database.

• Suggested sequences are returned to the user, who can use them for heavier, more traditional analysis.
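The database construction described for FIG. 1 can be sketched as follows. This is a minimal illustration, assuming in-memory dictionaries in place of a real database; the function name `build_collections` and the dictionary layout are hypothetical and not taken from the patent. Bit-packing and sorting are left out for brevity.

```python
from collections import defaultdict

def build_collections(references, k=16):
    """Split each reference into non-overlapping k-mers and index them.

    references: dict mapping a reference name to its DNA string.
    Returns two separate collections (hypothetical layout):
      kmer_to_refs:  k-mer -> set of reference names containing it
      kmer_to_sites: k-mer -> list of (reference name, start position)
    """
    kmer_to_refs = defaultdict(set)
    kmer_to_sites = defaultdict(list)
    for name, seq in references.items():
        # Step by k so the stored k-mers tile the reference without overlap.
        for start in range(0, len(seq) - k + 1, k):
            kmer = seq[start:start + k]
            kmer_to_refs[kmer].add(name)
            kmer_to_sites[kmer].append((name, start))
    return dict(kmer_to_refs), dict(kmer_to_sites)

# Toy references and k=4, chosen only to keep the example readable.
refs = {"refA": "ACGTACGTTTGA", "refB": "ACGTGGGGCCCC"}
presence, sites = build_collections(refs, k=4)
```

Keeping presence and positions in two structures mirrors the two-part database: a lookup that only needs to know whether a k-mer exists never touches the position data.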

The features of this implementation of the invention are as follows:
• Only exact k-mer matches are registered in the search.
• Query reads are decomposed into many k-mers, for example of length 16. The starting point of each k-mer is advanced by one.
• It is not a "traditional" de novo alignment or mapping method.
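The decomposition of a query read into k-mers of length 16, with the starting point advanced by one, can be sketched as below. The helper name `kmers` is an assumption made for illustration.

```python
def kmers(read, k=16, step=1):
    """Yield (offset, k-mer) pairs from a sliding window over the read."""
    for offset in range(0, len(read) - k + 1, step):
        yield offset, read[offset:offset + k]

read = "ACGTACGTACGTACGTAC"  # 18 bases
windows = list(kmers(read))
# An 18-base read gives 18 - 16 + 1 = 3 overlapping 16-mers.
```

A read shorter than k yields no k-mers at all, which is one reason very short reads carry little information in this scheme.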

FIG. 2 illustrates one possible algorithm for searching the k-mer database. A read is split into k-mers using a sliding window with a step size of 1. If a k-mer has already been visited in the current search, the next k-mer is selected. Otherwise, the k-mer is looked up in the k-mer database. If it is present, the identity of the reference sequence and the position in that sequence are retrieved. The approximate contiguity of the read is then calculated, and if the largest contiguous segment exceeds a threshold, the hit count is incremented. This is repeated for every k-mer in the read. For each read, a score is calculated as the number of hits (hit count) divided by the length of the query sequence, and then the hit count divided by the length of the matching reference sequence is calculated. This is repeated for a number of reads, which can be defined a priori or dynamically depending on the scores obtained. The scores are sorted and the best matches are returned to the user.
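The search loop of FIG. 2 can be approximated by the sketch below. It is a simplification under stated assumptions, not the patented procedure: contiguity is evaluated once per read instead of per k-mer, "approximate contiguity" is taken to mean the largest group of matched positions lying within one read length of each other, and the names `count_hits` and `min_cluster` are hypothetical.

```python
from collections import defaultdict

def count_hits(read, presence, sites, k=4, min_cluster=2):
    """Return {reference: hit count} for one read.

    presence: set of reference k-mers (first collection).
    sites: k-mer -> list of (reference, position) (second collection).
    """
    visited = set()
    positions = defaultdict(list)
    for offset in range(len(read) - k + 1):   # sliding window, step 1
        kmer = read[offset:offset + k]
        if kmer in visited:                   # already seen in this search
            continue
        visited.add(kmer)
        if kmer not in presence:              # exact lookup, first collection
            continue
        for ref, pos in sites[kmer]:          # positions, second collection
            positions[ref].append(pos)
    hits = {}
    for ref, found in positions.items():
        # Assumed contiguity measure: most matches within one read length.
        best = max(
            sum(1 for p in found if start <= p < start + len(read))
            for start in found
        )
        if best >= min_cluster:               # threshold on the cluster size
            hits[ref] = best
    return hits

# Toy data: both references contain CCCC and GGGG, but only in "near"
# do the two k-mers lie close together.
presence = {"AAAA", "CCCC", "GGGG", "TTTT", "ATAT"}
sites = {
    "AAAA": [("near", 0)],
    "CCCC": [("near", 4), ("far", 0)],
    "GGGG": [("near", 8), ("far", 12)],
    "TTTT": [("near", 12)],
    "ATAT": [("far", 4), ("far", 8)],
}
result = count_hits("CCCCGGGG", presence, sites)
```

In this toy case the reference "far" contains both matching k-mers but 12 positions apart, so it is not counted, which is the effect the position information is meant to achieve.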

Exact matching is not required at the level of the read. The scoring tolerates missed k-mer matches along the read, which provides robustness to sequencing errors and to mutations in the biological sample.

An overview of the system is as follows:
• All known reference DNA sequences are indexed into k-mers, storing the reference (e.g., species) and the position in the reference sequence. This step is preferably performed only when the reference DNA sequences are updated by the addition of new sequences or of further sequence information.
• A client that can handle short DNA sequences by splitting them into k-mers, matching them against the database, counting the number of hits per reference sequence and, preferably, refining the matching using the position information.

The references obtained can subsequently be used for the following purposes:
• Removing the reads that match the reference, to find out whether less abundant DNA from another, different reference is present.
• Performing an alignment against the reference, or iteratively building larger fragments using the references in the database, which gives better performance than de novo assembly by leveraging previously assembled references; moreover, performance will increase as the database grows and more assembled references are added.
• Identifying the likely presence of various organisms or genes (e.g., relevant for diagnostic purposes).

Because only a sub-sample of the raw reads is required, this can reduce the amount of data that must be transferred to make a preliminary diagnosis, such as the identification of an infectious agent. For smaller sequencing experiments, it also allows part of the analysis to be performed by a client on commodity hardware.

With the development of low-throughput desktop sequencers (or disposable sequencing units) and the emergence of cheaper GPU or FPGA units, the technique enables real-time or near real-time primary analysis of sequencing data.

Algorithm
In one aspect, the present invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences or short reads from a source;
b) fragmenting the sequences from the subset into k-mers;
c) querying the k-mers from the subset against a database comprising k-mers of reference sequences;
d) determining which references contain the k-mers; and
e) returning a description of the likely source reference.
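Steps a) to e) can be strung together as in the following sketch, which uses presence only and ranks references by the number of distinct query k-mers they contain. The function name, the dictionary standing in for the database and the default parameters are assumptions made for illustration.

```python
import random
from collections import Counter

def identify_source(reads, kmer_to_refs, k=16, n_reads=100, seed=0):
    """Rank references by the number of sampled query k-mers they contain.

    kmer_to_refs: k-mer -> set of reference names (stand-in for the database).
    """
    rng = random.Random(seed)
    subset = rng.sample(reads, min(n_reads, len(reads)))      # step a
    query = {r[i:i + k] for r in subset
             for i in range(len(r) - k + 1)}                  # step b
    counts = Counter()
    for kmer in query:                                        # step c
        for ref in kmer_to_refs.get(kmer, ()):                # step d
            counts[ref] += 1
    return counts.most_common()                               # step e

# Toy index with k=4: refA contains both query k-mers, refB only one.
index = {"ACGT": {"refA", "refB"}, "CGTT": {"refA"}}
ranking = identify_source(["ACGTT"], index, k=4)
```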

The term "sequence from a source" is used to denote a sequence obtained from a sample containing biological sequences. The sample can be an environmental sample, a sample from a subject such as a patient, a sample from a crime scene, a food sample, a water sample, and so on. The sample is subjected to state-of-the-art DNA/RNA or protein isolation and sequencing methods. The result is a set of sequences (also called reads) characteristic of the sample. The sequences are typically of random length within a certain interval, and they typically overlap at random. Each of the sequences derived from the sample, called source sequences, can be subjected to the method of the invention.

The term "reference" in the present invention covers a descriptor of a sequence stored in the database. A typical example of a reference is the complete genome sequence of a particular species, variety or isolate. A reference can also consist of the transcriptome or proteome of a particular species, or of a particular state of a species. Whereas the transcriptome and proteome of a species can change over time in response to age and environmental conditions, the genome sequence of a species, for example, remains more or less constant over time. The database can store additional information about the references.

The method of the invention can be applied to any biological sequence information, such as amino acid sequences and nucleotide sequences (DNA and RNA sequences). In a preferred embodiment, the sequences are DNA sequences.

In its broadest aspect, the invention relies solely on identifying the presence of k-mers derived from the query or source sequences. In this case, the output of the algorithm is a list of references and the corresponding number of hits identified in each reference. However, because of the size of some genomes, such as the human genome and in particular some plant genomes, many k-mers may be present in these very large genomes by chance. Therefore, in a preferred embodiment, the query further comprises determining the position of the k-mers in the reference sequence. This allows both presence and position to be used to determine the contiguity of the query k-mers in the reference sequence. The query thereby becomes more accurate, as the score can be based on both presence and locality, or on the approximate contiguity of the k-mers in the reference.
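The effect of genome size on chance matches can be made concrete with a back-of-the-envelope estimate. The sketch assumes a uniformly random genome with equal base frequencies and a single strand, which real genomes do not satisfy; the genome sizes are round illustrative figures, not values from the patent.

```python
def expected_chance_occurrences(genome_size, k):
    """Expected occurrences of one fixed k-mer in a uniformly random genome."""
    return (genome_size - k + 1) / 4 ** k

# Illustrative sizes: about 5 million bases (bacterium), about 3 billion (human).
bacterial = expected_chance_occurrences(5_000_000, 16)
human = expected_chance_occurrences(3_000_000_000, 16)
```

Under these assumptions an arbitrary 16-mer is expected about 0.001 times in the bacterial-sized genome but about 0.7 times in the human-sized one, so presence alone says little for very large references, whereas clustered positions remain informative.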

Thus, a preferred embodiment of the invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences or short reads from a source;
b) fragmenting the sequences from the subset into k-mers;
c) querying one or more k-mers from the subset against a first collection comprising k-mers of reference sequences;
d) querying one or more k-mers from the subset against a second collection comprising the positions of the k-mers in the reference sequences;
e) determining which references contain the k-mers; and
f) returning a description of the likely source reference,
wherein the collection comprising the k-mers of the reference sequences is separate from the collection comprising the positions of the k-mers in the reference sequences.

In an even more preferred embodiment of the invention, the second collection, comprising the positions of the k-mers in the reference sequences, is queried only if the given k-mer has been found (i.e., is present) in the first collection, comprising the k-mers of the reference sequences (see FIG. 2).

In a preferred embodiment of the invention, when steps a) to f) above are used, the presence and position of a given k-mer are determined before the next k-mer is queried. Thus, a preferred embodiment of the invention relates to a method for identifying the likely source of a biological sequence, comprising the steps of:
a) sampling a subset of sequences or short reads from a source;
b) fragmenting the sequences from the subset into k-mers;
c) querying a k-mer from the subset against a first collection comprising k-mers of reference sequences;
d) querying said k-mer from the subset against a second collection comprising the positions of the k-mers in the reference sequences;
e) determining which references contain the k-mers; and
f) returning a description of the likely source reference,
wherein the collection comprising the k-mers of the reference sequences is separate from the collection comprising the positions of the k-mers in the reference sequences.

One notable feature of the invention is that only a subset of the sequences obtained from sequencing is used to query the database. This minimizes data transfer, which can be the rate-limiting step when very large genomes are sequenced and queried. Thus, the subset can comprise at least 1% of the discrete sequences, for example at least 2%, 4%, 5%, 6%, 7.5%, 10%, 15%, 25%, 30%, 35%, 40% or 50%.
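Drawing such a subset is straightforward; a minimal sketch using uniform random sampling without replacement follows. The function name and the choice of uniform sampling are assumptions, since the text does not prescribe how the subset is drawn.

```python
import random

def sample_reads(reads, fraction=0.01, seed=None):
    """Return a random subset containing roughly `fraction` of the reads."""
    n = min(len(reads), max(1, round(len(reads) * fraction)))
    return random.Random(seed).sample(reads, n)

reads = [f"read{i}" for i in range(1000)]   # placeholder read identifiers
subset = sample_reads(reads, fraction=0.05, seed=1)
```

Only the 50 sampled reads would then need to be sent to the database, instead of all 1000.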

One feature of the invention is that the k-mer query comprises determining exact matches between query k-mers and reference k-mers.

When source sequences or short reads are queried, the query preferably comprises querying all k-mers from at least one source sequence. This allows the best calculation of contiguity or approximate contiguity. Preferably, all k-mers from at least 50 source sequences are queried, for example from at least 100, 150, 200, 250, 300, 400, 500, 750, 1000, 1500, 2000, 2500, or 5000 or more sequences. The exact number of source sequences queried is determined by, among other things, network and computing capacity, time constraints, statistical requirements, the size of the complete source sequence, and the relatedness of the source to the different references.

As demonstrated in the examples, each source sequence is preferably of a certain minimum length so as to provide a characteristic fingerprint of the source organism, variant, variety or isolate. For a source sequence that is a nucleotide sequence, the source sequence is preferably at least 50 nucleotide bases long, more preferably at least 75 nucleotide bases, such as 75 to 200 nucleotide bases (for example 75 to 100, 100 to 125, 125 to 150, 150 to 175, or 175 to 200 nucleotide bases), and even more preferably at least 100 nucleotide bases, such as 100 to 300 nucleotide bases (for example 100 to 150, 150 to 200, 200 to 250, or 250 to 300 nucleotide bases), for example 100, 200, at least 250, 300, 400, or at least 500 or more nucleotide bases.

In many practical implementations, one subset of sequences is queried first. If this is not sufficient to determine a reference with sufficiently high certainty, the method can further comprise selecting one or more further subsets of sequences and subjecting them to steps a) to e) or a) to f) of the method of the invention.

In principle, the method allows the use of k-mers (k-tuples) of any size. In a preferred embodiment, however, the k-mer size is divisible by 4. The k-mers can thus be of size 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64 or more. More preferably, the k-mers are between 16 and 64 long, more preferably between 16 and 32. Longer k-mers make the method more sensitive to sequencing errors, whereas shorter k-mers increase the number of random hits and thereby produce noise.
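The text gives no reason for preferring a k-mer size divisible by 4. One reading, offered here as an assumption, is that at the 2 bits per base mentioned for FIG. 1 such a k-mer fills exactly k/4 whole bytes. The sketch below illustrates this; the base-to-code mapping and the function name are hypothetical.

```python
# Hypothetical 2-bit encoding; any fixed assignment of the four bases works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_kmer(kmer):
    """Pack a DNA k-mer at 2 bits per base into k/4 bytes (k divisible by 4)."""
    if len(kmer) % 4:
        raise ValueError("k must be divisible by 4 to fill whole bytes")
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value.to_bytes(len(kmer) // 4, "big")

packed = pack_kmer("ACGTACGTACGTACGT")  # 16 bases -> 4 bytes
```

With this encoding a 16-mer occupies 4 bytes instead of the 16 needed for one character per base, and a 32-mer fits in a single 64-bit word.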

In one embodiment, the k-mers are contiguous; preferably, the k-mers stored in the database are contiguous so as to cover the entire reference sequence.

Preferably, the k-mers from a source sequence overlap, with the start position incremented by at least 1, such as at least 2, 3, 4, 5, or 6 or more bases or amino acids. This corresponds to sliding a window of width k across the sequence, 1, 2 or more bases or amino acids at a time. Creating overlapping, incremented k-mers from the source sequence makes the method less sensitive to sequencing errors or point mutations because, for example, the k-mers on either side of a single-base mutation or error can still be identified in the query. The contiguity can therefore be calculated with higher accuracy.
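The robustness argument can be checked on a toy read: a single substitution only disturbs the windows that cover it, so the k-mers on either side still match exactly. The sequence, the value k=8 and the helper name are chosen only for illustration.

```python
def surviving_kmers(read, mutated, k=16, step=1):
    """Count window positions whose k-mer is identical in both sequences."""
    return sum(
        read[i:i + k] == mutated[i:i + k]
        for i in range(0, len(read) - k + 1, step)
    )

read = "ACGTTGCAAGCTTAGGCTAACGATCGGATTACCGTAGCTA"  # 40 bases
mutated = read[:20] + "A" + read[21:]              # C -> A at position 20
shared = surviving_kmers(read, mutated, k=8)       # 25 of the 33 windows
```

Only the 8 windows that span position 20 are lost. A longer k would lose more windows per error, which matches the remark that longer k-mers are more sensitive to sequencing errors.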

It is also possible to use disjoint k-mers resulting from the concatenation of disjoint subsequences of the source sequence.

Preferably, in the method, a k-mer from a given sequence is queried against the database to determine the presence of the k-mer in one or more reference sequences and the position of the k-mer in said one or more reference sequences. To optimize database usage, the position is preferably queried only if the k-mer is present in the database.

To allow a quantitative evaluation of the query, the method comprises calculating a score for each identified reference sequence, the score correlating with the number of k-mers from the one or more sequences that are found in the given reference sequence. This score can, for example, be divided by the length of the source sequence. A further score can be calculated for the identified reference, correlating with the contiguity of the k-mers from the one or more sequences found in the reference sequence. For example, the score can be the percentage of k-mers from one source sequence that are found in the database, together with the longest run of k-mers found in one reference sequence in the database.

Similarly, for each identified reference sequence, a score can be calculated that correlates with the number of k-mers in the reference sequence that are also present in the subset of k-mers from the source. One example is the percentage of the k-mers from one reference in the database that are found in the source sequences. In many practical applications, several hundred source sequences are queried and scored to obtain satisfactory certainty. This score can also include a score based on the contiguity of the identified k-mers.

These scores are preferably calculated for each separate source sequence; for example, all k-mers from one source sequence are queried and one or more scores are calculated for that source sequence. Preferably, the method further comprises querying all k-mers from a second source sequence, preferably a third source sequence, and so on. The scores of different source sequences can be combined, for example by weighting them by the length of the source sequence.

In one embodiment of the invention, once all k-mers created for a read have been processed, the number of neighboring positions matched in a reference is used to isolate the largest clusters of matches, i.e., the highest concentration of matching k-mers originating from the same read, across all matching references. For each such cluster, the count for the given reference sequence is calculated by adding the number of k-mers in the cluster. When the method is repeated over two or more reads from a given sample, the count is updated by adding the number of k-mers in the cluster to the count obtained for that reference sequence from the previous reads, and the list of k-mers already counted is updated. The next sequence or read can then be processed. The result is a list of references, each associated with a count of the k-mers found to match.

For each pair <reference, count>, the count is divided by the number of unique k-mers in the query set, giving a rough score for the amount of DNA in the queried subset that is matched by the given reference. If the queried subset matches the sequence perfectly, this score is 1; otherwise it is smaller. For example, if the queried subset is a mixture of two references in equal proportions, the scores for both references will be around 0.5. The count can also be divided by the size of the reference (or by the number of unique k-mers in the reference sequence), giving a rough score for the fraction of the reference represented by the queried subset; this second score helps to sort the matching references and to avoid a bias towards the largest references. The final score is, for example, a weighted sum of these two scores, with equal weights for each score.
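The two normalized scores and their weighted sum can be sketched as follows. The function and parameter names are hypothetical, the reference is normalized by its number of distinct k-mers (one of the two options the text mentions), and the counts in the example are invented purely to show the arithmetic.

```python
def final_scores(counts, n_query_kmers, ref_kmer_totals, w_query=0.5, w_ref=0.5):
    """Combine the two normalised scores into one value per reference.

    counts: reference -> number of matched query k-mers.
    n_query_kmers: number of unique k-mers in the query set.
    ref_kmer_totals: reference -> number of unique k-mers in that reference.
    """
    scores = {}
    for ref, count in counts.items():
        query_fraction = count / n_query_kmers       # share of the query explained
        ref_fraction = count / ref_kmer_totals[ref]  # share of the reference covered
        scores[ref] = w_query * query_fraction + w_ref * ref_fraction
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented example: an equal mixture of two references of different size.
ranked = final_scores(
    counts={"refA": 50, "refB": 50},
    n_query_kmers=100,
    ref_kmer_totals={"refA": 1000, "refB": 4000},
)
```

Both references explain half of the query (first score 0.5), but the second score is 0.05 for refA and 0.0125 for refB, so the smaller reference ranks first. This is the intended correction for the bias towards large references.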

In one embodiment of the invention, a preselected number of source sequences is queried and the results are returned. In other embodiments, however, the database query can be stopped once a reference organism has been identified with a predefined statistical probability. Likewise, if a predefined fraction of the k-mers is not found in the database, the database query can be stopped or extended with further source sequences, or the score can be calculated with relaxed parameters. This can occur with junk sequences, sequences with many sequencing errors, or completely unknown sequences.

The output of the query process can be a list of likely source references ranked according to one or more of said scores. Other examples of database output include one or more of the following items of information about one or more likely references: the taxonomic name of the likely reference, close relatives of the likely reference, the source of the reference, genetic linkage information, information on SNPs, and the positions and annotations of genes in the sequence.

In a particular embodiment, the database outputs the sequence of the most likely reference; preferably, the database outputs the complete genome sequence of the most likely reference species. This allows the user to align the source sequences against the complete genome sequence of the most likely species using state-of-the-art alignment algorithms, and to investigate further whether mutations, insertions, or chromosomal aberrations, abnormalities or anomalies are present. In one embodiment of the invention, however, the method of the invention does not include the use of an alignment algorithm on the sequence data, such as an alignment algorithm that uses a scoring matrix, for example the Smith-Waterman algorithm [14], BLAST [1], BLAT [5], Bowtie, BWA, SHRiMP [16] or other alignment algorithms known to the person skilled in the art.

In many cases, for example when microbial sequences are queried, the database may contain many closely related sequences, e.g. sequences from different isolates of the same species. In such cases, results from references with very similar sequences can be grouped in the output. This can also make it easier for the user to identify another species present in smaller amounts, or small pieces of inserted DNA from a different species.

Samples often contain a mixed population of species, and whole-genome sequencing then yields a mixture of genomic DNA from several species. In such cases, the method can be run for several iterations, with the first iteration identifying the most abundant reference. In the second iteration, sequences from the most abundant species can be removed from the source sequences before the database is queried, or the method can ignore further results from that species.
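As a minimal sketch of the iterative variant, assuming a k-mer index that maps each k-mer to a set of reference identifiers, the following Python code implements the second option mentioned above (ignoring further results from references already reported) rather than removing reads. The function names and the `max_rounds` limit are illustrative assumptions.

```python
from collections import Counter

def kmers(seq, k):
    """Split seq into consecutive non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def best_reference(reads, index, k, ignored=frozenset()):
    """Return the reference with the most k-mer hits, skipping ignored references."""
    hits = Counter()
    for read in reads:
        for km in kmers(read, k):
            for ref in index.get(km, ()):
                if ref not in ignored:
                    hits[ref] += 1
    return hits.most_common(1)[0][0] if hits else None

def iterate_references(reads, index, k, max_rounds=5):
    """Report the most abundant reference, then ignore it in the next round."""
    found = []
    for _ in range(max_rounds):
        ref = best_reference(reads, index, k, ignored=frozenset(found))
        if ref is None:
            break  # nothing left to explain
        found.append(ref)
    return found
```

Each round reports the next most abundant reference, so a minor component of the sample surfaces once the dominant one is set aside.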

Alternatively, the output of a single iteration of the method can include information and scores for all identified references. The scores in this case can include the percentage distribution among the different references.

This embodiment can also be used to identify the reference of an insertion, such as a viral insertion, a transgene or an insertion from another bacterial species.

In many embodiments, the user knows from the outset that sequences or short reads from one reference are present in the sample, and the task is then to identify the likely reference of any other sequences or short reads in the sample. This can be the case in diagnostics, where the sample contains both human DNA and DNA from a potential pathogen. Another example is the identification of harmful bacteria in food samples, where the sample is known to contain DNA from the food source (e.g. salad, tomato, cucumber, or meat from a particular species) and the task is to establish the presence and identity of any contaminating DNA. In such cases, the method can include first removing source sequences that align with sequences from the predefined reference. Alternatively, the method can ignore k-mers from one or more predefined references.
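The alternative of ignoring k-mers from a predefined reference can be illustrated with a short Python sketch. It assumes that the k-mers of the known reference (for example the host or the food species) are available as a set; `mask_known` and `known_kmers` are hypothetical names.

```python
def kmers(seq, k):
    """Split seq into consecutive non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def mask_known(reads, known_kmers, k):
    """Yield only the k-mers that do not occur in the predefined reference."""
    for read in reads:
        for km in kmers(read, k):
            if km not in known_kmers:
                yield km
```

Only the k-mers that survive the mask would then be sent to the database, so that hits on the known reference do not swamp the result.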

In one embodiment, the method includes sampling and querying raw reads obtained from a nucleic acid sequencer.

When a query set of DNA data is to be identified, such as short reads or raw reads from a sequencer for diagnostic purposes, the applicants consider that the brute-force approach of mapping or aligning every read against a comprehensive reference database has two major disadvantages: first, hundreds of megabytes or even several gigabytes of data must be transferred from the sequencing facility to the computing center; second, the computing resources required for the task are substantial. Assuming that the reference collection contains 10,000 bacteria of E. coli size, and that an optimized aligner such as BWA or bowtie2 takes 30 seconds to process 250 Mbases of raw sequencing data (roughly 60× average coverage for a genome of 4 Mbases), this would take about three and a half days on one CPU, although the work parallelizes trivially across several CPUs. Refinements such as concatenating the genomes are possible, but at the cost of an ever-growing memory requirement, post-processing to assign mapping positions back to the original reference genomes, and the need to report multiple matches, which is unavoidable when closely related genomes are in the reference and which short-read aligners often handle poorly. The time complexity of locating the n occurrences of a string of length p in a reference of size u using an FM-index has the upper bound O(p + n log^ε u). Because of the log^ε u term, the cost grows slowly as the reference grows, but it grows linearly with the number of highly similar genomes. The applicants' approach embraces the prospect of an enormous reference database and does not attempt to hold it entirely in the RAM of a single computer.
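The figures quoted above can be checked with a few lines of Python. The variable names are illustrative; the numbers are those stated in the text.

```python
n_genomes = 10_000          # reference genomes of E. coli size
seconds_per_genome = 30     # aligner time for 250 Mbases against one genome
total_seconds = n_genomes * seconds_per_genome
total_days = total_seconds / 86_400     # seconds per day

coverage = 250e6 / 4e6      # raw bases divided by genome size

print(f"{total_seconds} s = {total_days:.2f} days on one CPU")
print(f"average coverage = {coverage:.1f}x")
```

The total is 300,000 seconds, a little under three and a half days, and the coverage works out to 62.5×, which the text rounds to about 60×.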

Database
In one aspect, the invention relates to a database comprising k-mers of reference sequences, the database comprising:
a. a first collection of k-mers from the reference sequences; and
b. a second collection of the positions of each k-mer in the reference sequences.
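The two collections can be pictured with a minimal in-memory sketch in Python. Plain dictionaries stand in for the database here, non-overlapping k-mers are assumed, and the names `build_database`, `presence` and `positions` are hypothetical.

```python
from collections import defaultdict

def build_database(references, k):
    """Build the two collections from a mapping {reference_id: sequence}.

    presence:  k-mer -> set of reference ids containing it   (first collection)
    positions: (k-mer, reference id) -> list of 0-based start offsets (second collection)
    """
    presence = defaultdict(set)
    positions = defaultdict(list)
    for ref_id, seq in references.items():
        for start in range(0, len(seq) - k + 1, k):  # non-overlapping k-mers
            km = seq[start:start + k]
            presence[km].add(ref_id)
            positions[(km, ref_id)].append(start)
    return presence, positions
```

Keeping the positions in a separate collection means that a query which only needs to know which references contain a k-mer never has to read the larger position lists.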

As illustrated in the accompanying examples, the database architecture allows very fast querying of k-mers derived from source sequences; the examples demonstrate that results can be returned within seconds.

The database can further include the full-length sequence associated with a given reference, and/or information about the source of the reference and/or one or more taxonomic descriptors of the reference. Further information that can be stored includes information about the genes annotated in the DNA sequence.

When the database is built, the k-mers can be passed through a hash function that assigns a unique key to each distinct k-mer. Other possibilities include a search tree, or a combination of a hash function and a search tree. The unique key can be associated with information about the references in which the k-mer occurs.
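The text does not name a particular hash function. One simple possibility for DNA k-mers of a fixed length, shown here purely as an assumption, is to pack each base into two bits, which yields a distinct integer key for every distinct k-mer of that length.

```python
# Two-bit code per base; an illustrative choice, not taken from the source.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_key(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base.

    Keys are unique among k-mers of the same length. Ambiguous bases such
    as N are not handled and raise KeyError.
    """
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key
```

A 32-mer encoded this way fits in a 64-bit integer, which keeps keys compact in a hash table or search tree.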

In the second collection, each distinct k-mer can likewise be used as a key and associated, through a hash table, a search tree or a combination of the two, with information about the position of the k-mer in each reference in which it occurs. This collection can contain further information about the location at which the k-mer occurs, such as links to any annotation of the sequence, e.g. coding sequences or regulatory sequences.

One or more further items of information about the reference sequence in which a given k-mer occurs can also be stored in a separate database, such as an SQL database, which can additionally be used to retrieve information about the reference sequences according to the invention. Such items include links to any annotation of the sequence, coding sequences or regulatory sequences; the taxonomic name of the likely reference; close relatives of the likely reference; the source of the reference; a group of further related references; the place where the reference was obtained (soil, sea, gut, sewer, etc.); the time at which the reference sequence was obtained; the taxonomic classification; related species; information about the database from which the reference sequence was downloaded (e.g. NCBI, EBI/Sanger); or other information.

The term "group of further related sequences" means sequences derived from samples taken in similar environments, such as soil, sea, gut or sewers.

Thus, in one embodiment of the invention, the database comprising k-mers of reference sequences comprises:
a) a first collection of k-mers from the reference sequences;
b) a second collection of the positions of each k-mer in the reference sequences; and
c) a third collection or database holding a reference identifier and one or more items of information selected from the group consisting of: the description line, the source of the data, the taxonomic name of the likely reference, close relatives of the likely reference, the source of the reference, information on a group of further related references, the place where the reference was obtained (soil, sea, gut, sewer, etc.), the time at which the reference sequence was obtained, the taxonomic classification, related species, and information about the database from which the reference sequence was downloaded (e.g. NCBI, EBI/Sanger or other databases).

In a preferred embodiment, as shown in FIG. 1, the first collection of k-mers is a key-value store or NoSQL database (e.g. KyotoCabinet) that associates each k-mer (the key in the database) with a list of identifiers of the references containing that k-mer. The second collection, holding the positions of the k-mers in the reference sequences, can likewise be stored in a key-value store or NoSQL database such as KyotoCabinet (see FIG. 1). Information such as the links between references and their identifiers, description lines and data sources is stored in a separate SQL database.
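The split between a key-value store and a separate SQL database can be sketched as follows. This is an illustration under stated assumptions, not the layout of FIG. 1: a Python dictionary stands in for the key-value store (no KyotoCabinet API is used), the standard-library `sqlite3` module stands in for the SQL database, and the table layout and sample rows are invented for the example.

```python
import sqlite3

def make_metadata_db(rows):
    """Create an in-memory SQL table of reference metadata (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE reference (id INTEGER PRIMARY KEY, description TEXT, source TEXT)"
    )
    con.executemany("INSERT INTO reference VALUES (?, ?, ?)", rows)
    return con

def describe(kmer, kv_store, con):
    """Look up a k-mer in the key-value store, then resolve ids to description lines."""
    ids = sorted(kv_store.get(kmer, []))
    if not ids:
        return []
    marks = ",".join("?" * len(ids))  # one placeholder per identifier
    cur = con.execute(
        f"SELECT description FROM reference WHERE id IN ({marks}) ORDER BY id", ids
    )
    return [row[0] for row in cur]
```

The key-value lookup returns only compact identifiers; the descriptive text is fetched from the SQL side once the candidate references are known.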

The length of the k-mers in the database preferably matches the length of the k-mers in the source sequences, so that lookups work properly. The k-mers in the database are, however, preferably non-overlapping; using overlapping k-mers would increase data processing time.

In the invention, the indexed k-mers of the reference sequences in the database can be overlapping or non-overlapping. In a preferred embodiment, the k-mers of the indexed reference sequences are non-overlapping. Those skilled in the art will appreciate that similar scoring principles can be used for an indexed database of either non-overlapping or overlapping k-mers of the reference sequences.

The time complexity of locating the n occurrences of a string of length p in a reference of size u indexed by k-mers is O(p + n log u) when a tree is used for k-mer indexing and lookup, or O(p + n) when a hash is used.

This does not exclude embodiments in which the k-mers overlap and advance in steps of at least 1, such as at least 2, at least 3, at least 4, at least 5, or at least 6 or more bases or amino acids.
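The difference between non-overlapping and overlapping k-mers comes down to the step by which the window advances. A minimal Python sketch, with a hypothetical function name, makes this concrete.

```python
def kmers_with_step(seq, k, step):
    """Return (offset, k-mer) pairs, advancing `step` positions each time.

    step == k gives non-overlapping k-mers; step < k gives overlapping ones.
    """
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]
```

For an 8-base sequence and k = 4, a step of 4 produces two k-mers while a step of 1 produces five, which illustrates why overlapping k-mers enlarge the index and the processing time.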

In a preferred embodiment, the complete genome sequence of a given reference is fragmented into k-mers and loaded into the database. It is also conceivable to build the database solely from the transcriptome or the proteome of a given reference.

If the aim is merely to identify the likely reference of the source sequences, the database need not be complete. It may suffice to provide a random selection of genomic DNA from a particular reference. The selection can also be non-random, for example excluding stretches of repetitive DNA and so-called junk DNA.

One database containing all available information can be built for each type of biological sequence: protein, RNA and DNA. In other embodiments, specialized databases can be built for specialized purposes, for example when the aim is merely to establish whether the source sequences contain a given reference sequence. For example, a database can contain sequence information from humans, animals, mammals, birds, fish, fungi, insects, plants, bacteria, archaea, viruses and/or plasmids. A network of databases can also be built, in which one server, on finding no reference that matches with a sufficiently high score, forwards the request for the reads to one or several other servers.

To make optimal use of hardware resources without sacrificing speed, the database can be divided into sub-databases stored on several different servers.
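The text does not say how k-mers are assigned to sub-databases. As one assumed scheme, each k-mer can be routed by a stable checksum so that every client computes the same sub-database for the same k-mer. In the sketch below, in-memory dictionaries stand in for servers and all names are hypothetical.

```python
import zlib

def shard_for(kmer, n_shards):
    """Pick a sub-database for a k-mer with a stable checksum (assumed scheme)."""
    return zlib.crc32(kmer.encode("ascii")) % n_shards

def build_shards(presence, n_shards):
    """Distribute a k-mer -> references mapping over n_shards dictionaries."""
    shards = [dict() for _ in range(n_shards)]
    for km, refs in presence.items():
        shards[shard_for(km, n_shards)][km] = refs
    return shards

def lookup(kmer, shards):
    """Query only the sub-database responsible for this k-mer."""
    return shards[shard_for(kmer, len(shards))].get(kmer, set())
```

Each lookup touches a single sub-database, so no server needs to hold the whole index in memory.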

In other embodiments, the database is organized into sub-databases according to one or more taxonomic descriptors selected from phylum, class, order, family, genus and species, or according to one or more environmental descriptors such as source, distribution, origin and usual search frequency.

The database can be built as described in FIG. 1 and stored using a database engine of the kind known as a key-value store (e.g. BSDDB, KyotoCabinet, LevelDB, MongoDB or others). Thus, in one embodiment of the invention, the database is stored using a key-value store selected from the group consisting of BSDDB, KyotoCabinet, LevelDB and MongoDB.

Applications of the algorithm
The methods and systems of the invention can be used in numerous applications in which the likely source of DNA found in a sample needs to be identified.

Diagnostics
In medical treatment, the likely source of an infection needs to be identified quickly. This can be done with the method according to the invention, so that a suitable treatment can be selected that treats the infection most effectively and with the fewest side effects.

Yet another diagnostic application concerns the identification of viral insertions in cancer cells. In this application it can be advantageous to filter out all human sequences from the sequences obtained in the raw reads, or simply to ignore all human hits identified in the database. This allows relatively small viral insertions in the human genome to be identified.

Bioterrorism defense
In bioterrorism defense applications, the species of an infectious or pathogenic agent that has been encountered must be identified quickly and reliably. The invention makes it possible to identify a source rapidly without prior knowledge of that source: the method allows the species of a pathogen to be determined without prior knowledge of the species.

Further applications in bioterrorism defense include, for example, the identification of transgenic pathogens into which a toxic transgene has been inserted. The database advantageously also contains sequence information from state-of-the-art plasmids, which allows the regions flanking the insertion to be identified easily. If the transgene originates from an organism found in the database, the source of the transgene can also be identified. In such cases, the database can return the name of the pathogen, the name of the organism from which the transgene originates, the gene encoded by the transgene and the plasmid used to insert the transgene.

Food safety and quality
Current methods for identifying potentially harmful infections in food are either time-consuming (based on isolating and growing the infectious organism) or require prior knowledge of the source of the infection (PCR-based methods). The present method requires neither: authorities and manufacturers can simply isolate genomic DNA, sequence it and upload the raw reads to a system that runs the method of the invention.

When looking for bacteria, fungi or viruses in a food sample, it can be advantageous to query a part of the database that contains only sequences from bacteria, fungi or viruses. Any genomic sequence from the food itself (vegetables, fruit, meat) is then identified as absent from the database, which improves the performance of the method.

Other applications include quality control. One possible application is identifying the species of meat in products such as minced meat, pâté, prepared meals and convenience foods. There are numerous examples of attempted fraud in which expensive meat such as beef or lamb was replaced or "diluted" with cheaper meat such as pork.

Other possible quality-control applications include determining the variety of plants such as grapes, apples and potatoes.

A further possibility is the control of water quality.

Hygiene and prophylaxis
The invention makes hygiene control possible by allowing rapid identification of the source of DNA in samples taken in connection with cleaning procedures. Further applications include identifying the likely source of contamination, which allows the hygienic techniques best suited to eliminating a particular infectious agent to be applied.

Items
The invention is described below as optionally numbered items 1 to 56, which are to be regarded as embodiments of the invention. The invention is further defined with reference to the appended claims.

1. A method for identifying the likely source of a biological sequence, the method comprising the steps of:
a) sampling a subset of sequences or short reads from the source;
b) fragmenting the sequences from the subset into k-mers;
c) querying the k-mers from the subset against a database comprising k-mers of reference sequences;
d) determining which references contain the k-mers; and
e) returning a description of the likely source reference.
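Purely as an illustration, steps a) to e) of item 1 can be sketched in a few lines of Python. The sketch assumes an in-memory index mapping k-mers to sets of reference identifiers, non-overlapping k-mers, and a simple hit count for ranking; `identify_source`, `descriptions` and `seed` are hypothetical names and not part of the item.

```python
import random
from collections import Counter

def identify_source(reads, index, descriptions, k, sample_size, seed=0):
    """Sample reads, cut them into k-mers, query the index, tally and describe."""
    rng = random.Random(seed)
    subset = rng.sample(reads, min(sample_size, len(reads)))   # step a)
    hits = Counter()
    for read in subset:
        for i in range(0, len(read) - k + 1, k):                # step b)
            refs = index.get(read[i:i + k], ())                 # step c)
            hits.update(refs)                                   # step d)
    # step e): descriptions of the references, most hits first
    return [(descriptions.get(ref, ref), n) for ref, n in hits.most_common()]
```

Because only a sample of the reads is cut into k-mers and looked up, the amount of data sent to the database stays small compared with aligning every read.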

2. The method according to item 1, wherein the biological sequences or short reads are amino acid sequences.

3. The method according to item 1, wherein the biological sequences or short reads are DNA or RNA sequences.

4. The method according to any of the preceding items, wherein the k-mer query comprises determining exact matches between query k-mers and reference k-mers.

5. The method according to any of the preceding items, wherein the querying step further comprises determining the position of the k-mers in the reference sequences.

6. The method according to any of the preceding items, wherein presence and position are used to determine the contiguity of the query k-mers in the reference sequence.

7. The method according to any of the preceding items, wherein the query comprises querying all k-mers from at least one source sequence or short read, preferably from at least 50, such as at least 100, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 750, at least 1000, at least 1500, at least 2000, at least 2500, or at least 5000 or more sequences.

8. The method according to any of the preceding items, wherein the source sequences are nucleotide sequences of at least 50 bases, preferably at least 100 bases, such as at least 150 bases, at least 200 bases, at least 250 bases, at least 300 bases, at least 400 bases, or at least 500 or more bases.

9. The method according to any of the preceding items, wherein the subset of sequences comprises at least 1%, such as at least 2%, at least 4%, at least 5%, at least 6%, at least 7.5%, at least 10%, at least 15%, at least 25%, at least 30%, at least 35%, at least 40%, or at least 50% of the discrete sequences.

10. The method according to any of the preceding items, further comprising selecting one or more further subsets of sequences and subjecting them to steps a) to e) of item 1.

11. The method according to any of the preceding items, wherein the subset is random or filtered.

12. The method according to any of the preceding items, wherein the k-mers are of size 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64 or more.

13. The method according to any of the preceding items, wherein the k-mers are contiguous.

14. The method according to any of the preceding items, wherein the k-mers overlap and advance in steps of at least 1, such as at least 2, at least 3, at least 4, at least 5, or at least 6 or more bases or amino acids.

15. The method according to any of the preceding items, wherein the k-mers are concatenations of disjoint subsequences.

16. The method according to any of the preceding items, wherein the k-mers from a given sequence are queried against the database to determine the presence of the k-mers in one or more reference sequences and the position of the k-mers in said one or more reference sequences.

17. The method according to item 16, wherein the position is queried only if the k-mer is present.

18. The method according to any of the preceding items, wherein a score is calculated for the returned references.

19. The method according to any of the preceding items, wherein a score is calculated for the identified reference sequences, the score correlating with the number of k-mers from the one or more sequences that are found in a given reference sequence.

20. The method according to any of the preceding items, wherein a score is calculated for the identified references, the score correlating with the contiguity or approximate contiguity, measured as the average local density, of the k-mers from the one or more sequences that are found in the reference sequence.

21. The method according to any of the preceding items, wherein a score is calculated for the identified references, the score correlating with the number of k-mers in the reference sequence that are also present in the subset of k-mers from the source.

22. The method according to any of items 18 to 21, wherein the likely source references are ranked according to the score or scores.

23. The method according to any of the preceding items, wherein all k-mers from one source sequence or short read are queried and one or more scores are calculated for that source sequence or short read.

24. The method according to item 23, further comprising querying all k-mers from a second source sequence or short read, preferably from a third source sequence or short read, and so on.

25. The method according to any of the preceding items, wherein the database query can be stopped once a reference organism has been identified with a defined statistical probability.

26. The method according to any of the preceding items, wherein the database query can be stopped if a defined fraction of the k-mers is not found in the database.

27. The method according to any of the preceding items, wherein the database outputs one or more of the following items of information about one or more likely references: the taxonomic name of the likely reference, close relatives of the likely reference, the source of the reference, and a group of further related references.

28. The method according to any of the preceding items, wherein the database outputs the sequence of the most likely reference, preferably the complete genome sequence of the most likely reference species.

29. The method according to any of the preceding items, wherein results from references with very similar sequences, or results from further related references, are grouped in the output.

30. The method according to any of the preceding items, wherein several iterations of the method are performed, such as a first iteration comprising identifying the most abundant reference and removing sequences from that reference from the source sequences or short reads.

31. The method according to item 30, further comprising, in a second iteration, identifying the second most abundant reference and removing sequences from that reference, and so on.

32. The method according to item 30, further comprising, in a second iteration, identifying the likely reference of an insertion.

33. The method according to any of the preceding items, further comprising first removing source sequences that align with sequences from a predefined reference.

34. The method according to any of the preceding items, comprising ignoring the k-mers from a source sequence or short read if a defined number of k-mers from that source sequence or short read are not present in the database.

35. The method according to any of the preceding items, wherein the query comprises ignoring k-mers from one or more predefined references.

36. The method according to any of the preceding items, wherein the raw sequences are queried as they are obtained from the nucleic acid sequencer.

37. A database comprising k-mers of reference sequences, the database comprising:
a. a first collection of k-mers from the reference sequences; and
b. a second collection of the positions of each k-mer in the reference sequences.

38. The database according to item 37, further comprising information on the full-length sequence associated with a given reference, and/or the source of said reference, and/or one or more taxonomic descriptors of said reference.

39. The database according to any of items 37 to 38, wherein the k-mers in the database are subjected to a hash function that assigns a unique key to each unique k-mer.

40. The database according to any of items 37 to 39, wherein each unique k-mer in the first collection is associated, through a vector, with information on the references in which the k-mer is present.

41. The database according to any of items 37 to 40, wherein each unique k-mer in the second collection is associated, through a vector, with information on its position in each reference in which it is present.

42. The database according to any of items 37 to 41, wherein the k-mers are of length 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64 or more.

43. The database according to any of items 37 to 42, wherein the k-mers are non-overlapping.

44. The database according to any of items 37 to 43, wherein the k-mers overlap and are offset by at least 1, such as at least 2, such as at least 3, such as at least 4, such as at least 5, such as at least 6 or more bases or amino acids.

45. The database according to any of items 37 to 44, comprising k-mers derived from the complete sequence of each reference.

46. The database according to any of items 37 to 46, comprising sequence information derived from humans, animals, mammals, birds, fish, fungi, insects, plants, bacteria, archaea, viruses and/or plasmids.

47. The database according to any of items 37 to 46, which is divided into sub-databases stored on several different servers.

48. The database according to any of items 37 to 47, which is organized into sub-databases according to one or more taxonomic descriptors selected from phylum, class, order, family, genus and species, or according to one or more environmental descriptors such as source, distribution, origin and past query frequency.

49. A data processing system for identifying a likely source of a source sequence, comprising an input device, a central processing unit, a memory and an output device, wherein said data processing system has stored therein data representing a sequence of instructions which, when executed, cause the method according to items 1 to 36 to be performed, and wherein the memory further comprises a database according to any of items 37 to 49.

50. The system according to item 49, wherein the database is stored on a server, the input and output devices are a client, and the client and the server are connected through a data communication connection.

51. The system according to any of items 49 to 50, wherein the client is selected from a personal computer, a stationary PC, a portable PC and a handheld computing device such as a smartphone.

52. The system according to any of items 49 to 51, wherein the client comprises a sequence of instructions enabling the client to sample a subset of the source sequences, fragment them into k-mers and transmit these to the server.

53. The system according to items 49 to 52, wherein the client further comprises a sequence of instructions enabling the client to assemble the source sequences into one or more larger sequences based on sequences transmitted from the server to the client.

54. The system according to any of items 49 to 53, which is connected to a sequencing apparatus through a data connection.

55. A computer software product containing a sequence of instructions which, when executed, cause the method according to items 1 to 36 to be performed.

56. An integrated circuit product containing a sequence of instructions which, when executed, cause the method according to items 1 to 36 to be performed.

Rapid identification of sequences by k-mers
Applicants therefore present Tapir, a novel method that can rapidly point to the likely origin of DNA or RNA and that works directly on the raw reads obtained from a DNA sequencer. The system consists of a server that references known DNA and a client that holds the DNA data to be identified. To demonstrate its use, Applicants referenced thousands of bacterial genomes, phage genomes, phages and plasmids, together with the human genome, the mouse genome, A. thaliana and various sequences from fungi and archaea. Applicants also implemented a client that runs in a web browser and can process gigabases of data from a portable computing device. The method relies on k-mer indexing and on transferring only a limited amount of data to the server. It can perform its task within seconds from an Android smartphone, consumes a moderate amount of bandwidth when communicating with the server and, to the best of Applicants' knowledge, offers a simplicity of use unlike any existing tool. It is used in Applicants' core facility for routine, immediate quality control of sequencing runs and is available at http://tapir.cbs.dtu.dk.

Introduction
DNA sequencing has become increasingly affordable over the past decade, to the point that restating this is itself an absolutely banal comment [13]. Today's high-end sequencers have the capacity to process several human genomes, or the equivalent of hundreds of bacteria, and a next generation of sequencers is already becoming available that requires a smaller initial investment and offers flexibility in sequencing volume. Sequencing a complete bacterial isolate is currently a day's work but will soon be a matter of hours. Recent announcements on nanopore sequencing [12] presented a USB-powered device that can sequence DNA directly, at an unprecedentedly low level of capital investment because the sequencing device is disposable. Oxford Nanopore, the company behind this future product, announced a release in 2012 [8]. DNA extraction is a relatively simple procedure, and it is foreseeable that DNA sequencing will soon become a routine and inexpensive procedure in molecular biology. Patients will be routinely sequenced, outbreaks of infectious agents will be tracked through their DNA, and water and food quality will also be monitored by DNA sequencing.

On the analytical side, local alignment of sequences, with pioneering tools such as the Smith-Waterman algorithm [14], has been a cornerstone of bioinformatics. Applied between a query and a collection of references, it allows alignments to be ranked and lets researchers infer the origin and function of newly sequenced DNA or RNA from its similarity to existing sequences. This methodology has at times been criticized as inaccurate [2, 11], but its popularity remains unquestionable, and many functional annotations in public databases carry the mention "by sequence homology". Aligning newly obtained DNA against the existing references archived in databases nevertheless remains a relatively demanding computational task. BLAST [1] and later BLAT [5] improved speed, yet with the number of sequences now available, searching a new sequence against the pool of known sequences can still take a relatively long time in an era when web search engines return results almost instantly. Newer tools designed for short-read sequencing, such as Bowtie [6] and BWA [7] to name just two, have since been developed, but these are designed to align every sequencing read against a given reference. To achieve speed, such tools load an index of the reference into memory, which limits the amount of reference DNA they can handle.

Applicants observed a gap between the computationally demanding task of finding the absolute best alignment between a query sequence and a collection of references, and the rapid identification of the references that best match a set of query sequences. To Applicants' knowledge, there is no simple tool that takes a set of short DNA or RNA sequences, such as the reads coming out of a DNA sequencer, and returns a list of the references, whether complete genomes or individual genes, that the set represents. To do this, Applicants propose to use k-mers in a manner distinct from the alignment seeds of both BLAT and SSAHA [9, 10] and from the k-mer counting of MUSCLE [3], in order to identify the source of DNA sequences with reasonable accuracy within seconds.

Materials and Methods
Published genomes, contigs, plasmids and individual genes available from EBI and NCBI were downloaded to serve as reference DNA. Each reference sequence was split into overlapping k-mers, and a key-value store, or NoSQL database (Applicants used Kyoto Cabinet [4]), was built over all k-mers across all references, associating each k-mer (the key in the database) with the list of identifiers of the references containing that k-mer (FIG. 1). Applicants call this the presence database. Similarly, the positions in the references at which each k-mer is found were stored in what Applicants call the position database (FIG. 1). The associations between reference identifiers and information such as description lines and data sources were stored in a separate SQL database.
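The two key-value maps described above can be sketched in a few lines. This is only a minimal in-memory illustration using Python dictionaries in place of a persistent key-value store; the function names `overlapping_kmers` and `build_indexes`, the toy references and the choice of k = 4 are assumptions made for the example and do not come from the source.

```python
from collections import defaultdict


def overlapping_kmers(sequence, k):
    """Yield (position, k-mer) for every window of width k, stepping by one base."""
    for pos in range(len(sequence) - k + 1):
        yield pos, sequence[pos:pos + k]


def build_indexes(references, k):
    """Build presence and position maps from a {reference_id: sequence} dict."""
    presence = defaultdict(set)    # k-mer -> identifiers of references containing it
    positions = defaultdict(list)  # (k-mer, reference_id) -> start positions
    for ref_id, sequence in references.items():
        for pos, kmer in overlapping_kmers(sequence, k):
            presence[kmer].add(ref_id)
            positions[(kmer, ref_id)].append(pos)
    return presence, positions


# Toy references (hypothetical); real references would be whole genomes.
references = {"refA": "ACGTACGT", "refB": "TTACGTTT"}
presence, positions = build_indexes(references, k=4)
```

In this toy case `presence["ACGT"]` holds both reference identifiers, while `positions[("ACGT", "refA")]` lists the two start positions of that k-mer in refA. A production index would presumably store these maps on disk rather than in memory.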

To score a set of short query sequences or reads, Applicants iterate over a random sample of them (FIG. 2). For each sequence, Applicants iterate over the consecutive k-mers obtained by sliding a window of width k along the sequence. For each k-mer that has not been counted before and that is found in the presence database, the positions in the references are queried. Once all k-mers of a read have been processed, Applicants examine the number of neighboring positions matched in each reference and consider only the largest cluster of matches, that is, the highest concentration of matching k-mers originating from the same read, across all matching references. For each such cluster, the number of k-mers is added to the count possibly already accumulated for that reference, and the list of already-counted k-mers is updated. The next sequence or read is then processed. The result is a list of references, each associated with a count of k-mers found to match. For each (reference, count) pair, the count is divided by the number of unique k-mers in the query set, giving a rough score of the amount of DNA in the query matched by the given reference. If the query set matches a reference sequence perfectly, this score is 1; otherwise it is smaller. For example, if the query set is a mixture of two references in equal proportions, the score of both references will be around 0.5. The count is also divided by the size of the reference (the number of unique k-mers in the reference sequence), giving a rough score of the fraction of the reference represented by the query; this second score helps to sort matching references and to avoid a bias toward the largest references. The final score is a weighted sum of these two scores, with equal weights by default. If the query set is large, for example when considering all reads obtained from a DNA sequencing run, only a random sample of the set is used.
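The final scoring step can be illustrated as follows. This sketch covers only the combination of the two normalized scores and takes the per-reference match counts as given, so the clustering step is not reproduced here. The function name `score_references`, the weights of 0.5 each (one possible reading of "equal weights") and all numbers are assumptions for illustration.

```python
def score_references(match_counts, n_query_kmers, ref_sizes, w_query=0.5, w_ref=0.5):
    """Rank references by a weighted sum of two normalized k-mer match counts.

    match_counts:  {reference_id: number of matching k-mers}
    n_query_kmers: number of unique k-mers in the query set
    ref_sizes:     {reference_id: number of unique k-mers in that reference}
    """
    scores = {}
    for ref_id, count in match_counts.items():
        query_fraction = count / n_query_kmers   # share of the query explained
        ref_fraction = count / ref_sizes[ref_id]  # share of the reference covered
        scores[ref_id] = w_query * query_fraction + w_ref * ref_fraction
    # Best-scoring reference first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical numbers: both references match 90 of 100 query k-mers,
# but one reference is a hundred times larger than the other.
ranking = score_references(
    match_counts={"small_genome": 90, "large_genome": 90},
    n_query_kmers=100,
    ref_sizes={"small_genome": 1_000, "large_genome": 100_000},
)
```

With these numbers the smaller reference ranks first, which illustrates how the second score counteracts a bias toward the largest references.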

To facilitate use of the service, Applicants implemented an HTML5/Javascript® client that runs as a page in a web browser. At the time of writing, Firefox 15.0 is the only browser implementing all the required features, and Applicants have tested it to work on Linux®, Mac OS X, Microsoft Windows® and Android 4.0.

To benchmark the system, which was originally designed to identify bacteria in sequencing data, Applicants iteratively took every sequence of bacterial origin available from EBI at the beginning of 2012, that is, 747 bacterial genomes. To simulate reads from a DNA sequencer, random and possibly overlapping subsequences were generated from each genome sequence; subsequences of length 50, 100, 150, 200 and 250 bases were used. Applicants also introduced uniform random substitutions of bases at rates of 0% (no errors), 1%, 5% and 10%, to simulate both a class of sequencing errors and the presence of point mutations in real samples. For each genome, length and substitution rate, a random sample of 100 subsequences, or reads, was taken, and this sampling was repeated 10 times.
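The read simulation described above might look like the sketch below. It assumes that a substitution always replaces a base with one of the three other bases, which the source does not specify; the function name `simulate_reads`, the random toy genome and the seed are likewise illustrative assumptions.

```python
import random

BASES = "ACGT"


def simulate_reads(genome, read_length, n_reads, substitution_rate, rng):
    """Draw random, possibly overlapping subsequences and substitute bases uniformly."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < substitution_rate:
                # Assumption: a substituted base is always different from the original.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice(BASES) for _ in range(10_000))  # toy stand-in for a genome
reads = simulate_reads(genome, read_length=100, n_reads=100, substitution_rate=0.05, rng=rng)
clean_reads = simulate_reads(genome, read_length=50, n_reads=10, substitution_rate=0.0, rng=rng)
```

Repeating the call with lengths of 50 to 250 bases and rates of 0 to 0.10 would reproduce the grid of conditions described in the text.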

Results
For each bacterial genome, Applicants took 100 random simulated reads, scored them with the method against a database containing that bacterial genome among other references, and recorded the rank of the query genome in the list of the 25 best scores. The mean rank and the standard deviation of the rank are shown in FIG. 4.

The closer the mean rank is to 1, the better the scoring, and the smaller the standard deviation of the rank, the less sensitive the result is to sampling effects. The number of missed ranks written in each individual panel corresponds to the number of genomes that were not present among the 25 highest scores.

Performance is suboptimal with reads 50 bases long, but there is already a dramatic improvement with reads of 100 bases, with between 97% and 99% of the query genomes found in the top 5 at low substitution rates and in the top 15 at higher substitution rates. Increasing the read length up to 250 bases helped to compensate for the negative effect of higher substitution rates on the mean rank.

The range of lengths and substitution rates used is comparable to what is obtained from next-generation sequencing platforms such as Illumina (100 bases with an error rate of about 0.1 to 1%), Life Technologies' SOLiD 5500 (75 nt reads with an error rate of 0.01%), Ion Torrent PGM (200 to 300 bases with an error rate of 1%) or Pacific Biosciences (3,000 bases with an error rate of 15%). The method performs well within these ranges, and Applicants believe that performance could be improved further by adding support for paired-end sequencing, a technique used to provide a surrogate for longer reads. The method appears to be relatively insensitive to sequencing errors such as base substitutions: the low ranks expected for the test queries were only minimally affected as the substitution rate increased.

Thanks to the use of a NoSQL database, Applicants expect the approach to scale up as genomic data become ever more abundant, so that indexing and querying ever larger collections of references remains possible on relatively affordable computer systems.

To facilitate use of the method, Applicants developed a browser-based client. It was tested with raw FASTQ files of up to 2 Gb in size and, when monitored, used only a little over 200 Mb of RAM and returned results in less than 20 seconds.

Conclusion
The concept underlying TAPIR is rather simple. Growth in the size of DNA databases has been announced and observed for at least a decade, but recent developments in DNA sequencing technology have made fast and affordable data generation a reality. Applicants argue that matching experimentally obtained DNA sequences against all known DNA is one of the most important challenges in bioinformatics. Applicants show here that this can be done with a speed and ease matching what the Internet web search giants have made available to the general public. For tasks such as real-time surveillance of infections in patients, bioterrorism defense or food safety with desktop DNA sequencers, the method provides an early first step that narrows the search space, after which more advanced analyses can be performed.

(Example 2)
In this example, tens of thousands of genomes and genomic regions from bacteria, viruses, phages and plasmids, as well as from human, mouse, plants, fungi and archaea, were referenced. Applicants also implemented a client that runs in a web browser and demonstrated its use to process and identify several gigabytes of raw sequencing data from a commodity portable computing device within seconds, while consuming a moderate amount of bandwidth to communicate with the server. This example thus shows that identifying DNA from raw reads can be as easy as a search engine query.

Matching a set of query DNA sequences against a comprehensive collection of references
One subjective way to look at alignment programs is to divide them into two main categories: those that do their utmost to map one query sequence against a collection of known references (for example BLAST), and those that try to map a large number of short sequences against one specified reference as quickly as possible (for example Bowtie or BWA). Applicants propose an intermediate approach that can identify good references for a large number of short sequences: multiple sequences are matched against a collection of reference sequences, and the references most represented in the query set are selected.

The approach presented in this example does not involve any selection step when indexing k-mers, a feature that greatly reduces the complexity of building the index from a collection of sequences. This comes at the expense of space and indexes k-mers that may carry little information, but it is offset by the following benefits: the process is linear in the total size of the collection of references and is trivially parallelizable. This ultimately makes indexing all known DNA feasible (similar to a web search engine indexing all documents on the Internet).

In this example, the algorithm does more than simply count k-mers, but performs neither full mapping nor alignment. It takes the matching of k-mers within the context of each read into account and clusters matching k-mers that lie close to one another.
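One way to picture the clustering of nearby matches is a sliding window over the sorted positions at which the k-mers of a single read hit a given reference. The sketch below is an assumption about how such a cluster could be measured, not the procedure used in the source; `largest_cluster` and `max_span` are hypothetical names, and the window width would plausibly be tied to the read length.

```python
def largest_cluster(hit_positions, max_span):
    """Return the largest number of hits whose positions lie within max_span bases."""
    hits = sorted(hit_positions)
    best = 0
    left = 0
    for right in range(len(hits)):
        # Shrink the window from the left until it spans at most max_span bases.
        while hits[right] - hits[left] > max_span:
            left += 1
        best = max(best, right - left + 1)
    return best


# Hypothetical hits of one read's k-mers in one reference: four neighboring
# matches plus two isolated, probably spurious, matches elsewhere.
cluster_size = largest_cluster([100, 101, 102, 5000, 103, 90000], max_span=100)
```

Here only the four neighboring hits would contribute to the count for this reference, so isolated chance matches far away on the same reference are ignored.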

In this example, as shown in FIG. 5, non-overlapping k-mers were used for indexing while overlapping k-mers were used for querying. Applicants regard this as an implementation detail: overlapping k-mers could just as easily be used for indexing and non-overlapping k-mers for querying, while keeping the same principles for scoring matching references.
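The interplay between the two k-mer schemes can be shown on a toy sequence. In this sketch the helper `kmers`, the sequences and k = 4 are illustrative assumptions: the reference is cut into non-overlapping k-mers for the index, and a read is cut into overlapping k-mers so that at least one window lines up with the index regardless of where the read starts.

```python
def kmers(sequence, k, step):
    """Return the k-mers of a sequence, taking one window every `step` bases."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


reference = "ACGTTGCAAG"
index_kmers = kmers(reference, k=4, step=4)  # non-overlapping, used for indexing

read = "GTTGCA"  # taken from the reference, starting at offset 2
query_kmers = kmers(read, k=4, step=1)       # overlapping, used for querying

hits = [kmer for kmer in query_kmers if kmer in set(index_kmers)]
```

Of the three query k-mers, only the one aligned with the indexing frame is found, which is enough to link the read to the reference while keeping the index a fraction of the size of an overlapping one.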

Locating the n occurrences of a string of length p in a reference of size u indexed using k-mers has a time complexity of O(p + n log u) or O(p + n), depending on whether a tree or a hash, respectively, is used for indexing and looking up the k-mers.

When identifying a query set of DNA data, such as raw reads from a sequencer for diagnostic purposes, Applicants consider that a brute-force approach consisting of mapping every read against a comprehensive reference database has two main disadvantages: the hundreds of megabytes or several gigabytes of data that must be transferred from the sequencing facility to the computing center, and the significant computing resources required to perform the task. Assuming that the reference collection contains 10,000 bacteria of E. coli size, and that an optimized aligner such as BWA or Bowtie2 takes 30 seconds to process 250 Mbases of raw sequencing data against one reference (about 60x average coverage for a genome of 4 Mbases), the task would take three and a half days on one CPU, although it can be trivially parallelized over multiple CPUs.
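The estimate above follows from simple arithmetic, reproduced here as a short check. The variable names are illustrative, and the calculation assumes the 30 seconds apply to aligning the whole read set against a single reference.

```python
n_references = 10_000        # E. coli-sized bacterial references in the collection
seconds_per_reference = 30   # assumed time to align the read set against one reference
read_set_bases = 250e6       # 250 Mbases of raw sequencing data
genome_bases = 4e6           # genome size of 4 Mbases

total_seconds = n_references * seconds_per_reference
total_days = total_seconds / 86_400        # 86,400 seconds per day
coverage = read_set_bases / genome_bases   # average depth of coverage

print(f"{total_days:.2f} CPU-days, about {coverage:.1f}x coverage")
```

The result is roughly 3.5 CPU-days and a coverage of 62.5x, consistent with the rounded figures of three and a half days and about 60x given in the text.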

In addition to the time complexity, the data transfer would amount to 250 Mbases of DNA to move the sequencing data to the data center holding the references. Applicants' k-mer-based approach reduces detailed investigations, such as read mapping, SNP calling or even template-based de novo assembly, to a small set of references. When evaluating performance, Applicants arbitrarily chose to consider a search successful, in the first instance, only if the correct answer is within the set of five proposed matches. The task of mapping every read against these references, to identify exactly which one matches best, can be performed in 12 minutes on the same CPU, as opposed to the three and a half days per sample projected above, or even faster with a powerful multi-core architecture. Transferring all the genomes represents about 20 Mbases of DNA, which is easily done over a 3G mobile Internet connection. The approach could allow a mobile sequencing facility such as Ionbus [15] to perform critical diagnostic or scientific tasks in remote locations. If reads remain unmapped because of the presence of smaller regions, such as plasmids, virulence genes, viruses or a mixture of bacteria, these reads can be processed in the same way, and the complete content identified over several iterations.

Construction of the benchmark
To benchmark the system, which was originally designed to identify bacteria in sequencing data, Applicants iteratively took every sequence of bacterial origin available from the EBI database at approximately the beginning of 2012, that is, 747 bacterial genomes. In addition to these, the complete database of references contained bacterial references from NCBI, phages and viruses, plasmids and the human genome (see Table 1 below). Table 1 shows a snapshot of the genomic references (source and number of references) at the beginning of 2012. The references are a mixture of complete genomes or plasmids and genomic fragments such as contigs or genes.

To simulate the reads obtained from a DNA sequencer, random and possibly overlapping subsequences were generated from each genome sequence; subsequences of length 50, 100, 150, 200 and 250 bases were used. Applicants also introduced uniform random substitutions of bases at rates of 0% (no errors), 1%, 5% and 10%, to simulate both a class of sequencing errors and the presence of point mutations in real samples. For each genome, length and substitution rate, a random sample of 100 subsequences, or reads, was taken, and this sampling was repeated 5 times.

Applicants' aim is to find out which known DNA is present in a sample or, when accounting for uncertainties such as sequencing errors or mutations, to assess which genomes are sufficiently close.

Predictive performance
For each bacterial genome, Applicants took 100 random simulated reads, scored them with the method against a database containing that bacterial genome among a larger collection of sequences and genomes from other bacteria, phages, plants, fungi, viruses and mammals, and recorded the rank of the query genome in the list of the 25 best-matching references. To assess the variability of the results for each test bacterial genome, this was repeated five times per genome; the mean rank and the standard deviation of the rank are presented in FIG. 9.
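The evaluation described above, recording the rank of the true genome within the top 25 and summarizing it over repetitions, can be sketched as follows. The helper names `rank_of` and `summarize` and the example ranks are hypothetical, and excluding missed queries from the mean is an assumption, since the source reports misses separately without stating how they enter the average.

```python
from statistics import mean, stdev


def rank_of(true_ref, ranked_refs, top_n=25):
    """Return the 1-based rank of true_ref in the top_n results, or None if missed."""
    top = ranked_refs[:top_n]
    return top.index(true_ref) + 1 if true_ref in top else None


def summarize(ranks):
    """Summarize the ranks from repeated samplings of one genome."""
    found = [r for r in ranks if r is not None]
    return {
        "mean_rank": mean(found) if found else None,
        "sd_rank": stdev(found) if len(found) > 1 else 0.0,
        "missed": len(ranks) - len(found),  # repetitions where the genome was not in the top 25
    }


# Hypothetical outcome of five repetitions for one genome.
example_rank = rank_of("genome_X", ["genome_Y", "genome_X", "genome_Z"])
summary = summarize([1, 1, 2, 1, None])
```

For these made-up ranks the summary reports a mean rank of 1.25, a standard deviation of 0.5 and one miss.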

Performance was relatively low with reads 50 nucleotides long, but Applicants observed a dramatic improvement when the read length was increased, with reads of 100 sequenced bases already close to maximum performance. The best results show that the correct genome is present in the list of results more than 97% of the time at the lower error rates, within the top 5 at low substitution rates and within the top 15 at higher substitution rates. Increasing the read length to 250 bases helped compensate for the negative effect of increasing error rates. Increasing the number of reads in the random sample sent for identification did not have much effect (see FIG. 7): 100 reads is a small amount of data, yet appears to be sufficient to identify DNA in a large number of cases.

As detailed earlier, Applicants' method aims to return the correct reference within a proposed set of matches and, in doing so, to reduce the search space that a brute-force approach would have to explore with computationally demanding procedures. Because running a full analysis on all 25 candidates is still a significant gain over an exhaustive search, restricting the evaluation to finding the query sequence within the top 5 results is almost certainly stricter than necessary, but it shows that the method can already return the correct answer within a very small set of candidate answers.

In the context of an iterative search and identification, pointing to the correct bacterial species, even if not to the exact strain or genome reference, can already be considered a relatively successful answer. FIG. 6 shows that Applicants' identification procedure performs very well with reads longer than 50 nucleotides.

The ranges of lengths and substitution rates that Applicants used are comparable to those obtained from next-generation sequencing platforms such as Illumina (up to 150 bases with an error rate of about 0.1 to 1%), Life Technologies SOLiD 5500 (reads of up to 75 nt with an error rate of 0.01%), Ion Torrent PGM (up to 200 to 300 bases with an error rate of 1%) or Pacific Biosciences (3,000 bases with an error rate of 15%). Applicants' method performs well within these ranges, and Applicants expect performance to increase further once support for paired-end sequencing (a technique used to provide a surrogate for longer reads) is added. Applicants' method appears to be relatively insensitive to sequencing errors such as base substitutions: the low rank expected for Applicants' test queries was only minimally affected as the substitution rate increased.

Applicants also tried the approach on sequencing data from an Ion Torrent PGM, derived from samples ranging from viral and bacterial isolates to metagenomic mixtures. Very similar genomes in the collection of indexed references, such as multiple strains of the same species, can contribute to degraded performance by increasing the probability that a closely related genome has a lower rank than the correct reference genome. This is confirmed by the increase in performance when the species rather than the exact reference is considered, and it is a moderate inconvenience that can be disambiguated in a second iteration. Finally, because Applicants considered k-mers within the context of a read rather than as isolated entities, Applicants obtained very promising results with sequencing data from samples derived from a variety of mammals, and expect to identify these reliably in the near future.

Computational performance
Server:
Memory usage on the server can be kept to a minimum by using disk-based key-value stores, and performance can be tuned by caching these in the memory available on the computer running the server. Thanks to the use of NoSQL databases, Applicants also expect good scaling as genomic data become ever more abundant, so that indexing and querying ever larger collections of references remains possible on relatively affordable computer systems.

In the present implementation, both the indexing system and the server are implemented in Python; indexing 44 Gbases of reference DNA takes a few hours using 8 cores (Intel Xeon, 2.93 GHz), and processing one incoming sample takes a few seconds. Significant speed-ups could be achieved through optimization efforts, such as moving bottlenecks to C, and, should the need arise, overall capacity for handling more requests can also be increased by dedicating additional cores.

Client:
To facilitate the use of Applicants' method, Applicants developed a browser-based client using JavaScript® and HTML5 features, accessible at http://tapir.cbs.dtu.dk. The client currently runs on recent Firefox releases (version 15 or later).

With Firefox running on a relatively modest laptop with an Intel Core i5 CPU clocked at 2.53 GHz, the raw reads in FASTQ files of up to 2 Gb in size could be processed in less than 30 seconds, with smaller files being processed fastest, using just under 300 Mb of RAM and a few seconds of communication with the server.

Applicants further implemented a console-based command-line tool that runs Applicants' algorithm followed by an alignment. The implementation is available in a public software repository: https://bitbucket.org/lgautier/dnasnout-client. It uses Applicants' algorithm to fetch reference genomes, then indexes them and maps all reads with bowtie2. When the 10 top-ranked matches are considered, a complete iteration takes less than a minute, and a single iteration is sufficient in 98% of cases. Given the rapid development of browsers, Applicants expect that it will soon be possible to run a workflow similar to that performed by epidemiology laboratories with desktop sequencing runs using only a web browser.

Discussion
Applicants argue that matching experimentally obtained DNA sequences against all known DNA is one of the most important tasks in bioinformatics. Herein, Applicants have shown that this can be done with a speed and ease matching what the major internet web search companies have made available to the general public. For tasks such as real-time surveillance, for example of infections in patients, bioterrorism defense or food safety, today's desktop DNA sequencers, such as the Ion Torrent PGM or the Illumina MiSeq, are already up to the task. Applicants' method provides an early first step that narrows the search space, after which more advanced analysis methods can be run locally, without the need to move large amounts of raw data between the laboratory performing the DNA sequencing and a computing facility.

Methods
Sources of genome references:
Published genomes, contigs, plasmids and individual genes available from EBI and NCBI were downloaded to form Applicants' reference DNA. The exact composition of the references grows over time; the snapshot used in this example is listed in Table 1.

Indexing of references:
Each reference sequence was split into non-overlapping k-mers and, for all k-mers across all references, a key-value store or NoSQL database was created (Applicants used Kyoto Cabinet [4]) in which each k-mer (the key in the database) was associated with the list of identifiers of the references containing that k-mer. Applicants called this the presence database. Similarly, the positions in the references at which each k-mer is found were stored in what Applicants call the position database. k was chosen equal to 16 because it gave satisfactory results and because multiples of 4 are well suited to bit packing. The associations between reference identifiers and information such as description lines and data sources were stored in a separate SQL database.
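
The two indexes can be sketched with in-memory dictionaries standing in for the disk-based key-value store. This is an illustrative sketch only: the key layout of the position index, the use of sets for reference identifiers and the 2-bits-per-base packing are assumptions, and no Kyoto Cabinet calls are shown.

```python
from collections import defaultdict

K = 16  # k-mer length chosen in the text


def nonoverlapping_kmers(sequence, k=K):
    """Yield (position, k-mer) for consecutive non-overlapping windows.

    A trailing fragment shorter than k is dropped.
    """
    for pos in range(0, len(sequence) - k + 1, k):
        yield pos, sequence[pos:pos + k]


def pack_kmer(kmer):
    """Pack an A/C/G/T k-mer into an integer using 2 bits per base.

    Assumed encoding: 4 bases fill one byte, so a 16-mer fits in 32 bits.
    Raises KeyError for ambiguous bases such as N.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = (value << 2) | code[base]
    return value


def build_index(references, k=K):
    """Build the presence and position indexes from {reference id: sequence}."""
    presence = defaultdict(set)    # k-mer -> ids of references containing it
    position = defaultdict(list)   # (k-mer, reference id) -> start positions
    for ref_id, sequence in references.items():
        for pos, kmer in nonoverlapping_kmers(sequence, k):
            presence[kmer].add(ref_id)
            position[(kmer, ref_id)].append(pos)
    return presence, position
```

Keeping the two mappings separate means a query can first test presence cheaply and look up positions only for k-mers that are actually found.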

Scoring:
To score a set of short query sequences, or reads, Applicants iterated through a random sample of them; the larger the sample, the more accurate the result is likely to be. For each sequence, Applicants iterated over the successive k-mers obtained by sliding a window of width k along the sequence. For each k-mer, if it had not been counted previously and was found in the presence database, Applicants queried its positions in the references. Once all k-mers of a read had been processed, Applicants examined the number of matched positions lying close together in each reference and considered only the largest cluster of matches, i.e., the highest concentration of matching k-mers originating from the same read across all matching references. For each such cluster, Applicants added the number of k-mers to the number possibly added earlier for that reference, and updated the list of k-mers already counted. The next sequence, or read, was then processed. Once all reads have been processed, a list of references is obtained, each associated with a count of k-mers found to match. For each pair <reference, count>, the count was divided by the number of unique k-mers in the query set, giving a rough score for the amount of DNA in the query matched by the given reference. With the scoring principle illustrated, the score will be 1 if the query set matches a reference perfectly and smaller otherwise; for example, if the query set is a mixture of two references in equal proportions, the score of both references will be around 0.5. The count is also divided by the size of the reference, giving a rough score for the fraction of the reference represented by the query; this second score helps to sort matching references and to avoid a bias toward the largest references. The final score was calculated as a weighted sum of these two scores, using equal weights. When the query set is large, for example when all reads obtained from a DNA sequencing run are considered, only a random sample of the set is used.
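
The scoring procedure can be sketched as below. Several details are assumptions made for illustration: a cluster is taken to be the largest group of matched positions lying within one read length of each other, ties between references are all credited, the position index is keyed by (k-mer, reference id), and `ref_sizes` is supplied by the caller in whatever unit is wanted. A k-mer that occurs several times in one reference contributes several positions, which a fuller implementation would need to handle.

```python
from collections import defaultdict


def sliding_kmers(read, k):
    """All k-mers obtained by sliding a window of width k along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def largest_cluster(positions, span):
    """Size of the largest group of positions lying within `span` of each other."""
    positions = sorted(positions)
    best, left = 0, 0
    for right, pos in enumerate(positions):
        while pos - positions[left] > span:
            left += 1
        best = max(best, right - left + 1)
    return best


def score_reads(reads, presence, position, ref_sizes, k=16):
    """Return {reference id: final score} for a set of reads."""
    counted = set()             # k-mers already credited to some reference
    seen = set()                # every distinct k-mer in the query set
    counts = defaultdict(int)   # reference id -> matched k-mer count
    for read in reads:
        hits = defaultdict(list)  # reference id -> matched positions
        new_kmers = set()
        for kmer in sliding_kmers(read, k):
            seen.add(kmer)
            if kmer in counted or kmer in new_kmers or kmer not in presence:
                continue
            new_kmers.add(kmer)
            for ref_id in presence[kmer]:
                hits[ref_id].extend(position[(kmer, ref_id)])
        if not hits:
            continue
        sizes = {r: largest_cluster(p, len(read)) for r, p in hits.items()}
        top = max(sizes.values())
        for ref_id, size in sizes.items():
            if size == top:  # keep only the largest cluster(s) for this read
                counts[ref_id] += size
        counted |= new_kmers
    scores = {}
    for ref_id, count in counts.items():
        query_share = count / len(seen)        # share of the query matched
        ref_share = count / ref_sizes[ref_id]  # share of the reference covered
        scores[ref_id] = 0.5 * query_share + 0.5 * ref_share
    return scores
```

Sorting the returned dictionary by score in decreasing order gives the ranked list of candidate references from which the top 25 would be reported.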

Client implementation:
To facilitate use of the service, Applicants implemented an HTML5/JavaScript® client that runs as a page in a web browser. For this work, Firefox version 15 was used, and Applicants tested that it works on Linux®, Mac OS X and Microsoft Windows® (various laptops and desktops) as well as on Android 4.0 (ASUS TF101 tablet; Applicants expect that it can also work from high-end smartphones). However, the person skilled in the art will appreciate that other suitable browsers may be equally useful. The client is also implemented as a Python library and command-line tool, for easy evaluation and for integration into existing workflows and pipelines.

Other technical specifications:
All implementation on the server side used Python version 2.7.3, with the exception of bindings to libraries such as Kyoto Cabinet. The web application uses the micro-framework Flask and is served by lighttpd. The client-side library and command-line tool were developed for Python version 3.3.

The person skilled in the art will appreciate that the algorithm, or parts of it, may be implemented in other suitable, commonly known programming languages, such as the C programming language, and that this may improve the performance of the method by reducing the time spent on queries.

References

Claims

A method of identifying a likely source of a biological sequence, such as a short read, the method comprising:
a) sampling a sequence or a subset of short reads from a source;
b) fragmenting sequences from the subset into k-mers;
c) interrogating one or more k-mers from the subset against a first collection containing k-mers of a reference sequence;
d) interrogating one or more k-mers from the subset against a second collection that includes the position of the k-mer in the reference sequence;
e) determining which references contain the one or more k-mers;
f) returning a description of a likely source reference, wherein the first collection containing the k-mers of the reference sequence is separate from the second collection containing the positions of the k-mers in the reference sequence.

The method of claim 1, which does not include the use of an alignment algorithm on the sequence data, such as an alignment algorithm that uses a scoring matrix.

The method according to any of the preceding claims, wherein the querying step further comprises the step of determining the position of the k-mer in the reference sequence.

A method according to any preceding claim, wherein presence and position are used to determine the continuity of the query k-mer in the reference sequence.

The method according to any of the preceding claims, wherein the biological sequence is an amino acid sequence.

The method according to claim 1, wherein the biological sequence is a DNA or RNA sequence.

A method according to any preceding claim, wherein the k-mer query comprises determining an exact match between the query and the reference k-mer.

A method according to any preceding claim, wherein the query comprises querying every k-mer derived from at least one source sequence or short read, preferably from at least 50, such as at least 100, such as at least 150, such as at least 200, such as at least 250, such as at least 300, such as at least 400, such as at least 500, such as at least 750, such as at least 1000, such as at least 1500, such as at least 2000, such as at least 2500, such as at least 5000 sequences.

The method according to any of the preceding claims, wherein the source sequence is a nucleotide sequence of at least 50 bases, preferably at least 100 bases, such as at least 150 bases, such as at least 200 bases, such as at least 250 bases, such as at least 300 bases, such as at least 400 bases, such as at least 500 bases or more.

A method according to any preceding claim, wherein said subset of sequences comprises at least 1%, such as at least 2%, such as at least 4%, such as at least 5%, such as at least 6%, such as at least 7.5%, such as at least 10%, such as at least 15%, such as at least 25%, such as at least 30%, such as at least 35%, such as at least 40%, such as at least 50%, of the discrete sequences.

The method according to any of the preceding claims, further comprising selecting one or more further subsets of the sequence and subjecting them to steps a) to f) according to claim 1.

The method according to any of the preceding claims, wherein the subset is random or filtered.

The method according to any of the preceding claims, wherein the k-mer is of size 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64 or more.

The method according to claim 1, wherein the k-mer is continuous.

The method according to any of the preceding claims, wherein the k-mers are overlapping and are offset from one another in steps of at least 1, such as at least 2, such as at least 3, such as at least 4, such as at least 5, such as at least 6 bases or amino acids.

A method according to any preceding claim, wherein the k-mer is a concatenation of disjoint subsequences.

A k-mer from a given sequence is queried against a database to determine the presence of the k-mer in one or more reference sequences and the position of the k-mer in the one or more reference sequences. A method according to any of the preceding claims.

The method of claim 17, wherein a location is queried only if the k-mer is present.

The method according to any of the preceding claims, wherein the score of the returned reference is calculated.

The method according to any of the preceding claims, wherein a score for the identified reference sequence is calculated and the score correlates with the number of k-mers from one or more sequences found in a given reference sequence. .

A method according to any of the preceding claims, wherein a score of the identified reference is calculated, said score correlating with the contiguity, expressed as the mean or approximate contiguity of the local concentration, of k-mers from one or more sequences found in the reference sequence.

The method according to any of the preceding claims, wherein a score of an identified reference is calculated, the score correlating with the number of k-mers in the reference sequence that are also present in the subset of k-mers from the source.

23. A method according to any of claims 19 to 22, wherein likely source references are ranked according to the score or scores.

The method according to any of the preceding claims, wherein every k-mer from one source sequence or short read is queried and one or more scores of the source sequence or short read are calculated.

A method according to any of the preceding claims, wherein a count of k-mers that match the reference sequence is obtained.

The method according to any of the preceding claims, wherein the score is obtained by dividing the count of k-mers matching the reference sequence by the number of unique k-mers in the queried subset.

27. A method according to claims 24 to 26, wherein a score is obtained by dividing the count of k-mers that match against a reference sequence by the size of the reference sequence.

28. The method of claims 24 to 27, wherein the score of the reference sequence is calculated as a weighted sum of the scores of claims 26 and 27.

A method according to any of the preceding claims, further comprising the step of querying any k-mer from the second source sequence, preferably the third source sequence.

The method according to any of the preceding claims, wherein the database query can be stopped once a reference organism has been identified with a defined statistical probability.

The method according to any of the preceding claims, wherein the database query can be aborted if a defined fraction of k-mers is not found in the database.

The method according to any one of the preceding claims, wherein the database outputs one or more of the following items of information about one or more likely references: sequence, coding sequence, regulatory sequence annotation, taxonomic name of the likely reference, source of the reference, group of further related references, place where the reference was obtained (such as soil, sea, intestine or sewer), time at which the reference sequence was obtained, taxonomic classification, and information on the database (such as the NCBI or EBI/Sanger databases) from which the reference sequence was downloaded.

The method according to any of the preceding claims, wherein the database outputs the most likely reference sequence, and preferably the database outputs the complete genomic sequence of the most likely reference species.

A method according to any preceding claim, wherein results from references having very similar sequences or results from further related references are grouped in the output.

A method according to any of the preceding claims, wherein several iterations of the method are performed, e.g., the most abundant reference is identified in a first iteration and sequences derived from the most abundant reference are removed from the source sequences or short reads.

36. The method of claim 35, further comprising identifying the second most abundant reference in a second iteration, removing sequences derived from the second most abundant reference, and so on.

38. The method of claim 36, further comprising identifying a reference with a high probability of insertion in the second iteration.

A method according to any preceding claim, further comprising first removing source sequences that align with sequences from a defined reference.

The method according to any of the preceding claims, comprising ignoring the k-mers from a source sequence if a defined number of k-mers from that source sequence is not present in the database.

The method according to any of the preceding claims, wherein the query comprises ignoring k-mers from one or more predefined references.

A method according to any preceding claim, wherein the raw sequence as obtained from the nucleic acid sequencer is queried.

A method according to any preceding claim, wherein adaptive sampling is used.

A database for use in a method as defined by claims 1-42, comprising k-mers of reference sequences, the database comprising:
a) a first collection of k-mers from the reference sequences; and
b) a second collection of the positions of each k-mer in the reference sequences.

44. The database of claim 43, further comprising information regarding a full-length sequence associated with a given reference, and / or a source of the reference, and / or one or more taxonomic descriptors of the reference.

45. A database according to any of claims 43 to 44, wherein a k-mer in the database is attached to a hash function that assigns a unique key to each unique k-mer.

46. A database according to any of claims 43 to 45, wherein each unique k-mer in the first collection is related by a vector to information about those references in which the k-mer exists.

47. A database according to any of claims 43 to 46, wherein each unique k-mer in the second collection, if present, is related by a vector to information about its position in each reference.

48. A database according to any of claims 43 to 47, further comprising a third collection or database having a type of information selected from the group consisting of: reference identifier and description line, source of data, sequence, coding sequence, regulatory sequence annotation, likely taxonomic name, references close to the likely reference, source of the reference, group of further related references, place where the reference was obtained (such as soil, sea, intestine or sewer), time at which the reference sequence was obtained, taxonomic classification, related species, and information about the database (such as the NCBI or EBI/Sanger databases) from which the reference sequence was downloaded.

49. A database according to any of claims 43 to 48, wherein the k-mer has a length of 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64 or more.

50. The database according to any one of claims 43 to 49, wherein the k-mers do not overlap.

51. A database according to any of claims 43 to 50, wherein the k-mers overlap and are offset from one another in steps of at least 1, such as at least 2, such as at least 3, such as at least 4, such as at least 5, such as at least 6 bases or amino acids.

52. A database according to any of claims 43 to 51, comprising a k-mer derived from the complete sequence of each reference.

53. A database according to any of claims 43 to 52, comprising sequence information derived from humans, animals, mammals, birds, fish, fungi, insects, plants, bacteria, archaea, viruses and / or plasmids.

54. Database according to any of claims 43 to 53, divided into sub-databases stored on several different servers.

55. A database according to any of claims 43 to 54, organized into sub-databases according to one or more taxonomic descriptors selected from phylum, class, order, family, genus and species, or one or more environmental descriptors such as source, distribution, origin and past query frequency.

A data processing system for identifying a likely source of a source sequence, comprising an input device, a central processing unit, a memory, and an output device, wherein the memory has stored therein data representing an instruction sequence which, when executed, causes the method according to claims 1 to 42 to be performed, and wherein the memory further comprises a database according to any of claims 43 to 55.

57. The system of claim 56, wherein the database is stored on a server, the input device and output device are clients, and the client and server are connected via a data communication connection.

58. A system according to any of claims 56 to 57, wherein the client is selected from a personal computer, a stationary PC, a portable PC, and a portable computing device such as a smartphone.

59. A system according to any of claims 56 to 58, wherein the client includes an instruction sequence that enables the client to sample a subset of the source sequences, fragment them into k-mers and communicate them to the server.

60. A system according to any of claims 56 to 59, wherein the client further includes an instruction sequence that enables the client to assemble source sequences into one or more larger sequences based on the sequences communicated from the server to the client.

61. A system according to any of claims 56 to 60, connected to a sequencing device via a data connection.

43. A computer software product containing an instruction sequence that, when executed, causes the method of claims 1-42 to be performed.

43. An integrated circuit product containing an instruction sequence that, when executed, causes the method of claims 1-42 to be performed.