JP2020509473A

JP2020509473A - Compact representation method and apparatus for biological information data using a plurality of genome descriptors

Info

Publication number: JP2020509473A
Application number: JP2019542715A
Authority: JP
Inventors: コソバルチ，モハメド; アルベルティ，クラウディオ; ゾイア、ジョルジョ; レンジ、ダニエル
Original assignee: ゲノムシスエスエー
Priority date: 2016-10-11
Filing date: 2018-02-14
Publication date: 2020-03-26
Anticipated expiration: 2038-02-14
Also published as: JP2020509474A; EA201991906A1; JP7362481B2

Abstract

A method and apparatus for compressing genomic sequence data produced by genome sequencing machines. Sequence reads are coded by aligning them to an existing or constructed reference sequence; the coding process consists of classifying the reads into data classes and then coding each class as multiplexed descriptor blocks. A specific source model and entropy coder are used for each data class into which the data is partitioned and for each associated descriptor block. [Selected drawing] FIG. 21

Description

The present disclosure provides a new method of representing genomic sequencing data that reduces the storage space used and improves access performance by providing new functionality not available with known prior-art representation methods.

[Cross-reference to related applications]

This application claims priority to, and the benefit of, PCT/US2017/017842, filed February 14, 2017, and PCT/US2017/041591, filed July 11, 2017.

An appropriate representation of genomic sequencing data is essential for enabling efficient genomic analysis applications such as genomic variant calling, as well as all other analyses performed for various purposes by processing sequencing data and metadata.

Sequencing of the human genome has become affordable with the advent of high-throughput, low-cost sequencing technologies. This opens new perspectives in several fields, ranging from the diagnosis and treatment of cancer to the identification of hereditary diseases, and from pathogen surveillance for the identification of antibodies to the creation of new vaccines and drugs and the customization of personalized treatments.

Hospitals, genomics data analysis providers, bioinformaticians, and large biological data storage centers are looking for affordable, fast, reliable, and interconnected genomic information processing solutions that would allow genomic medicine to scale up to a worldwide level. Since data storage has become one of the bottlenecks in the sequencing process, methods for representing genomic sequencing data in compressed form are increasingly being investigated.

The most widely used representations of genomic information in sequencing data are based on compressing the FASTQ and SAM formats. The objective is to compress the traditionally used file formats (FASTQ for non-aligned data and SAM for aligned data). Such files consist of plain-text characters and are compressed, as mentioned above, using general-purpose approaches such as LZ schemes (from Lempel and Ziv, the authors who published the first versions), which underlie the well-known zip, gzip, and similar tools. When a general-purpose compressor such as gzip is used, the result of the compression is usually a single blob of binary data. Information in such a monolithic form is very difficult to archive, transfer, and process, particularly when the volume of data is extremely large, as in high-throughput sequencing. The BAM format is characterized by poor compression performance because it focuses on compressing the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files, and because it adopts general-purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).

A more sophisticated approach to genomic data compression, less frequently used but more efficient than BAM, is CRAM. CRAM provides more efficient compression by adopting differential encoding with respect to a reference (it partially exploits the redundancy of the data source), but it still lacks features such as incremental updates, support for streaming, and selective access to specific classes of compressed data.

These approaches produce poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow because large and rigid data structures must be handled even to perform simple operations or to access selected regions of a genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped read by coding all the elements necessary to reconstruct it.

CRAM has the following drawbacks and limitations, which are solved and overcome by the invention described in this document:
1. CRAM does not support data indexing or random access to data subsets sharing specific features. Data indexing is outside the scope of the specification (see section 12 of the CRAM specification v3.0) and is implemented as a separate file. In contrast, the approach of the invention described in this document employs a data indexing method that is integrated with the coding process, and the indexes are embedded in the coded (i.e., compressed) bitstream.
2. CRAM is built from core data blocks that can contain any type of mapped read (perfectly matching reads, reads with substitutions only, and reads with insertions or deletions, also referred to as "indels"). There is no notion of classifying the data or of grouping reads into classes according to the result of mapping against a reference sequence. This means that all of the data must be inspected even when only reads with specific features are sought. This limitation is solved by the invention, which classifies and partitions the data into classes before coding.
3. CRAM is based on the concept of encapsulating each read into a "CRAM record". This implies that every complete record must be inspected when searching for reads characterized by specific biological features (e.g., reads with substitutions but without indels, or perfectly mapped reads).
In contrast, the present invention has the notion of data classes coded separately in separate information blocks, and has no notion of a record encapsulating each read. This enables more efficient access to sets of reads with specific biological characteristics (e.g., reads with substitutions but without indels, or perfectly mapped reads) without the need to decode each read (or block of reads) to inspect its features.
4. Within a CRAM record, each record field is associated with a specific flag, and because each CRAM record can contain different types of data there is no notion of context, so each flag must always have the same meaning. This coding mechanism introduces redundant information and prevents the use of efficient context-based entropy coding.
In contrast, in the present invention there is no notion of a flag denoting data, because this is intrinsically defined by the information "block" to which the data belongs. This implies a largely reduced number of symbols to be used and a consequent reduction of the entropy of the information source, which results in more efficient compression. This improvement is possible because the use of different "blocks" enables the encoder to reuse the same symbol across blocks, with a different meaning according to the context. In CRAM, each flag must always have the same meaning because there is no notion of context and each CRAM record can contain any type of data.
5. In CRAM, substitutions, insertions, and deletions are represented using different descriptors, an option that increases the size of the information source alphabet and yields a higher source entropy. In contrast, the approach of the disclosed invention uses a single alphabet and a single coding for substitutions, insertions, and deletions. This makes the coding and decoding process simpler and produces a lower-entropy source model, whose coding yields bitstreams characterized by high compression performance.
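The entropy argument in items 4 and 5 can be illustrated numerically. The Python sketch below is a toy under stated assumptions, not the source model of the invention: the two value streams are invented, the `("pos", v)` and `("snp", v)` tags stand in for per-field flags, and only zeroth-order Shannon entropy is measured.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Zeroth-order Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical data: two kinds of values with different statistics.
positions = [0, 0, 0, 1, 0, 0, 2, 0]      # e.g. small position deltas
subst_types = [3, 3, 2, 3, 3, 3, 2, 3]    # e.g. substitution codes

# Record-style coding: a single stream in which every value carries a flag
# naming its kind, so the alphabet is the union of both kinds.
mixed = [("pos", v) for v in positions] + [("snp", v) for v in subst_types]
mixed_cost = len(mixed) * entropy_bits(mixed)

# Block-style coding: each block is modelled on its own, with no flag.
block_cost = (len(positions) * entropy_bits(positions)
              + len(subst_types) * entropy_bits(subst_types))

print(f"mixed stream: {mixed_cost:.2f} bits, separate blocks: {block_cost:.2f} bits")
```

With these two equally long streams the mixed coding costs exactly one extra bit per symbol, which is the information carried by the flag itself.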

The present invention aims to compress genomic sequences by classifying and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.

The features of the following claims solve the problems of existing prior-art solutions by providing:

A method for encoding genomic sequence data, the genomic sequence data comprising reads of sequences of nucleotides, the method comprising:
aligning the reads to one or more reference sequences, thereby creating aligned reads;
classifying the aligned reads according to specified matching rules with the one or more reference sequences, thereby creating classes of aligned reads;
encoding the classified aligned reads as a multiplicity of blocks of descriptors,
wherein encoding the classified aligned reads as a multiplicity of blocks of descriptors comprises selecting the descriptors according to the class of the aligned reads; and
structuring the blocks of descriptors with header information, thereby creating successive access units.

In another aspect, the encoding method further comprises classifying the reads that do not satisfy the specified matching rules into a class of unmapped reads;
constructing a set of reference sequences using at least some of the unmapped reads;
aligning the class of unmapped reads to the set of constructed reference sequences;
encoding the classified aligned reads as a multiplicity of blocks of descriptors;
encoding the set of constructed reference sequences; and
structuring the blocks of descriptors and the encoded reference sequences with header information, thereby creating successive access units.

In another aspect, the encoding method further comprises classifying genomic reads that have no mismatches with respect to the reference sequence as a first "Class P".

In another aspect, the encoding method further comprises classifying genomic reads as a second "Class N" when mismatches are found only at positions where the sequencing machine was unable to call any "base" and the number of mismatches in each read does not exceed a given threshold.

In another aspect, the encoding method further comprises classifying genomic reads as a third "Class M" when mismatches are found at positions where the sequencing machine was unable to call any "base" (named "n-type" mismatches) and/or at positions where it called a "base" different from the reference sequence (named "s-type" mismatches), and the number of mismatches does not exceed given thresholds for the number of "n-type" mismatches, for the number of "s-type" mismatches, and for a threshold obtained from a given function f(n, s) computed on the numbers of "n-type" and "s-type" mismatches.

In another aspect, the encoding method further comprises classifying genomic reads as a fourth "Class I" when they may have the same types of mismatches as "Class M" and, in addition, at least one mismatch of type "insertion" ("i-type"), "deletion" ("d-type"), or soft clip ("c-type"), wherein the number of mismatches of each type does not exceed the corresponding given threshold and a threshold given by a function w(n, s, i, d, c) computed on the numbers of "n-type", "s-type", "i-type", "d-type", and "c-type" mismatches.

In another aspect, the encoding method further comprises classifying genomic reads as a fifth "Class U" comprising all reads that do not fall into any of Classes P, N, M, or I.
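The five classes can be read as a cascade of tests on per-read mismatch counts. The Python sketch below is one possible reading, not the claimed procedure: the `Mismatches` and `Thresholds` types, every threshold value, and the choice of `f` and `w` as plain sums are assumptions made for illustration, since the text leaves the thresholds and both functions open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mismatches:
    n: int = 0  # positions where no base was called ("n-type")
    s: int = 0  # substitutions ("s-type")
    i: int = 0  # insertions ("i-type")
    d: int = 0  # deletions ("d-type")
    c: int = 0  # soft clips ("c-type")

@dataclass(frozen=True)
class Thresholds:  # hypothetical values, not taken from the text
    max_n: int = 4
    max_s: int = 6
    max_f: int = 8
    max_indel_each: int = 3
    max_w: int = 12

def f(n, s):
    """Assumed form of f(n, s): total of n-type and s-type mismatches."""
    return n + s

def w(n, s, i, d, c):
    """Assumed form of w(n, s, i, d, c): total of all mismatch types."""
    return n + s + i + d + c

def classify_read(m, t=Thresholds(), mapped=True):
    """Return 'P', 'N', 'M', 'I' or 'U' for one read."""
    if not mapped:
        return "U"
    if not (m.i or m.d or m.c):
        if m.n == 0 and m.s == 0:
            return "P"                      # perfect match
        if m.s == 0 and m.n <= t.max_n:
            return "N"                      # only uncalled bases
        if m.n <= t.max_n and m.s <= t.max_s and f(m.n, m.s) <= t.max_f:
            return "M"                      # uncalled bases and/or substitutions
        return "U"
    within = (m.n <= t.max_n and m.s <= t.max_s
              and max(m.i, m.d, m.c) <= t.max_indel_each
              and w(m.n, m.s, m.i, m.d, m.c) <= t.max_w)
    return "I" if within else "U"           # indels or soft clips present
```

A read that exceeds any threshold of the class it would otherwise join falls through to Class U, matching the "all reads that do not fall into any of Classes P, N, M, or I" wording.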

In another aspect, in the encoding method, the reads of the genomic sequence to be encoded are paired.

In another aspect, in the encoding method, the classifying further comprises classifying genomic reads as a sixth "Class HM" comprising all read pairs in which one read belongs to Class P, N, M, or I and the other read belongs to "Class U".

In another aspect, the encoding method further comprises:
identifying whether the two mate reads are classified in the same class (each of P, N, M, I, U) and, if so, assigning the pair to that same class;
identifying whether the two mate reads are classified in different classes and, when neither of them belongs to "Class U", assigning the pair to the class with the highest priority according to the expression
P < N < M < I
where "Class P" has the lowest priority and "Class I" has the highest priority; and
identifying whether only one of the two mate reads is classified as belonging to "Class U" and, if so, classifying the pair as belonging to "Class HM".
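The three pairing rules reduce to a few lines of code. The sketch below is a minimal Python illustration; the function name `classify_pair` and the use of one-letter strings for classes are assumptions, while the priority order P < N < M < I and the HM rule come from the text.

```python
# Priority order stated in the text: P < N < M < I.
CLASS_PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}

def classify_pair(class_a, class_b):
    """Class of a read pair, given the classes of its two mates."""
    if class_a == class_b:
        return class_a                      # same class, including U/U
    if "U" in (class_a, class_b):
        return "HM"                         # exactly one mate is unmapped
    return max(class_a, class_b, key=CLASS_PRIORITY.__getitem__)
```

For example, `classify_pair("N", "I")` yields `"I"` and `classify_pair("U", "M")` yields `"HM"`.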

In another aspect, in the encoding method, each of the read classes N, M, and I is further partitioned into two or more subclasses (296, 297, 298) according to a vector of thresholds (292, 293, 294) defined for Classes N, M, and I, respectively, on the number of "n-type" mismatches (292), the function f(n, s) (293), and the function w(n, s, i, d, c) (294).

The encoding method further comprises:
identifying whether the two mate reads are classified in the same subclass and, if so, assigning the pair to that subclass;
identifying whether the two mate reads are classified in subclasses of different classes and, if so, assigning the pair to the subclass belonging to the class of higher priority according to the expression
N < M < I
where N has the lowest priority and I has the highest priority; and
identifying whether the two mate reads are classified in the same class, that class being N, M, or I, but in different subclasses and, if so, assigning the pair to the subclass with the highest priority according to the expressions
N_1 < N_2 < ... < N_k
M_1 < M_2 < ... < M_j
I_1 < I_2 < ... < I_h
where the highest index has the highest priority.
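A threshold vector turns a class into an ordered list of subclasses, and the pairing rules then compare class first and subclass index second. The Python sketch below illustrates both steps under assumptions: the helper names, the example threshold vector, and the convention that a read joins the first subclass whose threshold it does not exceed are not specified by the text.

```python
from bisect import bisect_left

# Priority order stated in the text: N < M < I.
CLASS_PRIORITY = {"N": 0, "M": 1, "I": 2}

def subclass_index(metric, thresholds):
    """1-based index of the first threshold that `metric` does not exceed.

    `thresholds` is an increasing vector; `metric` is the number of n-type
    mismatches for Class N, f(n, s) for Class M, or w(n, s, i, d, c) for Class I.
    """
    k = bisect_left(thresholds, metric)
    if k == len(thresholds):
        raise ValueError("metric exceeds the last threshold of the class")
    return k + 1

def pair_subclass(a, b):
    """Subclass of a pair; `a` and `b` are (class, subclass_index) tuples."""
    return max(a, b, key=lambda x: (CLASS_PRIORITY[x[0]], x[1]))

n_thresholds = [1, 2, 4]  # hypothetical vector defining N_1, N_2, N_3
```

With this vector, a read with three n-type mismatches lands in N_3, and a pair whose mates fall in N_3 and M_1 is assigned to M_1 because the class priority is compared before the subclass index.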

In another aspect, information on the mapping position of each read is coded by a pos descriptor block.

In another aspect, information on the strandedness of each read (i.e., the DNA strand from which the read was sequenced) is coded by an rcomp descriptor block.

In another aspect, pairing information for paired-end reads is coded by a pair descriptor block.

In another aspect, additional alignment information, such as whether the read is mapped in a proper pair, fails platform/vendor quality checks, is a PCR or optical duplicate, or is a supplementary alignment, is coded by a flag descriptor block.

In another aspect, information on unknown bases is coded by an nmis descriptor block.

In another aspect, information on the positions of substitutions is coded by an snpp descriptor block.

In another aspect, information on the types of substitutions is coded by a specific snpt descriptor block.

In another aspect, information on the positions of mismatches that are substitutions, insertions, or deletions is coded by an indp descriptor block.

In another aspect, information on the types of mismatches, such as substitutions, insertions, or deletions, is coded by an indt descriptor block.

In another aspect, information on clipped bases of mapped reads is coded by an indc descriptor block.

In another aspect, information on unmapped reads is coded by a ureads descriptor block.

In another aspect, information on the type of reference sequence used for encoding is coded by an rtype descriptor block.

In another aspect, information on multiple alignments of the mapped reads is coded by an mmap descriptor block.

In another aspect, information on spliced alignments and multiple alignments of the same read is coded by an msar descriptor block.

In another aspect, information on read alignment scores is coded by an mscore descriptor block.

In another aspect, information on the group to which a read belongs is coded by an "rgroup" descriptor block.
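Taken together, the descriptor aspects describe a columnar layout: values of the same descriptor are gathered into one block per class rather than being stored read by read. The Python sketch below illustrates that grouping only. The `DESCRIPTORS_BY_CLASS` table is hypothetical (the text states that the selection of descriptors depends on the class, not which descriptors each class uses), and the dictionary-based read representation is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical selection of descriptors per class, for illustration only.
DESCRIPTORS_BY_CLASS = {
    "P": ("pos", "rcomp"),
    "M": ("pos", "rcomp", "snpp", "snpt"),
}

def build_blocks(reads):
    """Group descriptor values into one block per (class, descriptor)."""
    blocks = defaultdict(list)
    for read in reads:
        for name in DESCRIPTORS_BY_CLASS[read["class"]]:
            blocks[(read["class"], name)].append(read[name])
    return dict(blocks)

reads = [
    {"class": "P", "pos": 100, "rcomp": 0},
    {"class": "M", "pos": 104, "rcomp": 1, "snpp": [7], "snpt": ["A"]},
    {"class": "P", "pos": 130, "rcomp": 1},
]
blocks = build_blocks(reads)
```

Because Class P reads never contribute to an snpp or snpt block, a query for perfectly matching reads only needs the `("P", ...)` blocks.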

In another aspect, in the encoding method, the blocks of descriptors comprise a master index table containing one section for each class and subclass of aligned reads, the section containing the mapping positions on the one or more reference sequences of the first read of each access unit of each class or subclass of data; the master index table and the access unit data are coded jointly.
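A table of first-read mapping positions per access unit is enough to locate the access units relevant to a genomic region with a binary search, without decoding any reads. The Python sketch below shows this lookup under assumptions: the table contents, the function name, and the premise that the access units of a class cover increasing, non-overlapping position ranges are all illustrative.

```python
from bisect import bisect_right

# Hypothetical master index table: for each class, the mapping position of the
# first read of every access unit, in increasing order.
master_index = {
    "P": [0, 5_000, 12_000, 20_000],
    "M": [0, 9_000, 18_000],
}

def access_units_for_region(index, read_class, start, end):
    """Indices of access units that may hold reads mapped within [start, end]."""
    firsts = index[read_class]
    lo = max(bisect_right(firsts, start) - 1, 0)  # unit containing `start`
    hi = bisect_right(firsts, end)                # first unit beginning after `end`
    return list(range(lo, hi))
```

For the region 6,000 to 13,000, only the second and third Class P access units need to be fetched and decoded.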

In another aspect, in the encoding method, the blocks of descriptors further comprise information on the type of reference used (pre-existing or constructed) and on the segments of the reads that do not map to the reference sequence.

In another aspect, in the encoding method, the reference sequence is first transformed into a different reference sequence by applying substitutions, insertions, deletions, and clipping, and the encoding of the classified aligned reads as a multiplicity of blocks of descriptors then refers to the transformed reference sequence.
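Transforming the reference pays off when many reads share the same difference from it: once that difference is applied to the reference, those reads match exactly and need fewer mismatch descriptors. The Python sketch below demonstrates this for substitutions only; insertions, deletions, and clipping are omitted, and the sequences, positions, and function names are invented for the example.

```python
def apply_substitutions(reference, substitutions):
    """Return a copy of `reference` with {position: base} substitutions applied."""
    bases = list(reference)
    for position, base in substitutions.items():
        bases[position] = base
    return "".join(bases)

def count_mismatches(reference, read, position):
    """Number of substitutions between `read` and the reference at `position`."""
    window = reference[position:position + len(read)]
    return sum(1 for r, b in zip(window, read) if r != b)

reference = "ACGTACGTAC"
# (sequence, mapping position); every read carries a G at reference position 5.
reads = [("GTAGGT", 2), ("TAGGTA", 3), ("ACGTAG", 0)]

transformed = apply_substitutions(reference, {5: "G"})
before = [count_mismatches(reference, seq, pos) for seq, pos in reads]
after = [count_mismatches(transformed, seq, pos) for seq, pos in reads]
```

Against the original reference each read has one substitution; against the transformed reference all three match perfectly, at the cost of coding a single reference transformation.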

In another aspect, in the encoding method, the same transformation is applied to the reference sequence for all classes of data.

In another aspect, in the encoding method, a different transformation is applied to the reference sequence for each class of data.

In another aspect, in the encoding method, the reference sequence transformations are encoded as blocks of descriptors and structured with header information, thereby creating successive access units.

In another aspect, in the encoding method, the encoding of the classified aligned reads and of the related reference sequence transformations as a multiplicity of blocks of descriptors comprises the step of associating a specific source model and a specific entropy coder with each descriptor block.

In another aspect, in the encoding method, the entropy coder is one of a context-adaptive arithmetic coder, a variable-length coder, or a Golomb coder.
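Of the three entropy coders named, the Golomb coder is the simplest to show in full. The Python sketch below implements Golomb-Rice coding, the special case of Golomb coding in which the divisor is a power of two (2**k). It is offered as an illustration of the coder family only: the parameter `k`, the bit-string representation, and the example values are assumptions, and nothing here reflects how the invention configures its coders.

```python
def rice_encode(value, k):
    """Golomb-Rice code of a non-negative integer with divisor 2**k, as a bit string."""
    q, r = value >> k, value & ((1 << k) - 1)
    remainder_bits = format(r, f"0{k}b") if k else ""
    return "1" * q + "0" + remainder_bits   # unary quotient, then k-bit remainder

def rice_decode(bits, k):
    """Decode one value from the front of `bits`; return (value, remaining bits)."""
    q = bits.index("0")                     # length of the unary prefix
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r, bits[q + 1 + k:]

# Small non-negative values, such as position deltas, get short codes.
values = [0, 3, 9, 4]
stream = "".join(rice_encode(v, 2) for v in values)

decoded, rest = [], stream
while rest:
    v, rest = rice_decode(rest, 2)
    decoded.append(v)
```

With k = 2 the value 9 is coded as `11001`: the quotient 2 in unary (`110`) followed by the remainder 1 in two bits (`01`).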

The present invention further provides a method for decoding encoded genomic data, comprising:
parsing access units containing the encoded genomic data to extract multiplexed blocks of descriptors by employing header information; and
decoding the multiplexed blocks of descriptors to extract reads according to specific matching rules defining their classification with respect to one or more reference sequences.

In another aspect, the decoding method further comprises decoding unmapped genomic reads.

In another aspect, the decoding method further comprises decoding classified genomic reads.

In another aspect, the decoding method further comprises decoding a master index table containing one section for each class of reads together with the associated mapping positions.

In another aspect, the decoding method further comprises decoding information related to the type of reference used: pre-existing, transformed, or constructed.

In another aspect, the decoding method further comprises decoding information related to one or more transformations to be applied to the pre-existing reference sequence.

In another aspect, in the decoding method, the genomic reads are paired.

In another aspect, in the decoding method, the genomic data are entropy decoded.

The present invention further provides a genomic encoder (210) for compressing genomic sequence data (209), the genomic sequence data (209) comprising reads of sequences of nucleotides, the genomic encoder (210) comprising:
an aligner unit (201) configured to align the reads to one or more reference sequences, thereby creating aligned reads;
a constructed-reference generation unit (202) configured to produce constructed reference sequences;
a data classification unit (204) configured to classify the aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences, thereby creating classes of aligned reads (208);
one or more block encoding units (205-207) configured to encode the classified aligned reads as blocks of descriptors by selecting the descriptors according to the classes of aligned reads; and
a multiplexer (2016) for multiplexing the compressed genomic data and metadata.

In another aspect, the genomic encoder further comprises a reference sequence transformation unit (2019) configured to transform the pre-existing references and data classes (208) into transformed data classes (2018).

In another aspect, in the genomic encoder, the data classification unit (204) contains encoders of data classes N, M, and I configured with vectors of thresholds that generate subclasses of data classes N, M, and I.

In another aspect, in the genomic encoder, the reference transformation unit (2019) applies the same reference transformation (300) to all classes and subclasses of data.

In another aspect, in the genomic encoder, the reference transformation unit (2019) applies different reference transformations (301, 302, 303) to different classes and subclasses of data.

In another aspect, the genomic encoder further comprises features suitable for carrying out all aspects of the encoding methods described above.

The present invention further provides a genomic decoder (218) for decompressing a compressed genomic stream (211), the genomic decoder (218) comprising:
a demultiplexer (210) for demultiplexing the compressed genomic data and metadata;
parsing means (212-214) configured to parse the compressed genomic stream into genomic blocks of descriptors (215);
one or more block decoders (216-217) configured to decode the genomic blocks of descriptors into classified reads of sequences of nucleotides (211); and
a genomic data class decoder (219) configured to selectively decode the classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.

別の態様において、ゲノムデコーダは、リファレンス変換記述子（２１１２）をデコード化し、ゲノムデータクラスデコーダ（２１９）によって使用される変換済みのリファレンス（２１１４）を生成するように構成されたリファレンス変換デコーダ（２１１３）をさらに含む。 In another aspect, the genomic decoder (2112) decodes the reference translation descriptor (2112) and generates a translated reference (2114) for use by the genomic data class decoder (219). 2113).

In another aspect, in the genomic decoder, the one or more reference sequences are stored in the compressed genomic stream (211).

In another aspect, in the genomic decoder, the one or more reference sequences are provided to the decoder via an out-of-band mechanism.

In another aspect, in the genomic decoder, the one or more reference sequences are built at the decoder.

In another aspect, in the genomic decoder, the one or more reference sequences are transformed at the decoder by a reference transformation decoder (2113).

The present invention further provides a computer-readable medium comprising instructions that, when executed, cause at least one processor to perform all aspects of the coding methods described above.

The present invention further provides a computer-readable medium comprising instructions that, when executed, cause at least one processor to perform all aspects of the decoding methods described above.

The present invention further provides support data storing genomic data encoded according to all aspects of the coding methods described above.

One aspect of the proposed approach is the definition of classes of data and metadata that are structured in different blocks and encoded separately. The most relevant improvements of this approach with respect to existing methods are:
1. an increase in compression performance, due to the reduction in the entropy of the information sources obtained by providing an efficient source model for each class of data or metadata;
2. the possibility of performing selective access to portions of the compressed data and metadata for further processing, directly in the compressed domain;
3. the possibility of updating the compressed data and metadata incrementally (i.e. without the need for decoding and re-encoding) with new sequencing data and/or metadata and/or new analysis results associated with specific sets of sequence reads.

Shows how the positions of mapped read pairs are coded in the "pos" block as differences from the absolute position of the first mapped read.
Shows how the two reads of a pair originate from the two DNA strands.
Shows how the reverse complement of read 2 is coded when strand 1 is used as the reference.
Shows the four possible combinations of reads composing a read pair and their respective coding in the "rcomp" block.
Shows how the pairing distance is calculated for three read pairs in the case of constant read length.
Shows how the pairing errors coded in the "pair" block enable the decoder to reconstruct the correct read pairing using the coded "MPPPD".
Shows the coding of the pairing distance when a read is mapped to a different reference than its mate. In this case additional descriptors are added to the pairing distance: a signaling flag, a reference identifier, and the pairing distance.
Shows the coding of "n type" mismatches in the "nmis" block.
Shows a mapped read pair presenting substitutions with respect to the reference sequence.
Shows how the positions of substitutions are calculated either as absolute or as differential values.
Shows how the symbols coding the types of substitution are calculated when IUPAC codes are not used. The symbol represents the distance, in the circular substitution vector, between the molecule present in the read and the molecule present in the reference at that position.
Shows how substitutions are coded in the "snpt" block.
Shows how substitution codes are calculated when IUPAC ambiguity codes are used.
Shows how the "snpt" block is coded when IUPAC codes are used.
Shows that the substitution vector used for reads of class I is the same as for class M, with special codes added for the insertion of the symbols A, C, G, T and N.
Shows examples of the coding of mismatches and indels in the case of IUPAC ambiguity codes. In this case the substitution vector is much longer, so the possible calculated symbols are more numerous than in the case of five symbols.
Shows a different source model for mismatches and indels, in which each block contains the positions of mismatches or insertions of a single type. In this case no symbols are coded for the type of mismatch or indel.
Shows an example of the coding of mismatches and indels. When no mismatch or indel of a given type is present in a read, a 0 is coded in the corresponding block. The 0 acts as separator and terminator in each block.
Shows how a modification of the reference sequence can transform M reads into P reads. This operation can reduce the information entropy of the data structure, especially in the case of high-coverage data.
Shows a genomic encoder 2010 according to one embodiment of the present invention.
Shows a genomic decoder 218 according to one embodiment of the present invention.
Shows how an "internal" reference can be built by clustering reads and assembling the segments taken from each cluster.
Shows how a reference can be built by storing the most recent reads after a specific sorting (e.g. lexicographic order) has been applied to the reads.
Shows how a read belonging to the class of "unmapped" reads (class U) can be coded using six descriptors stored or carried in the corresponding blocks.
Shows an alternative coding of reads belonging to class U, in which a coded pos descriptor is used to code the mapping position of a read on the constructed reference.
Shows how reference transformations can be applied to remove mismatches from reads. In some cases a reference transformation may generate new mismatches or change the type of the mismatches found when referring to the reference before the transformation was applied.
Shows how reference transformations can change the class to which reads belong when all or a subset of the mismatches are removed (i.e. a read belonging to class M before the transformation is assigned to class P after the transformation of the reference has been applied).
Shows how half-mapped read pairs (class HM) can be used to fill unknown regions of a reference sequence by building longer contigs with the unmapped reads.
Shows how the encoders of data of classes N, M and I are configured with vectors of thresholds and generate separate sub-classes of the N, M and I data classes.
Shows how all classes of data can use the same transformed reference for re-encoding, or how a different transformation can be used for each of classes N, M and I or any combination thereof.
Shows the structure of a genomic dataset header.
Shows the general structure of a master index table (MIT). Each row contains the genomic intervals of several classes of data (P, N, M, I, U, HM) and pointers to metadata and annotations. The columns refer to specific positions on the reference sequences to which the encoded genomic data relate.
Shows an example of one row of the MIT containing the genomic intervals related to reads of class P. Genomic regions related to different reference sequences are separated by a special flag ("S" in the example).
Shows the general structure of the local index table (LIT) and how it is used to store pointers to the physical location of the encoded genomic information within the stored or transmitted data.
Shows an example of a LIT used to access access units no. 7 and 8 in the block payload.
Shows the functional relationship among the several rows of the MIT and the LIT contained in the genomic block header.
Shows how an access unit is composed of several blocks of genomic data carried by different genomic streams containing data belonging to different classes. Each block is further composed of data packets used as data transmission units.
Shows how an access unit is composed of a header and of multiplexed blocks belonging to one or more blocks of homogeneous data. Each block can be composed of one or more packets containing the actual descriptors of the genomic information.
Shows multiple alignments without splices. The left-most read has N alignments. N is the first value of mmap to be decoded and signals the number of alignments of the first read. The following N values of the mmap descriptor are decoded and used to calculate P, the number of alignments of the second read.
Shows how the pos, pair and mmap descriptors are used to encode multiple alignments without splices. The left-most read has N alignments.
Shows multiple alignments with splices.
Shows how the pos, pair, mmap and msar descriptors are used to represent multiple alignments with splices.

Genomic or proteomic sequences referred to in this invention include, for example and without limitation, nucleotide sequences, deoxyribonucleic acid (DNA) sequences, ribonucleic acid (RNA) sequences, and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of nucleotide sequences, it will be understood that the methods and systems for compression can be applied to other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.

Genome sequencing information is generated by high-throughput sequencing (HTS) machines in the form of sequences of nucleotides (also called "bases") represented by strings of letters from a defined vocabulary. The smallest vocabulary consists of five symbols, {A, C, G, T, N}. The first four represent the four types of nucleotide present in DNA, namely adenine, cytosine, guanine and thymine; in RNA, thymine is replaced by uracil (U). N indicates that the sequencing machine was not able to call any base, so the real nature of the position is undetermined. When IUPAC ambiguity codes are adopted by the sequencing machine, the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N).
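
By way of illustration only, the two alphabets described above can be written down and checked as in the following Python sketch. The names MINIMAL_ALPHABET, IUPAC_ALPHABET and uses_only are hypothetical and do not appear in the disclosure; the sixteen-letter set follows the standard IUPAC nucleotide notation, which includes B.

```python
# Illustrative sketch only; all names here are assumptions, not part of the disclosure.

# Five-symbol vocabulary: four DNA nucleotides plus N for an uncalled base.
MINIMAL_ALPHABET = frozenset("ACGTN")

# Standard IUPAC nucleotide notation (includes U for RNA and the ambiguity codes).
IUPAC_ALPHABET = frozenset("ACGTUWSMKRYBDHVN")


def uses_only(read: str, alphabet: frozenset) -> bool:
    """Return True if every symbol of the read belongs to the given alphabet."""
    return set(read.upper()) <= alphabet
```

A read such as "ACRT" would be rejected by the five-symbol vocabulary and accepted by the IUPAC alphabet, since R is an ambiguity code.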

The nucleotide sequences produced by sequencing machines are called "reads". Sequence reads can be from a few dozen to several thousand nucleotides long. Some technologies produce sequence reads in "pairs", where one read comes from one DNA strand and the second read comes from the other strand. In genome sequencing, the term "coverage" expresses the level of redundancy of the sequence data with respect to a "reference sequence". For example, to reach a coverage of 30x on a human genome (3.2 billion bases long), a sequencing machine must produce a total of 30 x 3.2 billion bases, so that on average each position of the reference is "covered" 30 times.
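
The arithmetic of the coverage example can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the fixed read length used in the usage note is an assumption that does not come from the disclosure.

```python
import math


def bases_for_coverage(genome_length: int, coverage: float) -> float:
    """Total number of sequenced bases needed so that each reference
    position is covered `coverage` times on average."""
    return genome_length * coverage


def reads_for_coverage(genome_length: int, coverage: float, read_length: int) -> int:
    """Number of fixed-length reads needed to reach the coverage (rounded up)."""
    return math.ceil(genome_length * coverage / read_length)
```

For a genome of 3.2 billion bases at 30x, bases_for_coverage gives 96 billion bases; with an assumed read length of 100 bases, this corresponds to 960 million reads.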

Throughout this disclosure, a reference sequence is any sequence on which the nucleotide sequences produced by sequencing machines are aligned/mapped. One example of such a sequence is a "reference genome", a sequence assembled by scientists as a representative example of a species' set of genes. For example, GRCh37, the Genome Reference Consortium human genome (build 37), is derived from thirteen anonymous volunteers from Buffalo, New York. However, a reference sequence can also consist of a synthetic sequence conceived and constructed merely to improve the compressibility of the reads in view of their further processing. This is described in more detail in the section "Descriptors for class U and construction of "internal" references for unmapped reads of "Class U" and "Class HM"" and illustrated in FIGS. 22 and 23.

Sequencing devices can introduce the following errors in sequence reads:
1. the decision to skip a base call because of a lack of confidence in calling any specific base; this is called an unknown base and is labeled "N" (denoted as a mismatch of "n type");
2. the use of a wrong symbol (i.e. one representing a different nucleic acid) to represent the nucleic acid actually present in the sequenced sample; this is usually called a "substitution error" (denoted as a mismatch of "s type");
3. the insertion into a sequence read of additional symbols that do not refer to any nucleic acid actually present; this is usually called an "insertion error" (denoted as a mismatch of "i type");
4. the deletion from a sequence read of symbols that represent nucleic acids actually present in the sequenced sample; this is usually called a "deletion error" (denoted as a mismatch of "d type");
5. the recombination of one or more fragments into a single fragment that does not reflect the reality of the originating sequence; this usually results in a decision by the aligner to clip bases (denoted as a mismatch of "c type").

The term "coverage" is used in the literature to quantify the extent to which a reference genome, or a part of it, can be covered by the available sequence reads. Coverage is said to be:
• partial (less than 1x) when some parts of the reference genome are not mapped by any available sequence read;
• single (1x) when every nucleotide of the reference genome is mapped by one and only one symbol present in the sequence reads;
• multiple (2x, 3x, Nx) when each nucleotide of the reference genome is mapped multiple times.
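
The three coverage categories can be illustrated with the following Python sketch, which counts how many reads cover each reference position and then labels the result. It is a simplified illustration under stated assumptions: reads are given as hypothetical (start, length) pairs with 0-based starts, and the label "mixed" for a reference whose positions are covered unevenly (some once, some several times) is an assumption of this sketch, since the text does not name that case.

```python
from typing import Iterable, List, Tuple


def depth_per_position(ref_length: int, reads: Iterable[Tuple[int, int]]) -> List[int]:
    """Count how many reads cover each reference position.

    Each read is a (start, length) pair with a 0-based start (assumed convention).
    """
    depth = [0] * ref_length
    for start, length in reads:
        # Ignore any part of a read that falls outside the reference.
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    return depth


def coverage_category(depth: List[int]) -> str:
    """Label a per-position depth profile with the categories used in the text."""
    if not depth:
        raise ValueError("the reference must contain at least one position")
    if min(depth) == 0:
        return "partial"   # some position is not mapped by any read
    if max(depth) == 1:
        return "single"    # every position is mapped exactly once
    if min(depth) >= 2:
        return "multiple"  # every position is mapped more than once
    return "mixed"         # assumption: uneven depth, not named in the text
```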

The present invention aims at defining a genomic information representation format in which the relevant information is efficiently accessible and transportable and the weight of redundant information is reduced.

The main innovative aspects of the disclosed invention are the following.
1. Sequence reads are classified and partitioned into data classes according to the results of the alignment with respect to reference sequences. Such classification and partitioning enable selective access to the encoded data according to criteria related to the alignment results and to the matching accuracy.
2. The classified sequence reads and the associated metadata are represented by homogeneous blocks of descriptors, so as to obtain distinct information sources characterized by low information entropy.
3. The possibility of modeling each separated information source with a distinct source model adapted to the statistical characteristics of each class, and the possibility of changing the source model within each class of reads and within each descriptor block of each separately accessible data unit (access unit). The adoption of the appropriate context-adaptive probability models and associated entropy coders according to the statistical properties of each source model.
4. The definition of correspondences and dependencies among the descriptor blocks, which enables selective access to the sequencing data and associated metadata without the need to decode all the descriptor blocks when not all of the information is required.
5. The coding of each sequence data class and of the associated metadata blocks with respect to a "pre-existing" (also denoted as "external") reference sequence, or with respect to a "transformed" reference sequence obtained by applying appropriate transformations to the "pre-existing" reference sequence so as to reduce the entropy of the information sources of the descriptor blocks. The descriptors represent the reads partitioned into the different data classes. Following the encoding of the reads using the corresponding descriptors with reference to a "pre-existing" reference sequence or to a "transformed" "pre-existing" reference sequence, the occurrence of the various mismatches can be used to define appropriate transformations of the reference sequence, in order to find a final coded representation with lower entropy and to achieve higher compression efficiency.
6. The construction of one or more reference sequences (denoted as "internal" references, to distinguish them from the "pre-existing" reference sequences, which are also referred to herein as "external" reference sequences), used to encode the class of reads whose degree of matching accuracy with respect to the pre-existing reference sequences does not satisfy a set of constraints. Such constraints are set with the aim that the coding cost of representing in compressed form the class of reads aligned with respect to the "internal" reference sequences, plus the cost of representing the "internal" reference sequences themselves, is lower than the cost of encoding the class of unaligned reads verbatim or of using the "external" reference sequences with or without transformation.
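
The cost criterion stated for the "internal" reference can be illustrated with the following Python sketch. This is only a schematic reading of the criterion under stated assumptions: the function name and parameters are hypothetical, and the costs are assumed to be already known sizes in bits, whereas the disclosure does not specify how such costs are estimated.

```python
from typing import Optional


def prefer_internal_reference(
    reads_vs_internal_bits: int,
    internal_reference_bits: int,
    verbatim_bits: int,
    reads_vs_external_bits: Optional[int] = None,
) -> bool:
    """Return True when coding against a constructed "internal" reference is cheaper.

    The internal option pays for both the reads coded against the internal
    reference and the internal reference itself. It is compared with verbatim
    coding of the reads and, when available, with coding against an "external"
    reference (with or without transformation).
    """
    internal_total = reads_vs_internal_bits + internal_reference_bits
    alternatives = [verbatim_bits]
    if reads_vs_external_bits is not None:
        alternatives.append(reads_vs_external_bits)
    return internal_total < min(alternatives)
```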

Each of the above aspects is described in further detail below.
[Classification of sequence reads according to matching rules]

According to the disclosed invention, the sequence reads generated by sequencing machines are classified into six different "classes" according to the matching results of the alignment with respect to one or more "pre-existing" reference sequences.

ヌクレオチドのＤＮＡシーケンスをリファレンスシーケンスに対してアライメントさせる場合、次のケースを特定できる：
１．リファレンスシーケンス内のある領域は、エラーを伴わないシーケンスリードと一致することが分かる（すなわち、完全なマッピング）そのようなヌクレオチドのシーケンスは、「完全にマッチングするリード」と呼ばれるか、「クラスＰ」と表示される。
２．リファレンスシーケンスのある領域は、リードを生成するシーケンシング装置が塩基（又はフクレオチド）を呼び出すことができなかった数と位置によってのみ決定されるミスマッチの数と類型を伴うシーケンスリードと一致することが分かる。そのような類型のミスマッチは、未定義のヌクレオチド塩基を示すために使用される文字「Ｎ」で示される。本明細書では、この類型のミスマッチを「ｎタイプ」ミスマッチと呼ぶ。このようなシーケンスは「クラスＮ」リードに属する。リードが「クラスＮ」に属すると分類されると、マッチングの不正確さの程度を特定の上限に制限し、有効なマッチングと見なされるものとそうでないものとの境界を設定すると便利である。したがって、クラスＮに割り当てられたリードは、リードに含めることができる未定義の塩基（「Ｎ」と呼ばれる塩基）の最大数を定義するしきい値（ＭＡＸＮ）を設定することによっても制約される。このような分類は、クラスＮに属する全てのリードが、対応するリファレンスシーケンスを参照するときに共有する必要な最小マッチング精度(又は最大マッチング度)を黙示的に定義し、これは、選択的なデータ検索を圧縮データに適用するための有用な基準を構成する。
３．リファレンスシーケンス中のある領域は、リードを生成するシーケンシング装置がいずれのヌクレオチド塩基も呼び出せなかった位置の数、もし存在するならば（すなわち「ｎタイプ」のミスマッチ）、それに加えて、リファレンス中に存在するものとは異なる塩基が呼ばれた不一致の数、によって決定されたミスマッチの数と類型を伴うシーケンスリードと一致することが分かる。「置換」として示されるこのようなミスマッチの類型は、一塩基変異（ＳＮＶ）又は一塩基多型（ＳＮＰ）とも呼ばれる。本明細書では、この類型のミスマッチを「ｓタイプ」ミスマッチと呼ぶ。シーケンスリードは「Ｍミスマッチリード」として参照され、「クラスＭ」に割り当てられる。「クラスＮ」の場合と同様に、「クラスＭ」に属するすべてのリードについても、マッチングの不正確さの程度を特定の上限に制限し、有効なマッチングと見なされるものとそうでないものとの境界を設定すると便利である。したがって、クラスＭに割り当てられたリードは、しきい値のセットを定義することによって制約され、１つは「ｎタイプ」のミスマッチが存在する場合はその数「ｎ」（ＭＡＸＮ）、もう１つは置換の数「ｓ」（ＭＡＸＳ）である。第３の制約は、数値「ｎ」と「ｓ」との両方の関数ｆ（ｎ，ｓ）によって定義されるしきい値である。このような第２の制約は、任意の意味のある選択的アクセス基準に従ってマッチングの不正確さの上限を持つクラスを生成することを可能にする。例えば、限定ではないが、ｆ（ｎ，ｓ）は、（ｎ＋ｓ）１／２、又は（ｎ＋ｓ）、又は「クラスＭ」に属するリードに対して許容されるマッチングの最大不正確さレベルに境界を設定する任意の線形式又は非線形式であり得る。このような境界は、１つの類型又は他の類型に適用される単純なしきい値を超えて、「ｎタイプ」のミスマッチと「ｓタイプ」のミスマッチ（置換）の数の可能な組み合わせにさらなる境界を与えるため、様々な目的のためにシーケンスリードを分析する際に、所望の選択的なデータ検索を、圧縮データに適用するための非常に有用な基準を構成する。
４．第４の分類は、「挿入」、「削除」（インデル（indels）とも呼ばれる）、「クリップ」のいずれかの類型の少なくとも１つのミスマッチを示すシーケンシングリードで構成され、さらに、クラスＮ又はＭに属するミスマッチの類型が存在する場合である。このようなシーケンスは「Ｉミスマッチリード」と呼ばれ、「クラスＩ」に割り当てられる。挿入は、リファレンスには存在しないがリードシーケンスには存在する１つ以上のヌクレオチドの追加の配列によって構成される。本明細書では、この類型のミスマッチを「ｉタイプ」ミスマッチと呼ぶ。挿入されたシーケンスがシーケンスの端にあるとき、文献では、それは「ソフトクリップ」とも呼ばれる（すなわち、ヌクレオチドはリファレンスにマッチングしていないが、廃棄される「ハードクリップ」ヌクレオチドとは対照的に、アライメントされたリードにおいて保持される）。本明細書では、この類型のミスマッチを「ｃタイプ」ミスマッチと呼ぶ。ヌクレオチドの保持又は破棄は、シーケンシング装置によって、又は以下のシーケンシング段階によって決定されるように、リードを受け取り処理する本発明に開示されるリードの分類器によってではなく、アライナ段階によって行われる決定である。シーケンシング装置によって、又は以下のシーケンシング段階によって決定されるように、リードを受信して処理する本発明に開示されるリードの分類器によってではなく、アライナ段階によって行われる決定である。削除は、リファレンスに対するリードにおける「ホール」（ヌクレオチド欠損）である。本書では、このタイプのミスマッチを「ｄタイプ」ミスマッチと呼ぶ。クラス「Ｎ」及び「Ｍ」の場合と同様に、マッチングの不正確さに対する制限を定義することは可能であり、かつ適切である。「クラスＩ」に対する一連の制約の定義は、「クラスＭ」に使用されたものと同じ原則に基づいており、表１の最後の行に示されている。クラスＩのデータに対して許容される各類型のミスマッチに対するしきい値の他に、さらなる制約は、ミスマッチの数「ｎ」、「ｓ」、「ｄ」、「ｉ」及び「ｃ」であり、関数ｗ（ｎ，ｓ，ｄ，ｉ，ｃ）によって決定されるしきい値によって定義される。このような付加的制約は、任意の意味のあるユーザ定義の選択的なアクセス基準に従ってマッチングの不正確さの上限を持つクラスを生成することを可能にする。例えば、これに限定されるものではないが、ｗ（ｎ，ｓ，ｄ，ｉ，ｃ）は、（ｎ＋ｓ＋ｄ＋ｉ＋ｃ）１／５又は（ｎ＋ｓ＋ｄ＋ｉ＋ｃ）、又は「クラスＩ」に属するリードに対して許容されるマッチングの最大不正確レベルに境界を設定する任意の線形式又は非線形式であり得る。このような境界は、この境界は、許容可能なミスマッチの各タイプに適用される単純な閾値を超えて、「クラスＩ」のリードにおいて許容可能なミスマッチの数の任意の可能な組み合わせに対して、さらなる境界を設定することを可能にするため、様々な目的でシーケンスリードを解析するときに、所望の選択的なデータ検索を圧縮データに適用するための非常に有用な基準を構成する。
５．第５の分類は、リファレンスシーケンスを参照するときに、各データクラスに対して有効であると見なされるマッピング（すなわち、表１で指定されたマッチングの最大精度の上限を定義するマッチング規則のセットを満たしていない）を見つけないすべてのリードを含む。このようなシーケンスは、リファレンスシーケンスを参照するときに「マッピングされていない（Unmapped）」と呼ばれ、「クラスＵ」に属するものとして分類される。
［マッチング規則によるリードペアの分類］ When aligning a DNA sequence of nucleotides to a reference sequence, the following cases can be identified:
1. Certain regions within the reference sequence are found to be consistent with error-free sequence reads (ie, perfect mapping). Such sequences of nucleotides are referred to as "perfectly matched reads" or "class P" Is displayed.
2. Certain regions of the reference sequence are found to be consistent with sequence reads with the number and type of mismatches determined solely by the number and location at which the sequencing device generating the read was unable to call the base (or nucleotide). . Such type of mismatch is indicated by the letter "N" used to indicate an undefined nucleotide base. This type of mismatch is referred to herein as an "n-type" mismatch. Such a sequence belongs to a "Class N" lead. When a lead is classified as belonging to "Class N", it is convenient to limit the degree of inaccuracy of the matching to a certain upper limit and to set a boundary between what is considered a valid match and what is not. Therefore, reads assigned to class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (bases called "N") that can be included in the read. . Such a classification implicitly defines the minimum matching accuracy (or maximum matching degree) that all leads belonging to class N need to share when referring to the corresponding reference sequence, which is optional. Configure a useful criterion for applying data retrieval to compressed data.
3. One region in the reference sequence is the number of positions where the sequencing device that generated the read could not call any nucleotide bases, if any (ie, an “n-type” mismatch), plus It can be seen that the number of mismatches called bases different from those present are consistent with the sequence read with the number and type of mismatches determined by the mismatch. Such a type of mismatch, denoted as "substitution", is also called a single nucleotide mutation (SNV) or a single nucleotide polymorphism (SNP). This type of mismatch is referred to herein as an "s-type" mismatch. The sequence read is referred to as “M mismatch read” and is assigned to “class M”. As in the case of "Class N", for all leads belonging to "Class M", the degree of matching inaccuracies is limited to a certain upper limit, and the difference between what is considered a valid match and what is not. It is convenient to set boundaries. Thus, the leads assigned to class M are constrained by defining a set of thresholds, one for the number “n” (MAXN) if there is an “n-type” mismatch, and one for Is the number of substitutions “s” (MAXS). A third constraint is a threshold defined by a function f (n, s) of both numerical values "n" and "s". Such a second constraint makes it possible to generate a class with an upper bound on matching inaccuracies according to any meaningful selective access criteria. For example, but not by way of limitation, f (n, s) bounds to (n + s) 1/2, or (n + s), or the maximum matching inaccuracy level allowed for leads belonging to "Class M". Can be any linear or non-linear expression that sets Such boundaries exceed the simple threshold applied to one or the other type, and further limit the possible combinations of the number of “n-type” mismatches and “s-type” mismatches (permutations). To constitute a very useful criterion for applying the desired selective data retrieval to the compressed data when analyzing sequence reads for various purposes.
4. A fourth class consists of sequencing reads that present at least one mismatch of any type among "insertion", "deletion" (also called indels) and "clipped", plus, if present, any mismatch type belonging to class N or M. Such sequences are referred to as "I mismatching reads" and are assigned to "Class I". Insertions consist of an additional sequence of one or more nucleotides that is not present in the reference but is present in the read sequence. In this document this type of mismatch is referred to as an "i-type" mismatch. When the inserted sequence is at the edges of the read, it is also referred to in the literature as "soft clipped" (i.e. the nucleotides do not match the reference but are kept in the aligned read, in contrast to "hard clipped" nucleotides, which are discarded). In this document this type of mismatch is referred to as a "c-type" mismatch. Keeping or discarding nucleotides is a decision taken by the aligner stage, not by the read classifier disclosed in this invention, which receives and processes the reads as they are determined by the sequencing device or by the subsequent alignment stage. Deletions are "holes" (missing nucleotides) in the read with respect to the reference. In this document this type of mismatch is referred to as a "d-type" mismatch. As for classes "N" and "M", it is possible and appropriate to define a limit on the matching inaccuracy. The definition of the set of constraints for "Class I" is based on the same principles used for "Class M" and is reported in the last rows of Table 1. Besides a threshold for each type of mismatch admissible for class I data, a further constraint is a threshold on a function w(n, s, d, i, c) of the numbers of mismatches "n", "s", "d", "i" and "c". This additional constraint makes it possible to generate classes with an upper bound on matching inaccuracy according to any meaningful user-defined selective access criterion. For example, and not as a limitation, w(n, s, d, i, c) can be (n + s + d + i + c)^(1/5), or (n + s + d + i + c), or any linear or non-linear expression that sets a bound on the maximum matching inaccuracy admitted for a read to belong to "Class I". Because it goes beyond the simple thresholds applied to each type of admissible mismatch and allows a further bound to be set on any possible combination of the numbers of mismatches admitted in a "Class I" read, such a bound is a very useful criterion for applying the desired selective data retrieval to the compressed data when sequence reads are analysed for various purposes.
5. A fifth class includes all reads for which no mapping considered valid for any data class can be found when referring to the reference sequence (i.e. no mapping satisfies the set of matching rules that define the upper bound on matching inaccuracy, as specified in Table 1). Such sequences are said to be "Unmapped" with respect to the reference sequence and are classified as belonging to "Class U".
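The classification rules above can be sketched as a small function. This is only an illustration under stated assumptions: the threshold values are invented, the names MAXD, MAXI and MAXC are hypothetical (the text names only MAXN, MAXS and bounds on f and w), class P is assumed to mean "no mismatch at all", and a read that exceeds the limits of its class is assumed to fall into class U.

```python
# Hypothetical threshold values; Table 1 defines the real constraints.
LIMITS = {"MAXN": 3, "MAXS": 4, "MAXM": 6,
          "MAXD": 2, "MAXI": 2, "MAXC": 5, "MAXTOT": 10}


def classify_read(n, s=0, d=0, i=0, c=0, limits=LIMITS,
                  f=lambda n, s: n + s,
                  w=lambda n, s, d, i, c: n + s + d + i + c):
    """Return 'P', 'N', 'M', 'I' or 'U' from per-type mismatch counts."""
    if n == s == d == i == c == 0:
        return "P"  # perfect match (assumed definition of class P)
    within_n = n <= limits["MAXN"]
    if s == d == i == c == 0:
        return "N" if within_n else "U"
    within_s = s <= limits["MAXS"]
    if d == i == c == 0:
        ok = within_n and within_s and f(n, s) <= limits["MAXM"]
        return "M" if ok else "U"
    ok = (within_n and within_s
          and d <= limits["MAXD"] and i <= limits["MAXI"]
          and c <= limits["MAXC"]
          and w(n, s, d, i, c) <= limits["MAXTOT"])
    return "I" if ok else "U"
```

The functions f and w are passed as parameters because the text leaves them open to any linear or non-linear expression.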
[Classification of read pairs according to matching rules]

The classification specified in the previous section concerns single sequence reads. For sequencing technologies that generate reads in pairs (e.g. Illumina Inc.), in which the two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider classifying the entire pair into a single data class. A read that is coupled with another read is said to be its "mate".

If both reads of a pair belong to the same class, the assignment of the entire pair is obvious: the pair is assigned to that same class, whichever it is (i.e. P, N, M, I or U). If the two reads belong to different classes, but neither belongs to "Class U", the entire pair is assigned to the class with the highest priority, defined according to the following expression:
P < N < M < I
where "Class P" has the lowest priority and "Class I" has the highest priority.

If only one of the reads belongs to "Class U" and its mate belongs to any of the classes P, N or M, a sixth class is defined as "Class HM", which stands for "Half Mapped".
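The pair-level rules can be sketched as follows. The function and table names are illustrative, not part of the disclosure. The sketch assigns "HM" whenever exactly one read is unmapped, which follows the later statement in this description that class HM covers a mate in class P, N, M or I.

```python
# Priority order taken from the expression P < N < M < I.
PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}


def classify_pair(class_1, class_2):
    """Return the data class of a read pair from the classes of its reads."""
    if class_1 == class_2:
        return class_1              # same class, including U/U
    if "U" in (class_1, class_2):
        return "HM"                 # exactly one read is unmapped
    # Both reads are mapped: the higher-priority class wins.
    return max(class_1, class_2, key=PRIORITY.__getitem__)
```

For instance, a pair made of a class P read and a class I read is assigned to class I.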

The definition of this specific class of reads is motivated by the fact that such reads are used when attempting to determine gaps or unknown regions existing in reference genomes (also referred to as little-known or unknown regions). Such regions are reconstructed by mapping pairs at their edges, using the read of the pair that can be mapped on the known regions. The unmapped mate is then used to build the so-called "contigs" of the unknown region, as shown in FIG. 28. Providing selective access to only this type of read pair therefore greatly reduces the associated computational burden and enables much more efficient processing of data that, with state-of-the-art solutions, originate from large data sets that would have to be inspected in their entirety.

The following table summarizes the matching rules applied to reads in order to define the data class to which each read belongs. The rules are defined in the first five columns of the table in terms of the presence or absence of each type of mismatch (n-, s-, d-, i- and c-type mismatches). The sixth column provides the rules in terms of the maximum threshold for each mismatch type and of any functions f(n, s) and w(n, s, d, i, c) of the possible mismatch types.

Table 1. Types of mismatches and sets of constraints that each sequence read must satisfy in order to be classified into the data classes defined in this disclosure.

[Partitioning of the matching rules for sequence reads of classes N, M and I into subclasses with different degrees of matching accuracy]

The data classes of type N, M and I defined in the previous sections can be further decomposed into an arbitrary number of distinct subclasses with different degrees of matching accuracy. This option is an important technical advantage because it provides finer granularity and, as a consequence, more efficient selective access to each data class. As an example, and not as a limitation, partitioning class N into k subclasses (subclass N_1, ..., subclass N_k) requires defining a vector with the corresponding components MAXN_1, MAXN_2, ..., MAXN_(k-1), MAXN_k, subject to the condition MAXN_1 < MAXN_2 < ... < MAXN_(k-1) < MAXN, and assigning each read to the lowest-ranked subclass that satisfies the constraints specified in Table 1 when evaluated against each element of the vector. This is shown in FIG. 29, where the data classification unit 291 contains encoders for classes P, N, M, I, U and HM, and encoders for annotations and metadata. The class N encoder is configured with a vector of thresholds, MAXN_1 to MAXN_k (292), which generates k subclasses of N data (296).

For classes of type M and I the same principle is applied by defining vectors with the same properties for MAXM and MAXTOT respectively, and by using each vector component as the threshold for checking whether the functions f(n, s) and w(n, s, d, i, c) satisfy the constraint. As for subclasses of type N, a read is assigned to the lowest subclass for which the constraint is satisfied. The number of subclasses for each class type is independent, and any combination of subdivisions is admissible. This is shown in FIG. 29, where the class M encoder 293 and the class I encoder 294 are configured with vectors of thresholds MAXM_1 to MAXM_j and MAXTOT_1 to MAXTOT_h respectively. The two encoders generate j subclasses of M data (297) and h subclasses of I data (298) respectively.
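Assignment to the lowest subclass whose threshold is satisfied can be sketched as below. The function name and the example thresholds are hypothetical. The score stands for n in class N, f(n, s) in class M or w(n, s, d, i, c) in class I.

```python
def assign_subclass(score, thresholds):
    """Return the 1-based index of the lowest subclass whose threshold is
    not exceeded by `score`, or None if every threshold is exceeded.

    `thresholds` must be strictly increasing, mirroring the condition
    MAXN_1 < MAXN_2 < ... < MAXN_k stated in the text.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    for index, limit in enumerate(thresholds, start=1):
        if score <= limit:
            return index
    return None
```

With the illustrative vector [1, 3, 5], a read with two "n-type" mismatches lands in subclass N_2.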

If the two reads of a pair are classified into the same subclass, the pair belongs to that subclass.

If the two reads of a pair are classified into subclasses of different classes, the pair belongs to the subclass of the class with the higher priority according to the following expression:
N < M < I
where N has the lowest priority and I has the highest priority.

If the two reads belong to different subclasses of one of the classes N, M or I, the pair belongs to the subclass with the highest priority according to the following expressions:
N_1 < N_2 < ... < N_k
M_1 < M_2 < ... < M_j
I_1 < I_2 < ... < I_h
where the highest index has the highest priority.
[Transformation of the "external" reference sequence]

The mismatches found in the reads classified into classes N, M and I can be used to create "transformed" references, which are used to compress the representation of the reads more efficiently.

Reads classified as belonging to class N, M or I (with respect to the "pre-existing", i.e. "external", reference sequence, denoted RS_0) can be coded with respect to the "transformed" reference sequence RS_1 according to the actual mismatches they present with the "transformed" reference. For example, if read^M_{in}, belonging to class M (denoting the i-th read of class M), contains mismatches with respect to the reference sequence RS_n, then after "transformation" read^M_{in} = read^P_{i(n+1)} can be obtained with A(Ref_n) = Ref_{n+1}, where A is the transformation from reference sequence RS_n to reference sequence RS_{n+1}.

FIG. 19 shows an example of how reads containing mismatches (belonging to class M) with respect to reference sequence 1 (RS_1) can be transformed into perfectly matching reads with respect to reference sequence 2 (RS_2), which is obtained from RS_1 by modifying the bases at the mismatch positions. The reads remain classified as before and are coded together with the other reads of the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a class P read. This transformation can be denoted as:

RS_2 = A(RS_1)

When the representation of the transformation A, which generates RS_2 when applied to RS_1, plus the representation of the reads versus RS_2, corresponds to a lower entropy than the representation of the class M reads versus RS_1, it is advantageous to transmit the representation of the transformation A and the corresponding representation of the reads versus RS_2, because a higher compression of the data representation is achieved.
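A minimal sketch of such a transformation, restricted to substitutions, is given below. The helper names and the toy sequences are hypothetical, and the sketch does not model how the transformation A is actually coded in the bitstream.

```python
def mismatches(reference, read, position):
    """List (reference_position, read_base) for each substitution of a read
    mapped at 0-based `position` on `reference`."""
    return [(position + k, base)
            for k, base in enumerate(read)
            if reference[position + k] != base]


def transform(reference, edits):
    """Apply substitutions to a reference, i.e. compute RS_2 = A(RS_1)."""
    sequence = list(reference)
    for position, base in edits:
        sequence[position] = base
    return "".join(sequence)


rs_1 = "ACGTACGTAC"
read = "GTTC"                       # mapped at position 2, one substitution
edits = mismatches(rs_1, read, 2)   # the transformation A, as a list of edits
rs_2 = transform(rs_1, edits)
```

Against rs_2 the same read has no mismatch left, so it can be coded with class P descriptors only. Whether this pays off depends on the cost of coding the edits, as the text explains.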

Coding the transformation A for transmission in the compressed bitstream requires the definition of two additional descriptors, as defined in the table below.

FIG. 26 shows an example of how a reference transformation is applied to reduce the number of mismatches to be coded in the mapped reads.

It should be noted that, in some cases, a transformation applied to the reference:
- may introduce mismatches in the representation of reads that were not present when referring to the reference before the transformation was applied;
- may modify the type of a mismatch: a read may present an A instead of a G while all other reads present a C instead of a G, but the mismatch remains at the same position.
In addition, different data classes, and subsets of the data of each data class, may refer to the same "transformed" reference sequence or to reference sequences obtained by applying different transformations to the same pre-existing reference sequence.

FIG. 27 further shows an example of how reads can change their type of coding from one data class to another by means of the appropriate set of descriptors (e.g. coding a read from class M using the descriptors of class P) after a reference transformation has been applied and the read is represented using the "transformed" reference. This occurs, for example, when the transformation changes all the bases corresponding to the mismatches of a read into the bases actually present in the read, thereby virtually transforming a read belonging to class M (when referring to the original, non-"transformed" reference sequence) into a virtual read of class P (when referring to the "transformed" reference). The definition of the set of descriptors used for each class of data is provided in the following sections.

FIG. 30 shows how the different classes of data can use the same "transformed" reference R_1 = A_0(R_0) (300) to re-encode the reads, or how different transformations A_N (301), A_M (302) and A_I (303) can be applied separately to each class of data.
[Definition of the information necessary to represent sequence reads as blocks of descriptors]

Once the classification of reads into the defined classes is complete, further processing consists of defining a set of distinct descriptors that represent the remaining information needed to reconstruct the read sequence when it is represented as mapped on a given reference sequence. The data structure of these descriptors requires the storage of global parameters and metadata to be used by the decoding engine. These data are structured in the Genomic Dataset Header described in the table below. A dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all subsequent analyses. If the same genomic sample is sequenced twice in two distinct runs, the resulting data are encoded as two distinct datasets.

Table 1. Structure of the Genomic Dataset Header

A sequence read (i.e. a DNA segment) referred to a given reference sequence can be fully expressed by:
- The starting position on the reference sequence (pos).
- A flag signalling whether the read has to be considered as a reverse complement with respect to the reference (rcomp).
- The distance to the mate, in the case of paired reads (pair).
- The read length (len), in the case of sequencing technologies that produce variable-length reads. If the read length is constant, the length associated with each read can be omitted and stored once in the main file header.
- For each mismatch:
  - the mismatch position (nmis for class N, snpp for class M, indp for class I);
  - the mismatch type (not present in class N, snpt in class M, indt in class I).
- Flags indicating specific characteristics of the sequence read, such as:
  - template having multiple segments in sequencing;
  - each segment properly aligned according to the aligner;
  - unmapped segment;
  - next segment in the template unmapped;
  - signalling of the first or last segment;
  - quality control failure;
  - PCR or optical duplicate;
  - secondary alignment;
  - supplementary alignment.
- The soft-clipped nucleotide sequence, when present (indc in class I).
- A flag indicating the reference used for alignment and compression (e.g. the "internal" reference for class U), if applicable (descriptor rtype).
- For class U, the descriptor indc, which identifies the parts of a read (typically the edges) that do not match the "internal" reference under the specified set of matching accuracy constraints.
- The descriptor ureads, which is used to code verbatim the reads that cannot be mapped on any available reference, whether a pre-existing (i.e. "external") reference genome or an "internal" reference sequence.

This classification creates groups of descriptors that can be used to represent genome sequence reads unambiguously. The table below summarizes the descriptors needed for each class of reads aligned with an "external" (i.e. "pre-existing") or "internal" (i.e. "constructed") reference.

Table 2. Blocks of descriptors defined for each class of data

Reads belonging to class P are characterized, and can be perfectly reconstructed, by only a position, the reverse complement information, an offset between mates (when they have been obtained by a sequencing technology yielding mate pairs), some flags and a read length.

The next sections describe in detail how these descriptors are defined for classes P, N, M and I; class U is described in a later section.

Class HM applies to read pairs only and is the special case in which one read belongs to class P, N, M or I and the other read belongs to class U.
[Position descriptor]

In the position (pos) block, only the mapping position of the first encoded read is stored as an absolute value on the reference sequence. All other position descriptors take a value expressing the difference with respect to the previous position. Modelling the information source defined by the sequence of read position descriptors in this way generally yields reduced entropy, particularly for sequencing processes that generate high-coverage results.

For example, FIG. 1 shows how, after the starting position of the first alignment has been described as position "10000" on the reference sequence, the position of the second read, which starts at position 10180, is described as "180". With high coverage (> 50×), most of the descriptors of the position vector show a high occurrence of low values such as 0 and 1 and other small integers. FIG. 1 shows how the positions of three read pairs are described in a pos block.
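The differential coding of positions can be sketched as follows. The function names are illustrative, and the input is assumed to be a list of mapping positions in coding order.

```python
from itertools import accumulate


def encode_positions(positions):
    """Keep the first position absolute and code the rest as differences
    with respect to the previous position."""
    if not positions:
        return []
    deltas = [b - a for a, b in zip(positions, positions[1:])]
    return [positions[0]] + deltas


def decode_positions(descriptors):
    """Rebuild absolute positions with a running sum."""
    return list(accumulate(descriptors))
```

With reads starting at 10000 and 10180, the pos descriptors are 10000 and 180. At high coverage most differences are small integers, which is what lowers the entropy of the block.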
[Reverse complement descriptor]

Each read of the read pairs produced by sequencing technologies can originate from either genome strand of the sequenced organic sample. However, only one of the two strands is used as the reference sequence. FIG. 2 shows how, in a read pair, one read (read 1) can come from one strand and the other read (read 2) from the other strand.

When strand 1 is used as the reference sequence, read 2 is encoded as the reverse complement of the corresponding fragment on strand 1. This is shown in FIG. 3.
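For readers unfamiliar with the operation, a reverse complement can be computed as in this short sketch. The function name is illustrative, and only the symbols A, C, G, T and N are handled.

```python
# Complement table: A<->T, C<->G, N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, so a single flag is enough to record whether a read was reverse-complemented before mapping.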

In the case of coupled reads, four combinations of direct and reverse-complement mate pairs are possible. This is shown in FIG. 4. The rcomp block encodes the four possible combinations.

The same coding is used for the reverse complement information of reads belonging to classes N, M, P and I. To enable selective access to the different data classes, the reverse complement information of reads belonging to the four classes is coded in different blocks, as shown in Table 2.
[Pairing information descriptor]

The pairing descriptor is stored in the pair block. This block stores descriptors encoding the information needed to reconstruct the originating read pairs when the sequencing technology employed produces reads in pairs. Although at the date of this disclosure the vast majority of sequencing data is generated using technologies that produce paired reads, this is not the case for all technologies. For this reason the block is not needed to reconstruct all the sequencing data information when the sequencing technology of the genomic data considered does not generate paired-read information.
[Definitions]

- Mate pair: the read associated with another read in a read pair (e.g. read 2 is the mate pair of read 1 in the previous example).
- Pairing distance: the number of nucleotide positions on the reference sequence that separate one position in the first read (the pairing anchor, e.g. the last nucleotide of the first read) from one position of the second read (e.g. the first nucleotide of the second read).
- Most probable pairing distance (MPPD): the most probable pairing distance, expressed as a number of nucleotide positions.
- Position pairing distance (PPD): a way of expressing a pairing distance as the number of reads separating a read from its mate in a specific position descriptor block.
- Most probable position pairing distance (MPPPD): the most probable number of reads separating a read from its mate pair in a specific position descriptor block.
- Position pairing error (PPE): the difference between the MPPD or MPPPD and the actual position of the mate.
- Pairing anchor: the position of the last nucleotide of the first read in a pair, used as the reference for calculating the distance of the mate pair in terms of number of nucleotide positions or number of read positions.

FIG. 5 shows how the pairing distance between the reads of a pair is calculated.

The pair descriptor block is the vector of pairing errors, calculated as the number of reads to be skipped to reach the mate pair of the first read of a pair, with respect to the defined decoded pairing distance.

FIG. 6 shows an example of how pairing errors are calculated, both as absolute values and as a differential vector (characterized by lower entropy for high coverage).

The same descriptors are used for the pairing information of reads belonging to classes N, M, P and I. To enable selective access to the different data classes, the pairing information of reads belonging to the four classes is coded in different blocks, as shown in FIG. 8 (class N), FIGS. 10, 12 and 14 (class M), and FIGS. 15 and 16 (class I).
[Pairing information for reads mapped on different reference sequences]

In the process of mapping sequence reads on a reference sequence, it is not uncommon for the first read of a pair to be mapped on one reference sequence (e.g. chromosome 1) and the second on a different reference sequence (e.g. chromosome 4). In this case the pairing information described above has to be supplemented with additional information related to the reference sequence used to map one of the reads. This is achieved by coding:
1. A reserved value (flag) indicating that the pair is mapped on two different sequences (different values indicate whether read 1 or read 2 is mapped on the sequence that is not currently being encoded).
2. A unique reference identifier referring to the reference identifiers encoded in the main header structure, as described in Table 1.
3. A third element containing the mapping information on the reference identified at point 2, expressed as an offset with respect to the last encoded position.
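The three-element coding can be sketched as below. This is a simplified illustration: the reserved value is a placeholder constant, the function and parameter names are hypothetical, and the sketch uses a single reserved value rather than distinct values for read 1 and read 2.

```python
# Placeholder for the reserved "mate on another reference" value.
DIFFERENT_REFERENCE = 0xFFFFFF


def encode_mate(current_ref, mate_ref, pairing_distance,
                mate_position=0, last_coded_position=0):
    """Return the pair descriptors for one read pair.

    Same reference: only the pairing distance is coded.
    Different reference: reserved value, reference identifier, and the
    mate position as an offset from the last coded position.
    """
    if mate_ref == current_ref:
        return [pairing_distance]
    return [DIFFERENT_REFERENCE, mate_ref,
            mate_position - last_coded_position]
```

A decoder that meets the reserved value knows that the next two descriptors carry a reference identifier and a position offset instead of an ordinary pairing distance.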

FIG. 7 provides an example of this scenario.

In FIG. 7, since read 4 is not mapped on the currently encoded reference sequence, the genomic encoder signals this information by creating additional descriptors in the pair block. In the example shown, read 4 of pair 2 is mapped on reference no. 4, while the currently encoded reference is no. 1. This information is encoded using three components:
1) A special reserved value is encoded as the pairing distance (in this case 0xffffff).
2) A second descriptor provides the reference ID as listed in the main header (in this case 4).
3) A third element contains the mapping information on the reference concerned (170).
[Mismatch descriptors for class N reads]

Class N includes all reads in which only "n-type" mismatches are present, i.e. in which an N is found as the called base in place of an A, C, G or T base. All other bases of the read match the reference sequence perfectly.

FIG. 8 shows how:
the positions of "N" in read 1 are coded as
- the absolute position in read 1, or
- the differential position with respect to the previous "N" in the same read,
and the positions of "N" in read 2 are coded as
- the absolute position in read 2 plus the length of read 1, or
- the differential position with respect to the previous "N".
In the nmis block, the coding of each read pair is terminated by a special "separator" symbol.
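The nmis coding of one read pair can be sketched as follows. Several details are assumptions made for illustration: positions are 0-based, the separator is represented by -1, and in differential mode the first "N" of read 2 is coded relative to the last "N" of read 1 in pair-wide coordinates.

```python
SEPARATOR = -1  # hypothetical stand-in for the special separator symbol


def encode_n_positions(read_1, read_2, differential=True):
    """Code the positions of 'N' in a read pair, followed by a separator.

    Concatenating the reads offsets every position in read 2 by the
    length of read 1, as the text requires.
    """
    positions = [k for k, base in enumerate(read_1 + read_2) if base == "N"]
    if differential:
        previous = [0] + positions[:-1]
        positions = [p - q for p, q in zip(positions, previous)]
    return positions + [SEPARATOR]
```

A pair without any "N" yields only the separator, which keeps read pairs aligned with their descriptors in the block.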
[Descriptors encoding substitutions (mismatches or SNPs), insertions, deletions]

A substitution is defined as the presence, in a mapped read, of a nucleotide base different from the one present at the same position in the reference sequence.

FIG. 9 shows examples of substitutions in a mapped read pair. Each substitution is encoded as a "position" (snpp block) and a "type" (snpt block). Depending on the statistical occurrence of substitutions, insertions or deletions, different source models of the associated descriptors can be defined and the generated symbols coded in the associated block.
[Source model 1: substitutions as positions and types]
[Substitution position identifier]

A substitution position is calculated in the same way as the values of the nmis block, i.e.:
in read 1, substitutions are encoded
- as the absolute position in read 1, or
- as the differential position with respect to the previous substitution in the same read;
in read 2, substitutions are encoded
- as the absolute position in read 2 plus the length of read 1, or
- as the differential position with respect to the previous substitution.

FIG. 10 shows how substitutions (where, at a given mapping position, a symbol in a read differs from the symbol in the reference sequence) are coded as:
1. the position of the mismatch
- with respect to the beginning of the read, or
- with respect to the previous mismatch (differential coding);
2. the type of mismatch, represented as a code calculated as described in FIG. 10.

In the snpp block, the coding of each read pair is terminated by a special "separator" symbol.
[Substitution type descriptor]

For class M (and class I, as described in the next sections), mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read, over the vector {A, C, G, T, N, Z}. For example, if the aligned read presents a C instead of the T present at the same position in the reference, the mismatch index is "4". The decoding process reads the encoded descriptor and the nucleotide at the given position on the reference, and moves from left to right to retrieve the decoded symbol. For example, a "2" received for a position where a G is present in the reference is decoded as "N". FIG. 11 shows all the possible substitutions and the respective coding symbols. Different context-adaptive probability models can be assigned to each substitution index according to the statistical properties of each substitution type for each data class, so as to minimize the entropy of the descriptors.
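The two worked examples in the text are consistent with treating the substitution vector as circular and coding the offset from the reference symbol to the read symbol. The sketch below implements that reading; it is an interpretation of the examples, not a statement of the normative procedure, and the function names are illustrative.

```python
SUBSTITUTION_VECTOR = "ACGTNZ"


def encode_substitution(ref_base, read_base, vector=SUBSTITUTION_VECTOR):
    """Circular offset from the reference symbol to the read symbol."""
    return (vector.index(read_base) - vector.index(ref_base)) % len(vector)


def decode_substitution(ref_base, index, vector=SUBSTITUTION_VECTOR):
    """Step `index` places from the reference symbol, wrapping around."""
    return vector[(vector.index(ref_base) + index) % len(vector)]
```

Passing the 16-symbol vector "ACGTNZMRWSYKVHDB" as `vector` gives the same mechanism over the IUPAC ambiguity codes.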

When IUPAC ambiguity codes are adopted, the substitution mechanism works in exactly the same way, but the substitution vector is extended to S = {A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B}.

FIG. 12 shows an example of coding substitutions in the snpt block.

Some examples of substitution type coding when IUPAC ambiguity codes are adopted are shown in FIG. 13, and another example of substitution indexes is shown in FIG. 14.
[Coding of Insertions and Deletions]

For class I, mismatches and deletions are coded as an index that leads (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read, taken from the vector {A, C, G, T, N, Z}. For example, if the aligned read presents a C instead of the T found at the same position in the reference, the mismatch index is "4". If the read presents a deletion where the reference contains an "A", the coded symbol is "5". The decoding process reads the coded descriptor and the nucleotide at the given position on the reference, then moves from left to right to retrieve the decoded symbol. For example, a "3" received for a position where the reference contains a G is decoded as "Z".

Insertions are coded as 6, 7, 8, 9 and 10 for an inserted A, C, G, T and N, respectively.
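The class I coding rules can be summarized in a short sketch. It assumes that the vector is traversed circularly and that "Z" stands for a deleted base; both are inferred from the worked examples, not stated explicitly. The function name and the `kind` argument are hypothetical.

```python
S_BASIC = list("ACGTNZ")            # substitution vector from the text
S_IUPAC = list("ACGTNZMRWSYKVHDB")  # extended vector with IUPAC codes
INSERTABLE = list("ACGTN")          # symbols that can be inserted


def encode_event(kind, ref_symbol=None, read_symbol=None, vector=S_BASIC):
    """Return the class I code for a substitution, deletion or insertion.

    Assumptions: the vector is circular and "Z" marks a deletion.
    Insertion codes start right after the last substitution index,
    i.e. at len(vector).
    """
    if kind == "insertion":
        return len(vector) + INSERTABLE.index(read_symbol)
    if kind == "deletion":
        read_symbol = "Z"
    return (vector.index(read_symbol) - vector.index(ref_symbol)) % len(vector)
```

With the six-symbol vector the insertion codes are 6 to 10; with the 16-symbol IUPAC vector the same rule yields 16 to 20.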

FIG. 15 shows an example of how substitutions, insertions and deletions are coded in a read pair of class I. To support the whole set of IUPAC ambiguity codes, the substitution vector S = {A, C, G, T, N, Z} is replaced by S = {A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B}, as described in the previous paragraph for mismatches. In this case, because the substitution vector has 16 elements, the insertion codes need to take different values, namely 16, 17, 18, 19 and 20. This mechanism is shown in FIG. 16.
[Source Model 2: One Block per Substitution Type and Indels]

For some data statistics, a coding model different from the one described in the previous section can be developed for substitutions and indels, resulting in a source with lower entropy. Such a coding model is an alternative to the techniques described above for mismatches only and for mismatches and indels.

In this case, one data block is defined for each possible substitution symbol (5 without IUPAC codes, 16 with IUPAC codes), plus one block for deletions and four more blocks for insertions. For simplicity of explanation, the description focuses on the case where IUPAC codes are not supported.

FIG. 17 shows how each block contains the positions of the mismatches or insertions of a single type. If no mismatch or indel of a given type is present in the coded read pair, a 0 is coded in the corresponding block. The header of each access unit contains a flag signaling the first block to be decoded, so that the decoder can start decoding the blocks described in this section. In the example of FIG. 18, the first element to be decoded is position 2 in the C block. On the decoding side, when the decoding pointer of every block points to a value of 0, the decoding process moves on to the next read pair.
[Coding of Additional Signaling Flags]

Each of the data classes introduced above (P, M, N, I) may require the coding of additional information about the nature of the coded reads. This information may be related, for example, to the sequencing experiment (e.g. indicating the probability that a read is a duplicate) or may express some characteristic of the read mapping (e.g. first or second in pair). In the context of the present invention, this information is coded in a separate block for each data class. The main advantage of this approach is that the information can be accessed selectively, only when needed and only for the required region of the reference sequence. Other examples of the use of such flags are:
• Read paired
• Read mapped in proper pair
• Read or mate unmapped
• Read or mate from the reverse strand
• First/second in pair
• Not primary alignment
• Read fails platform/vendor quality checks
• Read is a PCR or optical duplicate
• Supplementary alignment
[Descriptors for Class U and Construction of an "Internal" Reference for Unmapped Reads of "Class U" and "Class HM"]

Reads belonging to class U, and unmapped mates of read pairs in class HM, cannot be mapped to any "external" reference sequence in a way that satisfies the specified set of matching accuracy constraints required to belong to class P, N, M or I. For these reads, one or more "internal" reference sequences are therefore "constructed" and used for the compressed representation of the reads belonging to these data classes.

Several approaches can be used to construct appropriate "internal" references, such as, by way of example and not limitation:
• Partitioning the unmapped reads into clusters containing reads that share a common contiguous genomic sequence of at least a minimum size (the signature). Each cluster can be uniquely identified by its signature, as shown in FIG. 22.
• Sorting the reads in any meaningful order (e.g. lexicographic order) and using the last N reads as the "internal" reference for coding read N+1. This method is shown in FIG. 23.
• Performing a so-called "de-novo assembly" on a subset of the reads of class U, so that all or a relevant subset of the reads belonging to that class can be aligned and coded according to the specified matching accuracy constraints or to a new set of constraints.
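As an illustration of the first approach, the sketch below groups reads that share a contiguous subsequence of length k. The greedy assignment rule and the function name are assumptions made for this example; the text only requires that the reads of a cluster share a signature of at least a minimum size.

```python
def cluster_by_signature(reads, k):
    """Greedy clustering sketch: reads sharing a k-mer (the signature)
    end up in the same cluster.

    Hypothetical rule: a read joins the first existing cluster whose
    signature occurs in it; otherwise its first k-mer becomes the
    signature of a new cluster. Reads shorter than k are skipped.
    """
    clusters = {}  # signature -> list of reads
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # no signature of size k can be extracted
        signature = next((s for s in kmers if s in clusters), None)
        if signature is None:
            signature = kmers[0]
            clusters[signature] = []
        clusters[signature].append(read)
    return clusters
```

Each dictionary key plays the role of the signature that uniquely identifies a cluster. A production encoder would likely use a more careful signature selection than this first-match rule.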

If the read being coded can be mapped to the "internal" reference while satisfying the specified set of matching accuracy constraints, the information needed to reconstruct the read after compression is coded using the following types of descriptors:
1. The start position of the matching portion on the internal reference, expressed as a read number in the internal reference (pos block). This position can be coded as an absolute value or as a differential value with respect to the previously coded read.
2. The offset of the start position from the beginning of the corresponding read in the internal reference (pair block). For example, for constant read length, the actual position is pos * length + pair.
3. Possible mismatches, coded as mismatch position (snpp block) and type (snpt block).
4. The parts of the read (typically the edges identified by pair) that do not match the internal reference, or that match it with a number of mismatches above a defined threshold, are coded in the indc block. As shown in FIG. 24, a padding operation can be performed on the edges of the portion of the internal reference in use, in order to reduce the entropy of the mismatches coded in the indc block. The encoder can select the most appropriate padding strategy according to the statistical properties of the genomic data being processed. Possible padding strategies include:
a. No padding
b. A constant padding pattern chosen according to the frequency of the currently coded data
c. A variable padding pattern chosen according to the statistical properties of the current context, defined in terms of the most recent N coded reads
The specific type of padding strategy can be signaled by special values in the indc block header.
5. A flag indicating whether the read has been coded using an internal self-generated reference, an external reference, or no reference (rtype block).
6. Reads that are coded verbatim (ureads).
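The relation between the pos and pair descriptors in item 2 can be sketched as follows. The sketch assumes that the internal reference is a concatenation of equal-length reads and that pos is counted from zero; neither detail is stated in the text, and the function names are hypothetical.

```python
def internal_reference_position(pos, pair, read_length):
    """Position on the internal reference for constant read length:
    pos selects a read, pair is the offset inside that read."""
    return pos * read_length + pair


def matching_portion(reference_reads, pos, pair, read_length):
    """Return the stretch of the internal reference that a read of
    `read_length` bases is matched against.

    Assumption: the internal reference is the concatenation of
    `reference_reads`, all of length `read_length`, with pos 0-based.
    """
    reference = "".join(reference_reads)
    start = internal_reference_position(pos, pair, read_length)
    return reference[start:start + read_length]
```

For example, with reference reads of four bases, pos = 1 and pair = 2 select position 6 on the concatenated reference.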

FIG. 24 shows an example of such a coding procedure.

FIG. 25 shows an alternative coding of unmapped reads on the internal reference, in which the pos + pair descriptors are replaced by a signed pos. In this case, pos represents the distance, in terms of positions on the reference sequence, of the leftmost nucleotide of read n from the leftmost nucleotide of read n-1.

If the reads of class U have variable length, an additional descriptor, rlen, is used to store the length of each read.

This coding approach can be extended to support N start positions per read, so that a read can be split across two or more reference positions. This is particularly useful for coding reads generated by sequencing technologies (such as those from Pacific Bioscience) that produce very long reads (50K+ bases), which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can also be used to code chimeric sequence reads, defined as reads that align to two distinct portions of the genome with little or no overlap.

The approach described above clearly applies beyond class U alone: it can be applied to any block containing descriptors related to read positions (pos blocks).
[Alignment Score Descriptor]

The mscore descriptor provides a score for each alignment. In the context of the present invention, it is used to represent the per-read mapping/alignment score generated by genomic sequence read aligners.

The score is expressed using an exponent and a mantissa. The numbers of bits used to represent the exponent and the mantissa are transmitted as configuration parameters. By way of example and not limitation, Table 2 shows how this is specified in IEEE 754 for an 11-bit exponent and a 52-bit mantissa.

The score of each alignment can be represented by:
• A 1-bit sign (S)
• An 11-bit exponent (E)
• A 52-bit mantissa (M)

Table 2. The alignment score can be expressed as a 64-bit double-precision floating-point value

Since the base (radix) used to calculate the score is 10, the score is given by:
$\text{Score} = (-1)^{S} \times 10^{E} \times M$
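The score formula can be evaluated directly once the three fields are available. The sketch below takes S, E and M as already-decoded numbers; how they are unpacked from their bit fields is not described in this section, so that step is left out. The function name is hypothetical.

```python
def alignment_score(sign, exponent, mantissa):
    """Evaluate Score = (-1)^S * 10^E * M with radix 10.

    `sign` is 0 or 1, `exponent` is an integer, `mantissa` is the
    already-decoded mantissa value (an assumption of this sketch).
    """
    return (-1) ** sign * 10 ** exponent * mantissa
```

For instance, S = 0, E = 2 and M = 1.5 give a score of 150, and setting S = 1 flips the sign.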
[Read Groups]

Different types of sequence reads can be generated during the sequencing process. By way of example and not limitation, the types can be associated with different sequenced samples, different experiments, or different configurations of the sequencing device. According to the disclosed invention, this information is preserved after sequencing and alignment by a dedicated descriptor named rgroup. rgroup is a label associated with each coded read, and it enables a decoding device to partition the decoded reads into groups after decoding.
[Multiple Alignment Descriptors]

The following descriptors are specified to support multiple alignments. If spliced reads are present, the present invention defines spline_reads_flag as a global flag that is set to 1.
[mmap Descriptor]

The mmap descriptor is used to signal at how many positions the read, or the leftmost read of a pair, has been aligned. A genomic record containing multiple alignments is associated with one multi-byte mmap descriptor. The first two bytes of the mmap descriptor represent an unsigned integer N, which refers either to the read as a single segment (if no splices are present in the coded dataset) or to all the segments into which the read has been spliced for the several possible alignments (if splices are present in the dataset). The value of N indicates how many values of the pos descriptor are coded for the template in this record. N is followed by one or more unsigned integers M_i, as described below.
[Strandedness of Multiple Alignments]

The rcomp descriptor described in the present invention is used to specify the strandedness of each read alignment, using the syntax specified in the present invention.
[Scores of Multiple Alignments]

In the case of multiple alignments, one mscore as specified in the present invention is assigned to each alignment.
[Multiple Alignments without Splices]

If no splices are present in the access unit, spline_reads_flag is unset.

In paired-end sequencing, the mmap descriptor is composed of a 16-bit unsigned integer N followed by one or more 8-bit unsigned integers M_i, with i taking values from 1 to the number of complete alignments of the first (here the leftmost) read. For each alignment of the first read, whether spliced or not, M_i is used to signal how many segments are used to align the second read (without splices, this is equal to the number of alignments) and therefore how many values of the pair descriptor are coded for that alignment of the first read.

The values of M_i are used to compute $P = \sum_{i=1}^{N} M_i$, which represents the number of alignments of the second read.

The special value M_i = 0 indicates that the i-th alignment of the leftmost read is paired with an alignment of the rightmost read that is already paired with the k-th alignment of the leftmost read, with k < i (in that case no new alignment is detected, which is consistent with the equation above).

For example, in the simplest cases:
1. If there is a single alignment for the leftmost read and two alternative alignments for the rightmost read, N is 1 and M_1 is 2.
2. If two alternative alignments are detected for the leftmost read but only one for the rightmost read, N is 2 and M_2 is 0.

When M_i is 0, the associated value of pair must link to an existing alignment of the second read; otherwise a syntax error is raised and the alignment is considered broken.

Example: if, as described above, the first read has two mapping positions and the second read has only one, N is 2, M_1 is 1 and M_2 is 0. If this is followed by another alternative secondary mapping for the whole template, N is 3 and M_3 is 1.
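The layout of the mmap descriptor in the case without splices (a 16-bit unsigned N followed by 8-bit unsigned M_i values) can be read with a short sketch. The byte order is not stated in this section, so big-endian is an assumption, as is the choice that exactly N values of M_i follow N. The function name is hypothetical.

```python
import struct


def parse_mmap(buffer, byteorder=">"):
    """Parse an mmap descriptor for paired-end data without splices.

    Assumptions of this sketch: N is big-endian unless `byteorder`
    says otherwise, and exactly N one-byte M_i values follow it.
    Returns (N, [M_1..M_N], P), where P is the sum of the M_i.
    """
    (n,) = struct.unpack_from(byteorder + "H", buffer, 0)
    m_values = list(struct.unpack_from(f"{n}B", buffer, 2))
    return n, m_values, sum(m_values)
```

Applied to the example above (N = 2, M_1 = 1, M_2 = 0), the four bytes `00 02 01 00` yield P = 1, i.e. a single alignment of the second read.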

FIG. 39 illustrates the meaning of N, P and M_i in the case of multiple alignments without splices, and FIG. 40 shows how the pos, pair and mmap descriptors are used to code the multiple alignment information.

With reference to FIG. 40, the following applies:
• The rightmost read has $P = \sum_{i=1}^{N} M_i$ alignments.
• Some values of M_i can be 0 when the i-th alignment of the leftmost read is paired with an alignment of the rightmost read that is already paired with the k-th alignment of the leftmost read (k < i).
• One predetermined value of the pair descriptor may be present to signal alignments belonging to the range of another AU. If present, it is always the first pair descriptor of the current record.
[Multiple Alignments with Splices]

If the dataset is coded with spliced reads, the msar descriptor makes it possible to represent the length and strandedness of the splices.

After decoding the mmap and msar descriptors, the decoder knows how many reads or read pairs have been coded to represent the multiple mappings, and how many segments compose each read or read pair mapping. This is shown in FIG. 41 and FIG. 42.

With reference to FIG. 41, the following applies:
• The leftmost read has N_1 alignments with N splices (N_1 ≤ N).
• N represents the number of splices present in all the alignments of the leftmost read and is coded as the first value of the mmap descriptor.
• The rightmost read has $P = \sum_{i=1}^{N_1} M_i$ splices, where M_i is the number of splices of the rightmost read associated as a pair with the i-th alignment of the leftmost read (1 ≤ i ≤ N_1). In other words, P represents the number of splices of the rightmost read and is computed using the N values that follow the first value of the mmap descriptor.
• N_1 and N_2 represent the number of alignments of the first and second read, respectively, and are computed using the N + P values of the msar descriptor.

With reference to FIG. 42, the following applies:
• The leftmost read has N_1 alignments with N splices (N_1 ≤ N). If N_1 = N and N_2 = P, there are no splices.
• The rightmost read has $P = \sum_{i=1}^{N_1} M_i$ splices t_j (1 ≤ j ≤ P) and N_2 alignments (N_2 ≤ P).
• The number of pair descriptors is computed as NP = max(N_1, P) + M_0, where
• M_0 is the number of M_i with value 0, and
• NP must be incremented by 1 if one special pair descriptor indicates that alignments are present in another AU.
[Alignment Score]

The mscore descriptor allows the mapping score of an alignment to be signaled. In single-end sequencing it has N_1 values per template. In paired-end sequencing it has one value for each alignment of the whole template (the number of different alignments of the first read plus the number of further alignments of the second read, i.e. when M_i − 1 > 0):
Number of scores = max(N_1, N_2) + M_0
where M_0 is the total number of M_i equal to 0.
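The counting formulas for scores and pair descriptors can be written out literally as below. This is only a transcription of the formulas as stated in the text (max(N_1, N_2) + M_0 and max(N_1, P) + M_0, with P taken as the sum of the M_i); the function and parameter names are hypothetical.

```python
def count_zero_m(m_values):
    """M_0: how many of the M_i values are equal to 0."""
    return sum(1 for m in m_values if m == 0)


def number_of_scores(n1, n2, m_values):
    """Number of mscore values for a paired-end template,
    transcribed from: max(N_1, N_2) + M_0."""
    return max(n1, n2) + count_zero_m(m_values)


def number_of_pair_descriptors(n1, m_values, alignments_in_other_au=False):
    """Number of pair descriptors, transcribed from:
    NP = max(N_1, P) + M_0, plus 1 when a special pair descriptor
    signals alignments in another AU. P is taken as sum(M_i)."""
    p = sum(m_values)
    extra = 1 if alignments_in_other_au else 0
    return max(n1, p) + count_zero_m(m_values) + extra
```

For a leftmost read with one alignment and a rightmost read with two (N_1 = 1, N_2 = 2, M_1 = 2), both counts evaluate to 2.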

The present invention allows multiple score values to be associated with each alignment. The number of score values per alignment is signaled by the configuration parameter as_depth.
[Descriptors for Multiple Alignments without Splices]

Table 3. Determination of the number of descriptors needed to represent multiple alignments in one genomic record, in the case of multiple alignments without splices

[Descriptors for Multiple Alignments with Splices]

Table 4 shows how the number of descriptors needed to represent multiple alignments in one genomic record is determined in the case of multiple alignments with splices.

Table 4. Determination of the number of descriptors needed to represent multiple alignments in one genomic record, in the case of multiple alignments with splices

[Multiple Alignments on Different Sequences]

The alignment process may find alternative mappings on a reference sequence other than the one where the primary mapping is located.

For uniquely aligned read pairs, the pair descriptor shall be used to represent absolute read positions when, for example, there is a chimeric alignment with the mate on another chromosome. The pair descriptor shall be used to signal the reference and the position of the next record containing further alignments for the same template. The last record (e.g. the third, if the alternative mappings are coded in three different AUs) contains the reference and position of the first record.

If one or more alignments of the leftmost read of a pair lie on a reference sequence different from the one related to the AU currently being coded, a predetermined value is used for the pair descriptor. The predetermined value is followed by the reference sequence identifier and by the position of the leftmost alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).
[Multiple Alignments with Insertions, Deletions and Unmapped Portions]

If an alternative secondary mapping does not preserve the contiguity of the reference region to which the sequence is aligned, it may be impossible to reconstruct the exact mapping generated by the aligner, because the actual sequence (and the descriptors related to mismatches such as substitutions or indels) is coded only for the primary alignment. The msar descriptor is used to represent how secondary alignments map onto the reference sequence when they contain indels and/or soft clips. If msar is represented by the special symbol "*" for a secondary alignment, the decoder reconstructs the secondary alignment from the primary alignment and the mapping position of the secondary alignment.
[msar Descriptor]

The msar (Multiple Segments Alignment Record) descriptor supports spliced reads and alternative secondary alignments containing indels or soft clips.

msar is intended to convey the following information:
• The length of the mapped segments
• The contiguity of the different mappings of secondary alignments and/or spliced reads (i.e. the presence of inserted, deleted or clipped bases)

msar uses the extended CIGAR string syntax described below, together with the additional symbols described in Table 5.

Table 5. Special symbols used for the msar descriptor in addition to the syntax described in Table 6

[Extended CIGAR Syntax]

This section specifies an extended CIGAR (E-CIGAR) syntax for strings that associate a sequence with information about the related mismatches, indels, clipped bases, multiple alignments and spliced reads.

The edit operations described in the present invention are listed in Table 6.

Table 6. Syntax of the MPEG-G E-CIGAR string

[Source Models, Entropy Coders and Coding Modes]

For each data class, subclass and associated descriptor block of the genomic data structure disclosed in the present invention, a different coding algorithm may be adopted according to the specific features of the data or metadata carried by each block and to its statistical properties. A "coding algorithm" is to be understood as the association of a specific "source model" of the descriptor block with a specific "entropy coder". The specific "source model" can be specified and selected to obtain the most efficient coding of the data in terms of minimizing the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or by the features of the probability distribution and the associated implementation issues. Each selection of a specific "coding algorithm", also referred to as a "coding mode", can be applied to an entire "descriptor block" associated with a data class or subclass for the whole dataset, or different "coding modes" can be applied to each portion of the descriptors partitioned into access units.

Each "source model" associated with a coding mode is characterized by:
• The definition of the descriptors emitted by each source (i.e. the set of descriptors used to represent a class of data, such as read positions, read pairing information and mismatches with respect to a reference sequence, as defined in Table 2).
• The definition of the associated probability model.
• The definition of the associated entropy coder.
[Further Advantages]

定義されたデータクラス及びサブクラスへのシーケンスデータの分類は、単一の個別のデータソース（例えば、距離、位置等）によって記述子のシーケンスをモデル化することによって特徴付けられる、より低い情報ソースエントロピーを利用する効率的なコーディングモードの実装を可能にする。 Classification of sequence data into defined data classes and subclasses is characterized by modeling the sequence of descriptors by a single, discrete data source (eg, distance, location, etc.), lower information source entropy Enables efficient coding mode implementation using

本発明の別の利点は、関心のある種類のデータのサブセットのみにアクセスすることができることである。たとえば、ゲノミクスにおける最も重要なアプリケーションの１つは、リファレンス（ＳＮＶ）又は母集団（ＳＮＰ）に対するゲノムサンプルの差異を見出すことである。今日、そのような分析は、完全なシーケンスリードの処理を必要とするが、本発明によって開示されるデータ表現を採用することによって、ミスマッチは、既に、１つから３つのデータクラスのみに分離されている（「ｎタイプ」と「ｉタイプ」のミスマッチも考慮することへの関心によって異なる）。 Another advantage of the present invention is that only a subset of the type of data of interest can be accessed. For example, one of the most important applications in genomics is to find differences in genomic samples relative to a reference (SNV) or population (SNP). Today, such analysis requires the processing of a complete sequence read, but by employing the data representation disclosed by the present invention, mismatches are already separated into only one to three data classes. (Depending on interest in also considering mismatches between "n-type" and "i-type").

A further advantage is the possibility of efficiently transcoding data and metadata compressed with reference to a specific "external" reference sequence to a different "external" reference sequence, either when a new reference sequence is published or when re-mapping is performed on already mapped data (e.g., using a different mapping algorithm) to obtain new alignments.

FIG. 20 shows a coding apparatus 207 according to the principles of the present invention. The coding apparatus 207 receives as input raw sequence data 209, produced for example by a genome sequencing apparatus 200. Genome sequencing apparatuses 200 are known in the art, such as the Illumina HiSeq 2500 or Thermo-Fisher Ion Torrent devices. The raw sequence data 209 is fed to an aligner unit 201, which prepares the sequences for encoding by aligning the reads to a reference sequence 2020. Alternatively, a dedicated module 202 can be used to generate a reference sequence from the available reads, using different strategies as described in the sections "Building Internal References for Unmapped Reads of Class U" and "HM Class" of this document. After processing by the reference generator 202, the reads can be mapped onto the resulting longer sequence. The aligned sequences are then classified by a data classification module 204. A further step of reference transformation is then applied to the reference in order to reduce the entropy of the data generated by the data classification unit 204. This means that the external reference 2020 is processed by a reference transformation unit 2019, which produces transformed data classes 2018 and reference transformation descriptors 2021. The transformed data classes 2018 are then fed to block encoders 205-207 together with the reference transformation descriptors 2021.
The genomic blocks 2011 are then fed to arithmetic encoders 2012-2014, which encode the blocks according to the statistical properties of the data or metadata they carry. The result is a genomic stream 2015.

FIG. 21 shows a decoding apparatus 218 according to the principles of the present disclosure. The decoding apparatus 218 receives a multiplexed genomic bitstream 2110 from a network or a storage element. The multiplexed genomic bitstream 2110 is fed to a demultiplexer 210, which produces separate streams 211; these are then fed to entropy decoders 212-214 to produce genomic blocks 215 and reference transformation descriptors 2112. The extracted genomic blocks are fed to block decoders 216-217, which further decode the blocks into classes of data, while the reference transformation descriptors are fed to a reference transformation unit 2113. A class decoder 219 further processes the genomic descriptors 2111 and the transformed reference 2114 and merges the results to produce uncompressed sequence reads, which can then be stored in formats known in the art, for instance a text file or zip-compressed file, or FASTQ or SAM/BAM files.

The class decoder 219 is able to reconstruct the original genomic sequences by using the information on the original reference sequences carried by one or more genomic streams, together with the reference transformation descriptors 2112 carried in the encoded bitstream. If the reference sequences are not transported by the genomic streams, they must be available at the decoding side and accessible by the class decoder.

本明細書に開示された本発明の技術は、ハードウェア、ソフトウェア、ファームウェア、又はそれらの任意の組み合わせで実施することができる。ソフトウェアで実現される場合、これらは、コンピュータ媒体に記憶され、ハードウェア処理ユニットによって実行されてもよい。ハードウェア処理ユニットは、１つ以上のプロセッサ、デジタルシグナルプロセッサ、汎用マイクロプロセッサ、特定用途向け集積回路又は他の個別論理回路を含むことができる。 The techniques of the present invention disclosed herein can be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, they may be stored on a computer medium and executed by a hardware processing unit. A hardware processing unit may include one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuits.

The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets, and similar devices.
[File format: selective access to regions of genomic data using a master index table]

To support selective access to specific regions of the aligned data, the data structure described in this document implements an indexing tool called the master index table (MIT). This is a multi-dimensional array containing the positions at which specific reads map on the associated reference sequences.
The values contained in the MIT are the mapping positions of the first read in each pos block, so that non-sequential access to each access unit is supported. The MIT contains one section for each class of data (P, N, M, I, U, and HM) and for each reference sequence. The MIT is contained in the Genomic Dataset Header of the encoded data. FIG. 21 shows the structure of the Genomic Dataset Header, FIG. 32 shows a generic visual representation of the MIT, and FIG. 33 shows an example of the MIT for class P of encoded reads.

図３３に示すＭＩＴに含まれる値は、圧縮ドメイン内の関心領域（及び対応するＡＵ）に直接アクセスするために使用される。 The values contained in the MIT shown in FIG. 33 are used to directly access the region of interest (and the corresponding AU) in the compression domain.

For example, referring to FIG. 33, in order to access the region comprised between positions 150,000 and 250,000 on reference 2, a decoding application skips to the second reference in the MIT and looks for the two values k1 and k2 such that k1 < 150,000 and k2 > 250,000, where k1 and k2 are two indexes read from the MIT. In the example of FIG. 33, these are the second and third positions of the second vector of the MIT. The returned values are then used by the decoding application to fetch the positions of the appropriate data from the local index table of the pos block, as described in the next section.
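The search for k1 and k2 can be sketched as a binary search over one sorted vector of the MIT. In the following Python sketch, the function name `bounding_entries` and the numeric contents of `mit_class_p` are hypothetical (they are not the values of FIG. 33), and clamping to the ends of the vector when no bounding entry exists is an assumption of the example.

```python
from bisect import bisect_left, bisect_right


def bounding_entries(mit_vector, start, end):
    """Return 0-based indexes (i1, i2) into a sorted MIT vector such that
    mit_vector[i1] is the last entry below `start` and mit_vector[i2] is the
    first entry above `end`. The indexes are clamped to the ends of the
    vector when no such entry exists (an assumption of this sketch)."""
    i1 = max(bisect_left(mit_vector, start) - 1, 0)
    i2 = min(bisect_right(mit_vector, end), len(mit_vector) - 1)
    return i1, i2


# Hypothetical MIT section for class P: one sorted vector per reference sequence,
# each entry being the mapping position of the first read of an access unit.
mit_class_p = [
    [5_000, 90_000, 180_000, 270_000, 360_000],  # reference 1
    [10_000, 120_000, 260_000, 400_000],         # reference 2
]

# Region 150,000 to 250,000 on reference 2 (second vector, 0-based index 1).
i1, i2 = bounding_entries(mit_class_p[1], 150_000, 250_000)
```

With these made-up values, `i1` and `i2` are the 0-based indexes 1 and 2, that is, the second and third positions of the second vector.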

Together with pointers to the blocks containing data belonging to the four classes of genomic data described above, the MIT can be used as an index of additional metadata and/or annotations added to the genomic data during its life cycle.
[Local index table]

各ゲノムデータブロックの先頭には、ローカルヘッダと呼ばれるデータ構造が付く。ローカルヘッダには、ブロックの特有の識別子、リファレンスシーケンス毎のアクセスユニットカウンタのベクトル、ローカルインデックステーブル（ＬＩＴ）、及びオプションでブロック固有のメタデータが含まれる。ＬＩＴは、ブロックペイロード内の各アクセスユニットに属するデータの物理的位置へのポインタのベクトルである。図３４は、コード化されたデータの特定の領域に、非シーケンシャルな方法でアクセスするためにＬＩＴが使用される、一般的なブロックヘッダ及びペイロードを示す。 At the beginning of each genome data block, a data structure called a local header is attached. The local header includes a unique identifier of the block, a vector of access unit counters for each reference sequence, a local index table (LIT), and optionally, block-specific metadata. LIT is a vector of pointers to the physical locations of the data belonging to each access unit in the block payload. FIG. 34 illustrates a typical block header and payload where the LIT is used to access a particular area of coded data in a non-sequential manner.

In the previous example, in order to access the region from 150,000 to 250,000 of the reads aligned on reference sequence No. 2, the decoding application retrieved positions 3 and 4 from the MIT. These values are used by the decoding process to access the third and fourth elements of the corresponding section of the LIT. In the example shown in FIG. 35, the Total Access Units counters contained in the block header are used to skip the LIT indexes related to the AUs of reference 1 (5 in the example). The indexes containing the physical positions of the requested AUs in the encoded stream are therefore calculated as:
Position of the data blocks belonging to the requested AU = data blocks belonging to the AUs of reference 1 to be skipped + position retrieved using the MIT
First block position: 5 + 3 = 8
Last block position: 5 + 4 = 9
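The calculation above amounts to adding the access unit counters of all preceding reference sequences to the position retrieved from the MIT. A minimal Python sketch follows; the function name `lit_index` and the counter value for reference 2 are assumptions, while the value 5 for reference 1 and the positions 3 and 4 are taken from the example.

```python
def lit_index(au_counts, reference_index, mit_position):
    """Return the LIT index for an access unit.

    au_counts       -- Total Access Units counter of each reference, in order
    reference_index -- 0-based index of the reference being accessed
    mit_position    -- position retrieved from the MIT for that reference
    """
    skipped = sum(au_counts[:reference_index])  # AUs of preceding references
    return skipped + mit_position


# Reference 1 has 5 AUs as in the example; the count for reference 2 is assumed.
au_counts = [5, 4]
first_block = lit_index(au_counts, 1, 3)
last_block = lit_index(au_counts, 1, 4)
```

For the values of the example this yields positions 8 and 9.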

ローカルインデックステーブルと呼ばれるインデックス作成メカニズムを使用して取得されたデータのブロックは、要求されたアクセスユニットの一部である。 A block of data obtained using an indexing mechanism called a local index table is part of the requested access unit.

図２６は、ＭＩＴテーブルに含まれるブロックが、データの各クラス又はサブクラス毎のＬＩＴのブロックにどのように対応するかを示す。 FIG. 26 shows how blocks included in the MIT table correspond to LIT blocks for each class or subclass of data.

図３７は、ＭＩＴ及びＬＩＴを使用して検索されたデータブロックが、次のセクションで定義されるように、１つ以上のアクセスユニットを構成する方法を示す。 FIG. 37 shows how data blocks retrieved using MIT and LIT constitute one or more access units as defined in the next section.

In one embodiment of the present invention, the LIT can be integrated as a substructure of the MIT. The advantage of such an approach is the speed of access to the indexed data in the case of sequential parsing of the compressed file. If the LIT is integrated into the MIT in the file header, a decoding apparatus needs to parse only a small portion of the data in order to retrieve the requested compressed information in the case of selective access. A further advantage, which will be apparent to a person skilled in the art, is that when streaming over a network the indexing information contained in the MIT and LIT is delivered within the first data blocks, thereby enabling the receiving device to perform operations such as sorting and selective access before the entire data transfer is complete.
[Access units]

データクラスで分類され、圧縮又は非圧縮ブロックで構造化されたゲノムデータは、異なるアクセスユニットに編成される。 Genomic data, categorized by data class and structured by compressed or uncompressed blocks, is organized into different access units.

A genomic access unit (AU) is defined as a section of genomic data (in compressed or uncompressed form) that reconstructs nucleotide sequences and/or the associated metadata, and/or sequences of DNA/RNA (e.g., a virtual reference), and/or annotation data generated by a genome sequencing apparatus and/or a genomic processing device or analysis application. An example of an access unit is shown in FIG. 37.

An access unit is a block of data that can be decoded either independently of other access units, using only globally available data (e.g., the decoder configuration), or by using information contained in other access units.

Access units are differentiated by:
• type, which characterizes the nature of the genomic data and data sets they carry and the way they can be accessed;
• order, which provides a unique order to access units belonging to the same type.

あらゆるタイプのアクセスユニットは、さらに異なる「カテゴリ」に分類することができる。 All types of access units can be further divided into different “categories”.

The following is a non-exhaustive list of definitions of the different types of genomic access units:
1) Access units of type 0 do not need to refer to any information from other access units in order to be accessed or decoded. All the information carried by the data or data sets they contain can be read and processed independently by a decoding apparatus or processing application.
2) Access units of type 1 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 1 requires access to one or more access units of type 0. Access units of type 1 encode genomic data related to sequence reads of "Class P".
3) Access units of type 2 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 2 requires access to one or more access units of type 0. Access units of type 2 encode genomic data related to sequence reads of "Class N".
4) Access units of type 3 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 3 requires access to one or more access units of type 0. Access units of type 3 encode genomic data related to sequence reads of "Class M".
5) Access units of type 4 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 4 requires access to one or more access units of type 0. Access units of type 4 encode genomic data related to sequence reads of "Class I".
6) Access units of type 5 contain reads that cannot be mapped on any available reference sequence ("Class U") and are encoded using an internally constructed reference sequence. Access units of type 5 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 5 requires access to one or more access units of type 0.
7) Access units of type 6 contain read pairs in which one read belongs to any of the classes P, N, M, or I and the other cannot be mapped on any available reference sequence ("Class HM"). Access units of type 6 contain data that refer to data carried by access units of type 0. Reading, decoding, and processing the data contained in access units of type 6 requires access to one or more access units of type 0.
8) Access units of type 7 contain metadata (e.g., quality scores) and/or annotation data associated with the data or data sets contained in access units of type 1. Access units of type 7 may be classified and labeled in different blocks.
9) Access units of type 8 contain data or data sets classified as annotation data. Access units of type 8 may be classified and labeled in blocks.
10) Access units of additional types can extend the structure and mechanisms described here. As an example, but not as a limitation, the results of genomic variant calling and of structural and functional analysis can be encoded in access units of new types. The organization of data in access units described in this document does not prevent any type of data from being encapsulated in access units, the mechanism being completely transparent with respect to the nature of the encoded data.
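The dependencies stated in the list above can be summarized as a small table. The Python sketch below is illustrative: the names `AU_TYPE_TO_CLASS`, `required_types`, and `decodable` are assumptions, and only types 0 to 6 are covered because the list does not state a decoding dependency for types 7 and 8.

```python
# Read class encoded by each access unit type, as given in the list above.
AU_TYPE_TO_CLASS = {1: "P", 2: "N", 3: "M", 4: "I", 5: "U", 6: "HM"}


def required_types(au_type):
    """Return the set of access unit types that must be accessible in order
    to decode an access unit of the given type (types 0 to 6 only)."""
    if au_type == 0:
        return set()   # type 0 is decoded without reference to other AUs
    if au_type in AU_TYPE_TO_CLASS:
        return {0}     # types 1 to 6 refer to data carried by type 0
    raise ValueError(f"no dependency is stated for access unit type {au_type}")


def decodable(au_type, available_types):
    """True if every type required by `au_type` is among `available_types`."""
    return required_types(au_type) <= set(available_types)
```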

Access units of type 0 are ordered (e.g., numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing).

タイプ１、２、３、４、５及び６のアクセスユニットは、順序付けする必要はなく、順序付けされた方法で格納及び／又は送信する必要もない（技術的な利点：並列処理／並列ストリーミング）。 Access units of types 1, 2, 3, 4, 5 and 6 do not need to be ordered and need not be stored and / or transmitted in an ordered manner (technical advantage: parallel processing / parallel streaming).

FIG. 37 shows how access units are composed of a header and one or more blocks of homogeneous data. Each block can be composed of one or more blocks. Each block contains several packets, and the packets are a structured sequence of the descriptors introduced above to represent, for example, read positions, pairing information, reverse-complement information, and mismatch positions and types.

各アクセスユニットは、ブロックごとに異なる数のパケットを持つことができるが、アクセスユニット内では、すべてのブロックが同じ数のパケットを持つ。 Each access unit can have a different number of packets per block, but within an access unit, all blocks have the same number of packets.

Each data packet can be identified by the combination of three identifiers X Y Z, where:
• X identifies the access unit it belongs to;
• Y identifies the block it belongs to (i.e., the type of data it encapsulates);
• Z is an identifier expressing the order of the packet with respect to the other packets in the same block.
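The three-part packet identifier, and the rule that all blocks within one access unit carry the same number of packets, can be sketched as follows. The names `PacketId`, `packets_in_order`, and `blocks_have_equal_packet_counts` are hypothetical, and the use of integers for X and Z and of descriptor names for Y is an assumption of the example.

```python
from collections import Counter
from typing import NamedTuple


class PacketId(NamedTuple):
    access_unit: int  # X: access unit the packet belongs to
    block: str        # Y: block the packet belongs to (type of data)
    order: int        # Z: order of the packet within its block


def packets_in_order(packets, access_unit, block):
    """Return the packets of one block of one access unit, sorted by Z."""
    selected = [p for p in packets
                if p.access_unit == access_unit and p.block == block]
    return sorted(selected, key=lambda p: p.order)


def blocks_have_equal_packet_counts(packets, access_unit):
    """Check that all blocks of one access unit hold the same number of packets."""
    counts = Counter(p.block for p in packets if p.access_unit == access_unit)
    return len(set(counts.values())) <= 1


# Hypothetical packets received out of order, e.g. from a multiplexed stream.
packets = [
    PacketId(1, "pos", 2), PacketId(1, "rcomp", 0), PacketId(1, "pos", 0),
    PacketId(1, "pos", 1), PacketId(1, "rcomp", 1), PacketId(1, "rcomp", 2),
]
```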

FIG. 38 shows an example of access units and packet labels, where AU_T_N is an access unit of type T with identifier N, which may or may not imply a notion of order depending on the access unit type. Identifiers are used to uniquely associate access units of one type with those of the other types required to completely decode the carried genomic data.

Access units of any type can be further classified and labeled in different "categories" according to different sequencing processes. For example, but not as a limitation, classification and labeling can take place when:
1. sequencing the same organism at different times (access units contain genomic information with a "temporal" connotation);
2. sequencing organic samples of different nature from the same organism (e.g., skin, blood, and hair for human samples); these are access units with a "biological" connotation.

Claims

A method of encoding genomic sequence data comprising reads of nucleotide sequences, said method comprising:
Aligning said reads to one or more reference sequences, thereby creating aligned reads;
Classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads;
Encoding the classified aligned reads as a multiplicity of blocks of descriptors,
Wherein encoding the classified aligned reads as a multiplicity of blocks of descriptors includes selecting the descriptors according to the classes of aligned reads; and
Structuring the blocks of descriptors with header information, thereby creating successive access units.

Further comprising classifying the reads that do not satisfy the specified matching rules into a class of unmapped reads;
Constructing a set of reference sequences using at least some of the unmapped reads;
Aligning the class of unmapped reads to the constructed set of reference sequences;
Encoding the classified aligned reads as a multiplicity of blocks of descriptors;
Encoding the constructed set of reference sequences; and
Structuring the blocks of descriptors and the encoded reference sequences with header information, thereby creating successive access units,
The encoding method according to claim 1.

The classifying includes identifying a genomic read as belonging to a first "Class P" if the mapped read has no mismatches with respect to the reference sequence used for mapping,
The method according to claim 2.

The classifying further includes identifying a genomic read as belonging to a second "Class N" if mismatches are found only at positions where the sequencing device was unable to call any "base" and the number of mismatches in each read does not exceed a predetermined threshold,
The method of claim 3.

The classifying further includes identifying a genomic read as belonging to a third "Class M" if mismatches are found at positions where the sequencing device was unable to call any "base", referred to as "n-type" mismatches, and/or where it called a "base" different from the reference sequence, referred to as "s-type" mismatches, and the number of mismatches does not exceed the predetermined thresholds for the number of "n-type" mismatches and of "s-type" mismatches, nor a threshold given by a function (f(n, s)),
The method according to claim 4.

The classifying further includes identifying a genomic read as belonging to a fourth "Class I" if it may have mismatches of the same types as "Class M" and, in addition, at least one mismatch of type "insertion" ("i-type"), "deletion" ("d-type"), or soft clip ("c-type"), where the number of mismatches of each type does not exceed the corresponding predetermined threshold, nor a threshold given by a function (w(n, s, i, d, c)),
The method of claim 5.

The classifying further includes identifying genomic reads as belonging to a fifth "Class U", comprising all reads that do not fall into any of the classes P, N, M, and I,
The method of claim 6.
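The classification rules recited above can be illustrated with a short Python sketch. It is a reading of the rules under stated assumptions, not a definitive implementation: the names `Thresholds` and `classify_read` are hypothetical, the functions f and w are supplied by the caller, and applying the "n-type" and "s-type" thresholds to Class I as well is an assumption of the example.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_n: int      # bound on "n-type" mismatches
    max_s: int      # bound on "s-type" mismatches
    max_m: float    # bound on f(n, s)
    max_i: int      # bound on insertions
    max_d: int      # bound on deletions
    max_c: int      # bound on soft clips
    max_tot: float  # bound on w(n, s, i, d, c)


def classify_read(mapped, n, s, i, d, c, t, f, w):
    """Return "P", "N", "M", "I" or "U" for one read, given its mismatch
    counts per type. Reads that match no other class fall into "U"."""
    if not mapped:
        return "U"
    if n == s == i == d == c == 0:
        return "P"                                   # no mismatches at all
    if s == i == d == c == 0:
        return "N" if n <= t.max_n else "U"          # only "n-type" mismatches
    if i == d == c == 0:
        ok = n <= t.max_n and s <= t.max_s and f(n, s) <= t.max_m
        return "M" if ok else "U"                    # "n-type" and/or "s-type"
    ok = (n <= t.max_n and s <= t.max_s and i <= t.max_i and d <= t.max_d
          and c <= t.max_c and w(n, s, i, d, c) <= t.max_tot)
    return "I" if ok else "U"                        # at least one i, d or c


# Hypothetical thresholds and functions, chosen only for the example.
limits = Thresholds(max_n=3, max_s=5, max_m=6, max_i=2, max_d=2, max_c=10, max_tot=12)
total2 = lambda n, s: n + s
total5 = lambda n, s, i, d, c: n + s + i + d + c
```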

Encoded genomic sequence reads are paired,
The method according to claim 7.

The classifying further includes identifying read pairs as belonging to a sixth "Class HM", comprising all read pairs in which one read belongs to class P, N, M, or I and the other read belongs to "Class U",
The method according to claim 8.

Further comprising identifying whether the two mate reads are classified in the same class (each of P, N, M, I, U) and, if so, assigning the pair to that class;
Identifying whether the two mate reads are classified in different classes and, if neither belongs to "Class U", assigning the pair to the class with the highest priority according to the following expression:
P < N < M < I
where "Class P" has the lowest priority and "Class I" has the highest priority; and
Identifying whether only one of the two mate reads has been classified as belonging to "Class U" and, if so, classifying the pair as belonging to "Class HM",
The encoding method according to claim 9.
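The pair-assignment rule recited above can be written as a three-way decision. In the following Python sketch the names `PRIORITY` and `classify_pair` are hypothetical; the priority order P < N < M < I and the "Class HM" rule are taken from the text.

```python
# Priority of the mapped classes: P is lowest, I is highest.
PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}


def classify_pair(class_a, class_b):
    """Return the class assigned to a read pair from the classes of its mates."""
    for cls in (class_a, class_b):
        if cls not in PRIORITY and cls != "U":
            raise ValueError(f"unknown read class: {cls}")
    if class_a == class_b:
        return class_a            # both mates in the same class (including U)
    if "U" in (class_a, class_b):
        return "HM"               # exactly one mate is unmapped
    # Different mapped classes: the higher-priority class wins.
    return max(class_a, class_b, key=PRIORITY.__getitem__)
```

For instance, a pair whose mates are in classes N and I is assigned to class I, and a pair with one mate in class M and one in class U is assigned to class HM.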

Each class N, M, and I of reads is further partitioned into two or more subclasses (296, 297, 298) according to a vector of thresholds (292, 293, 294) defined for each class N, M, and I, respectively, by the number of "n-type" mismatches (292), the function f(n, s) (293), and the function w(n, s, i, d, c) (294),
The method according to claim 10.

Further comprising identifying whether the two mate reads are classified in the same subclass and, if so, assigning the pair to that subclass;
Identifying whether the two mate reads are classified in subclasses of different classes and, if so, assigning the pair to the subclass belonging to the class with the higher priority according to the following expression:
N < M < I
where N has the lowest priority and I has the highest priority; and
Identifying whether the two mate reads are classified in the same class, that class being N, M, or I, but in different subclasses and, if so, assigning the pair to the subclass with the highest priority according to the following expressions:
N_1 < N_2 < ... < N_k
M_1 < M_2 < ... < M_j
I_1 < I_2 < ... < I_h
where the highest index has the highest priority,
The method according to claim 11.

Information about the mapping position of each read is encoded by a "pos" descriptor block,
The method according to claim 12.

Information about the strandedness of each read (i.e., the DNA strand from which the read was sequenced) is encoded by an "rcomp" descriptor block,
The method according to claim 13.

The pairing information of the paired-end read is encoded by a “pair” descriptor block.
The method according to claim 14.

Additional alignment information, such as whether the read is mapped in a proper pair, fails platform/vendor quality checks, is a PCR or optical duplicate, or is a supplementary alignment, is encoded by a "flags" descriptor block,
The method according to claim 15.

Information about the positions of unknown bases is encoded by an "nmis" descriptor block,
The method of claim 16.

Information about the positions of substitutions is encoded by an "snpp" descriptor block,
The method according to claim 17.

Information about the type of substitution is coded by a particular "snpt" descriptor block,
The method according to claim 18.

Information about the location of the mismatch, substitution, insertion or deletion is coded by an "indp" descriptor block,
The method according to claim 19.

Information about the type of substitution, insertion, or deletion mismatch is encoded by an "indt" descriptor block;
The method according to claim 20.

Information about the clipped bases of the mapped read is encoded by an "indc" descriptor block,
A method according to claim 21.

Information about unmapped reads is encoded by a "ureads" descriptor block,
The method according to claim 22.

Information about the type of reference sequence used for encoding is encoded by an "rtype" descriptor block;
A method according to claim 23.

Information about the multiple alignment of the mapped read is encoded by a "mmmap" descriptor block;
A method according to claim 24.

Information about the spliced alignment and multiple alignment of the same read is encoded by an "msar" descriptor block and an "mmp" descriptor block;
A method according to claim 25.

Information about the alignment score of the read is encoded by an “mscore” descriptor block;
The method of claim 26.

Information about the group to which a read belongs is encoded by an "rgroup" descriptor block,
A method according to claim 27.

A class P access unit is constructed using a block of descriptors of type "pos", "rcomp", and "flags".
The method according to claim 28.

The access unit of class P encodes paired-end pairing information using a block of "pair" descriptors,
A method according to claim 29.

Class N access units are constructed using the same block of class P access unit descriptors, in addition to using the "nmis" descriptor block for information on the location of unknown bases,
The method according to claim 30.

A class M access unit is constructed using the same blocks of descriptors as a class P access unit, in addition to the "snpp" and "snpt" descriptor blocks for information about the positions and types of substitutions,
The method according to claim 30.

A class I access unit is constructed using the same blocks of descriptors as a class P access unit, in addition to the "indp", "indt", and "indc" descriptor blocks for information about the positions and types of substitutions, insertions, deletions, and clipped bases,
The method according to claim 30.

A class HM access unit is constructed using the same block of descriptors of the class I access unit for the mapped read and the block of the "ureads" descriptor for the unmapped read,
A method according to claim 33.

Information about the multiple alignment is conveyed using blocks of “mmmap” and “msar” descriptors,
A method according to claim 33.

Information about the spliced alignments is transmitted in an extended cigar string comprising:
• a symbol to indicate matching bases: =
• a symbol to indicate insertions: +
• a symbol to indicate deletions
• a symbol to indicate a splice on the forward strand: /
• a symbol to indicate a splice on the reverse strand: %
• a symbol to indicate an undirected splice: *
• a text character from the IUPAC code for DNA to indicate substitutions
• a symbol (n) to indicate n soft-clipped bases, where n is an integer
• a symbol [n] to indicate n hard-clipped bases, where n is an integer,
A method according to claim 35.

The blocks of descriptors include a "master index table" containing one section for each class and subclass of aligned reads, each section containing the mapping positions, on the one or more reference sequences, of the first read of each access unit of each class or subclass of data, the "master index table" and the access unit data both being coded,
The method of claim 36.
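
As a non-limiting illustration of how such a table could support selective access, the following Python sketch looks up which access units may contain reads mapped to a given region. The in-memory layout, the example positions, and the name `access_units_for_region` are hypothetical; the sketch also assumes a single reference sequence and access units ordered by the mapping position of their first read.

```python
from bisect import bisect_right

# HYPOTHETICAL layout: one section per class, each listing the mapping
# position of the first read of every access unit, in ascending order.
master_index_table = {
    "P": [0, 10_000, 25_000],
    "M": [0, 12_500],
}


def access_units_for_region(table, read_class, start, end):
    """Return indices of access units of `read_class` that may hold reads
    whose mapping position lies in the closed interval [start, end]."""
    first_positions = table[read_class]
    # The unit whose first read starts at or before `start` may still
    # contain reads inside the region, so the scan begins there.
    lo = max(bisect_right(first_positions, start) - 1, 0)
    # Units whose first read starts after `end` cannot contribute.
    hi = bisect_right(first_positions, end)
    return list(range(lo, hi))
```

Only the access units returned by the lookup need to be fetched and decoded, which is the benefit of indexing the first mapping position of each unit.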

The blocks of descriptors further include the type of reference used (existing or constructed) and information about the segments of the reads that are not mapped to the reference sequence;
The method of claim 37.

The reference sequence is first transformed into a different reference sequence by applying substitutions, insertions, deletions, and clipping, and the coding of the classified aligned reads as a multiplicity of blocks of descriptors then refers to the transformed reference sequence,
The method according to claim 38.
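
As a non-limiting illustration, a reference transformation of this kind can be pictured as a list of edits applied to the original sequence. The Python sketch below is an assumption made for explanation only: the tuple format `(position, kind, payload)`, the kind names, and the function name `transform_reference` are hypothetical, positions are 0-based on the original reference, and clipping is omitted for brevity.

```python
def transform_reference(reference, edits):
    """Apply substitutions, insertions, and deletions to `reference`.

    Each edit is (position, kind, payload), where kind is:
      "sub": payload is the single replacement base
      "ins": payload is the string inserted before `position`
      "del": payload is the number of bases removed from `position`
    """
    out, cursor = [], 0
    for pos, kind, payload in sorted(edits, key=lambda e: e[0]):
        if pos < cursor:
            raise ValueError(f"edit at {pos} overlaps a previous edit")
        out.append(reference[cursor:pos])  # copy the untouched stretch
        cursor = pos
        if kind == "sub":
            out.append(payload)
            cursor = pos + 1
        elif kind == "ins":
            out.append(payload)
        elif kind == "del":
            cursor = pos + payload
        else:
            raise ValueError(f"unknown edit kind: {kind!r}")
    out.append(reference[cursor:])  # copy the remaining tail
    return "".join(out)
```

Reads classified against the transformed sequence may need fewer mismatch descriptors than reads classified against the original, which is the motivation for applying the transformation before coding.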

The same transformation is applied to the reference sequence used for all classes of data,
The method of claim 39.

Different transformations are applied to the reference sequence used for each class of data;
The method according to claim 40.

The transformations of the reference sequence are coded as blocks of descriptors and structured with header information, thereby creating successive access units;
The method of claim 41.

The coding of the classified aligned reads and of the associated reference sequence transformations as a multiplicity of blocks of descriptors includes associating a specific source model and a specific entropy coder with each block of descriptors,
The method according to claim 42.

The entropy coder is one of a context adaptive arithmetic coder, a variable length coder, and a Golomb coder.
A method according to claim 43.
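
As a non-limiting illustration of one of the coders named above, the following Python sketch implements textbook Golomb coding of a non-negative integer with divisor `m`: the quotient is written in unary and the remainder in truncated binary. The bit-string representation, the unary convention (ones terminated by a zero), and the function names are assumptions made here; the claim does not fix any of these details.

```python
def _fixed_width(value, width):
    """Return `value` as a binary string of exactly `width` bits ("" if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(n, m):
    """Encode non-negative integer `n` with Golomb parameter `m` as a bit string."""
    if n < 0 or m < 1:
        raise ValueError("n must be >= 0 and m must be >= 1")
    q, r = divmod(n, m)
    bits = "1" * q + "0"              # unary quotient
    b = (m - 1).bit_length()          # ceil(log2(m))
    cutoff = (1 << b) - m             # remainders below this use b - 1 bits
    if r < cutoff:
        return bits + _fixed_width(r, b - 1)
    return bits + _fixed_width(r + cutoff, b)


def golomb_decode(bits, m):
    """Decode a single Golomb code word produced by `golomb_encode`."""
    q = i = 0
    while bits[i] == "1":             # read the unary quotient
        q += 1
        i += 1
    i += 1                            # skip the terminating zero
    b = (m - 1).bit_length()
    if b == 0:                        # m == 1: the remainder is always 0
        return q
    cutoff = (1 << b) - m
    r = int(bits[i:i + b - 1], 2) if b > 1 else 0
    i += b - 1
    if r >= cutoff:                   # long remainder: read one more bit
        r = ((r << 1) | int(bits[i])) - cutoff
    return q * m + r
```

Small values receive short code words, so a coder of this family suits descriptors whose values are mostly small, such as differences between successive mapping positions.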

A method of decoding encoded genomic data, said method comprising:
Parsing access units containing the encoded genomic data to extract multiplexed blocks of descriptors using the header information;
Decoding the multiplexed blocks of descriptors to extract reads according to specific matching rules that define their classification with respect to one or more reference sequences.

Further comprising decoding a master index table containing one section for each class of reads and the associated mapping positions;
The decoding method according to claim 45.

Further comprising decoding information about the type of reference used: existing, transformed, or constructed;
The decoding method according to claim 46.

Further comprising decoding information about one or more transformations applied to the existing reference sequence,
A decoding method according to claim 47.

The block of descriptors is entropy decoded,
The decoding method according to claim 48.

Class P reads are obtained by decoding blocks of descriptors of each of the types "pos", "rcomp", "flags", and "rlen",
Class N reads are obtained by decoding blocks of descriptors of each of the types "pos", "rcomp", "flags", "rlen", and "nmis",
Class M reads are obtained by decoding blocks of descriptors of each of the types "pos", "rcomp", "flags", "rlen", "snpp", and "snpt",
Class I reads are obtained by decoding blocks of descriptors of each of the types "pos", "rcomp", "flags", "rlen", "indp", "indt", and "indc",
Class U reads are obtained by decoding blocks of descriptors of each of the types "pos", "rcomp", "flags", "rlen", "snpp", "snpt", "indc", "ureads", and "rtype",
The method according to claim 49.

Reads of classes P, N, M, and I are also obtained by decoding a block of "pair" descriptors,
Reads of class HM are also obtained by decoding blocks of the descriptors "pos", "rcomp", "flags", "rlen", "indp", "indt", "indc", and "ureads",
A decoding method according to claim 50.
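
As a non-limiting illustration, the descriptor types listed in the two preceding claims can be collected in a lookup table that tells a decoder which blocks to read for each class. The Python names below (`DESCRIPTORS_BY_CLASS`, `blocks_to_decode`) and the `paired` flag are hypothetical; only the descriptor names and their assignment to classes are taken from the claims.

```python
# Descriptors shared by every class listed in the claims.
COMMON = ["pos", "rcomp", "flags", "rlen"]

DESCRIPTORS_BY_CLASS = {
    "P": COMMON,
    "N": COMMON + ["nmis"],
    "M": COMMON + ["snpp", "snpt"],
    "I": COMMON + ["indp", "indt", "indc"],
    "U": COMMON + ["snpp", "snpt", "indc", "ureads", "rtype"],
    "HM": COMMON + ["indp", "indt", "indc", "ureads"],
}

# Classes for which the claims also name the "pair" descriptor.
PAIR_CLASSES = {"P", "N", "M", "I"}


def blocks_to_decode(read_class, paired=False):
    """Return the descriptor block names needed to rebuild reads of `read_class`.

    ASSUMPTION: `paired` models whether the "pair" block is present; the
    claims only state that classes P, N, M, and I may also use it.
    """
    names = list(DESCRIPTORS_BY_CLASS[read_class])
    if paired and read_class in PAIR_CLASSES:
        names.append("pair")
    return names
```

A decoder that needs only class P reads can therefore skip every block other than the four common ones, which is what makes class-selective decoding inexpensive.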

A genomic encoder (210) for compressing genomic sequence data (209) comprising reads of nucleotide sequences, the genomic encoder (210) comprising:
An aligner unit (201) configured to align the reads to one or more reference sequences, thereby creating aligned reads;
A constructed reference generation unit (202) configured to generate the constructed reference sequence;
A data classification unit (204) configured to classify the aligned reads according to specific matching rules against one or more existing or constructed reference sequences, thereby creating classes of aligned reads (208);
One or more block coding units (205-207) configured to code the classified aligned reads as blocks of descriptors by selecting the descriptors according to the class of the aligned reads;
A multiplexer (2016) for multiplexing the compressed genomic data and metadata.

Further comprising a reference sequence conversion unit (2019) configured to convert the existing references and the data classes (208) into converted data classes (2018);
A genome encoder according to claim 52.

Said data classification unit (204) includes encoders of data classes N, M, and I configured with a vector of thresholds that generates subclasses of data classes N, M, and I;
A genomic encoder according to claim 53.

The reference conversion unit (2019) applies the same reference conversion (300) to all classes and subclasses of data;
The genome encoder according to claim 54.

The reference conversion unit (2019) applies different reference conversions (301, 302, 303) to different classes and subclasses of data;
The genome encoder according to claim 54.

Further comprising coding means suitable for performing the coding method according to claim 12.
The genome encoder according to claim 54.

A genome decoder (218) for decompressing the compressed genome stream (211), wherein the genome decoder (218) comprises:
A demultiplexer (210) for demultiplexing the compressed genomic data and metadata;
Parsing means (212-214) configured to parse the compressed genome stream into genomic blocks (215) of descriptors;
One or more block decoders (216-217) configured to decode the genomic blocks of descriptors into classified reads of sequences of nucleotides (211);
A genomic data class decoder configured to selectively decode the classified reads of sequences of nucleotides on one or more reference sequences so as to generate uncompressed reads of sequences of nucleotides.

Further comprising a reference transform decoder (2113) configured to decode the reference transform descriptor (2112) and generate a transformed reference (2114) used by the genomic data class decoder (219).
A genomic decoder according to claim 58.

The one or more reference sequences are stored in a compressed genome stream (211);
A genomic decoder according to claim 59.

The one or more reference sequences are provided to the decoder via an out-of-band mechanism;
A genomic decoder according to claim 59.

The one or more reference sequences are constructed at a decoder;
A genomic decoder according to claim 59.

The one or more reference sequences are transformed at the decoder by the reference transform decoder (2113);
A genomic decoder according to claim 59.

A computer readable medium comprising instructions for causing at least one processor to execute the coding method of claim 12.

A computer readable medium comprising instructions for causing at least one processor to execute the decoding method of claim 59.

A data support storing genomic data encoded according to the method of claim 12.