JPWO2018235938A1

JPWO2018235938A1 - Methods for sequencing and analyzing nucleic acids

Info

Publication number: JPWO2018235938A1
Application number: JP2019525702A
Authority: JP
Inventors: 克之城口
Original assignee: RIKEN
Current assignee: RIKEN
Priority date: 2017-06-23
Filing date: 2018-06-22
Publication date: 2020-04-23
Anticipated expiration: 2038-06-22
Also published as: WO2018235938A1; JP7160349B2

Abstract

The present invention provides methods for correcting errors that arise in digital quantification of nucleic acids using molecular barcodes. Specifically, it provides a method for identifying mispairings between an index sequence and a molecular barcode according to detection frequency, a method for identifying molecular barcodes that carry base substitutions and are classified into the same cluster, and a method for identifying molecular barcodes that carry insertions or deletions by using molecular barcodes containing both fixed bases and random bases.

Description

Reference to related applications

This application claims the benefit of priority of U.S. Provisional Application No. 62/523857 (filing date: June 23, 2017), the entire contents of which are incorporated herein by reference.

The present invention relates to methods for sequencing and analyzing nucleic acids.

With the development of next-generation sequencing platforms, it has become possible to analyze the sequences of a very large number of nucleic acid species in parallel in a single run. When a unique molecular barcode is added to each nucleic acid molecule present in a sample, the number of distinct molecular barcodes can be made to correspond to the number of nucleic acid molecules, and next-generation sequencing platforms have thereby opened the way to digital quantification of nucleic acid molecules (Patent Document 1 and Non-Patent Document 1). By composing the molecular barcode of random bases and lengthening its sequence, great diversity can easily be given to the barcode sequences, which has expanded the dynamic range of nucleic acid molecules that can be digitally quantified (Patent Document 1 and Non-Patent Document 1).

In digital quantification, however, the sequence of a molecular barcode may change during the analysis, and the newly generated molecular barcode can affect the accuracy with which nucleic acid molecules are quantified. When the molecular barcode sequences are randomly designed, it is difficult to recognize that a sequence has changed. Moreover, because the molecular barcode sequences are random, it has been difficult to analyze what kinds of errors can occur in digital quantification, and likewise difficult to propose solutions.

Patent Document 1: U.S. Patent No. 9,260,753

Non-Patent Document 1: Shiroguchi, K. et al., Proc. Natl. Acad. Sci. USA 109, 1347-1352 (2012).

The present invention provides methods for sequencing and analyzing nucleic acids.

The present inventors have found that, in a method for digitally quantifying a target nucleic acid molecule using an index and a barcode, when a plurality of samples are mixed and the target nucleic acid molecule is quantified, mis-indexing can occur, in which an index is attached to a nucleic acid derived from an unintended, different sample. The present inventors have also found that, when two different indexes are attached to the same barcode, the accuracy of digital quantification can be improved by regarding the most frequent pair as the correct pair and excluding any or all of the other pairs as mis-indexed.
The present inventors have found that, when the number of distinct barcode sequences is counted, mutations (for example, insertions, substitutions, and deletions) arising within a barcode sequence can cause sequences that should be judged identical to be recognized as different sequences. The present inventors have found that the accuracy of digital quantification can be improved by clustering sequences having a certain degree of sequence similarity into one group and quantifying the target nucleic acid molecule based on the number of clusters.
The present inventors have found that a template can be misidentified when nucleic acids are digitally counted. The present inventors have also found that, when two different target nucleic acid sequences are attached to the same barcode, the accuracy of digital quantification can be improved by regarding the most frequent pair as the correct pair and excluding any or all of the other pairs as misidentifications.
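
The frequency rule described above (for a barcode seen with more than one index, keep the most frequent pair and treat the rest as mis-indexed) can be illustrated with a minimal Python sketch. The function name `resolve_index_pairs`, the input layout (one `(index, barcode)` tuple per read), and the tie-breaking behavior are assumptions made for illustration only; they are not taken from the specification.

```python
from collections import Counter, defaultdict


def resolve_index_pairs(reads):
    """Pick the most frequent index for each barcode.

    reads: iterable of (index, barcode) tuples, one per sequencing read
           (hypothetical input layout).
    Returns (correct, mispairs):
      correct  - dict mapping barcode -> index judged to be the correct partner
      mispairs - list of (index, barcode) pairs judged to be mis-indexed
    """
    pair_counts = Counter(reads)

    # Regroup the pair counts so that each barcode lists the indexes seen with it.
    by_barcode = defaultdict(dict)
    for (index, barcode), n in pair_counts.items():
        by_barcode[barcode][index] = n

    correct = {}
    mispairs = []
    for barcode, index_counts in by_barcode.items():
        # Assumption: on a tie, the index encountered first wins.
        best = max(index_counts, key=index_counts.get)
        correct[barcode] = best
        mispairs.extend((idx, barcode) for idx in index_counts if idx != best)
    return correct, mispairs


if __name__ == "__main__":
    example = [("S1", "AAAA")] * 5 + [("S2", "AAAA")] + [("S2", "CCCC")] * 3
    correct, mispairs = resolve_index_pairs(example)
    print(correct)   # {'AAAA': 'S1', 'CCCC': 'S2'}
    print(mispairs)  # [('S2', 'AAAA')]
```

In this sketch every pair other than the most frequent one is excluded; the text also allows excluding only some of the lower-frequency pairs, which would require an additional threshold.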

That is, according to the present invention, the following inventions are provided.
(1A) A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of target nucleic acid molecules, to which a molecular barcode and an index have been added, to sequencing to obtain sequence information;
(II) selecting, from the sequence information obtained in (I), sequences having a specific index or sequences similar thereto, and/or sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences; and
(III) determining, in the group created in (II), the index and molecular barcode pair with the highest detection frequency to be the correct pair of index and molecular barcode.
(2A) The method according to (1A), wherein the target nucleic acid molecules to which at least the molecular barcode has been added are subjected to amplification before step (I).
(3A) The method according to (1A) or (2A), wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is a sequence whose molecular barcode portion contains no more than a predetermined number of mismatched bases relative to the sequence having the specific molecular barcode.
(4A) The method according to any one of (1A) to (3A), wherein the molecular barcode has a fixed base at a specific position.
(5A) The method according to (4A), wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is selected based on whether it contains the fixed base at the specific position and/or whether the position of the fixed base is shifted from the specific position.
(6A) The method according to (4A), further comprising excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position.
(7A) The method according to any one of (1A) to (5A), wherein, in step (III), index and molecular barcode pairs other than the determined correct pair are determined to be mispairs of index and molecular barcode and are excluded.
(8A) The method according to any one of (1A) to (7A), further comprising determining the number of target nucleic acid molecules contained in the sample from which the target nucleic acid molecules are derived, based on the number of groups created from sequences having a specific molecular barcode or sequences similar thereto.
(9A) A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules, to which a molecular barcode has been added, to sequencing to obtain sequence information; and
(II) selecting, from the sequence information obtained in (I), sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences.
(10A) The method according to (9A), wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is a sequence whose molecular barcode portion contains no more than a predetermined number of mismatched bases relative to the sequence having the specific molecular barcode.
(11A) The method according to (9A) or (10A), wherein the molecular barcode has a fixed base at a specific position.
(12A) The method according to (11A), wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is selected based on whether it contains the fixed base at the specific position and/or whether the position of the fixed base is shifted from the specific position.
(13A) The method according to (11A), further comprising excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position.
(14A) The method according to any one of (9A) to (13A), further comprising determining the number of target nucleic acid molecules contained in the sample from which the target nucleic acid molecules are derived, based on the number of groups created from sequences having a specific molecular barcode or sequences similar thereto.
(15A) The method according to any one of (9A) to (14A), wherein the target nucleic acid molecules to which at least the molecular barcode has been added are subjected to amplification before step (I).
(16A) A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules, to which a molecular barcode having a fixed base at a specific position has been added, to sequencing to obtain sequence information; and
(IIa) excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position;
(IIb) obtaining, in step (I) or after step (I), sequence information consisting of sequences that contain the fixed base at the specific position; or
(IIc) further comprising, as step (II), selecting from the sequence information obtained in (I) sequences having a specific molecular barcode or sequences similar thereto and creating a group from the selected sequences, and obtaining, in step (II) or after step (II), a group consisting of sequences that contain the fixed base at the specific position.
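
As an illustration of items (9A), (10A), and (14A), the following minimal Python sketch groups barcodes that differ by no more than a predetermined number of mismatches and uses the number of groups as the molecule count. The greedy strategy (most abundant barcode first, each later barcode joining the first seed within the threshold), the function names, and the default threshold of one mismatch are assumptions chosen for illustration; this section does not prescribe a particular clustering algorithm.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(barcode_reads, max_mismatch=1):
    """Greedy clustering of barcode sequences (illustrative assumption).

    barcode_reads: iterable of barcode strings, one per read.
    Returns a dict mapping each cluster seed to the barcodes assigned to it.
    """
    counts = Counter(barcode_reads)
    clusters = {}
    # Visit barcodes from most to least abundant, so that a rare variant
    # is absorbed by the abundant barcode it most likely derives from.
    for barcode, _ in counts.most_common():
        for seed in clusters:
            if len(seed) == len(barcode) and hamming(seed, barcode) <= max_mismatch:
                clusters[seed].append(barcode)
                break
        else:
            clusters[barcode] = [barcode]
    return clusters


if __name__ == "__main__":
    reads = ["ACGT"] * 10 + ["ACGA"] + ["TTTT"] * 4
    clusters = cluster_barcodes(reads, max_mismatch=1)
    print(clusters)       # {'ACGT': ['ACGT', 'ACGA'], 'TTTT': ['TTTT']}
    print(len(clusters))  # 2 groups, i.e. an estimate of 2 original molecules
```

Barcodes of a different length than every seed start their own cluster in this sketch; handling insertions and deletions requires additional information such as the fixed bases of item (11A).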

According to the present invention, the following inventions are also provided.
(1B) A method for determining correct pairs or mispairs of an index and a molecular barcode added to target nucleic acid molecules, from sequence information obtained by sequencing a mixture of a plurality of samples containing target nucleic acid molecules to which an index unique to each sample containing a plurality of nucleic acid molecules and a molecular barcode that is unique to each nucleic acid molecule or is arbitrary have been added, the method comprising:
(E) selecting, from the obtained sequence information, sequences having a specific index or sequences similar thereto, sequences having a specific molecular barcode or sequences similar thereto, or sequences containing the target nucleic acid molecule or sequences similar thereto, and creating a group from the selected sequences; and
(F) in the group created in (E), determining the index and molecular barcode pair with the highest detection frequency to be the correct pair of index and molecular barcode, and/or determining at least one or all of the index and molecular barcode pairs with a low detection frequency to be mispairs of index and molecular barcode.
(2B) The method according to (1B), wherein:
in step (E), sequences having a specific index are selected and a group is created for each index; and
in step (F), for nucleic acid sequences having a molecular barcode that appears in a plurality of groups, the barcode and index pair with the largest number of reads is determined to be the correct pair of barcode and index, or the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode.
(3B) The method according to (1B), wherein:
in step (E), sequences having a specific molecular barcode are selected and a group is created for each molecular barcode; and
in step (F), the index and molecular barcode pair with the highest detection frequency within a created group is determined to be the correct pair of index and molecular barcode.
(4B) The method according to (1B), wherein:
in step (E), sequences containing the sequence of the target nucleic acid molecule are selected to create a group; and
in step (F), sequences having a specific index are further selected from the group to create subgroups, and, for nucleic acid sequences having a molecular barcode that appears in a plurality of subgroups, the barcode and index pair with the largest number of reads is determined to be the correct pair of barcode and index, or the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode.
(5B) The method according to (1B), wherein:
in step (E), sequences containing the sequence of the target nucleic acid molecule are selected to create a group; and
in step (F), molecules having a specific molecular barcode are further selected from the group to create subgroups, and, within one created subgroup, the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode.
(6B) The method according to any one of (2B) to (5B), wherein, in step (F), at least one or all of the index and molecular barcode pairs other than the determined correct pair are determined to be mispairs of index and molecular barcode.
(7B) The method according to (1B), wherein:
in step (E), molecules having a specific index are selected and a group is created for each index; and
in step (F), for sequences having a molecular barcode that appears in a plurality of groups, at least one or all of the index and molecular barcode pairs with a low detection frequency are determined to be mispairs of index and molecular barcode.
(8B) The method according to (1B), wherein:
in step (E), sequences having a specific molecular barcode are selected and a group is created for each molecular barcode; and
in step (F), at least one or all of the index and molecular barcode pairs with a low detection frequency within a created group are determined to be mispairs of index and molecular barcode.
(9B) The method according to (1B), wherein:
in step (E), sequences containing the target nucleic acid molecule are selected to create a group; and
in step (F), molecules having a specific index are further selected from the group to create subgroups, and, for nucleic acid molecules having a molecular barcode that appears in a plurality of subgroups, at least one or all of the index and molecular barcode pairs with a low detection frequency are determined to be mispairs of index and molecular barcode.
(10B) The method according to (1B), wherein:
in step (E), molecules containing the target nucleic acid molecule are selected to create a group; and
in step (F), molecules having a specific molecular barcode are further selected from the group to create subgroups, and, within one created subgroup, at least one or all of the index and molecular barcode pairs with a low detection frequency are determined to be mispairs of index and molecular barcode.
(11B) The method according to any one of (1B) to (10B), wherein, in step (E), the group is created by clustering into one group molecules presumed, on the basis of sequence identity or similarity, to have had the same sequence.
(12B) The method according to (11B), wherein, in step (E), the clustering is performed by:
(i) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence identical to the sequence of a unique molecular barcode;
(ii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 1 base of mismatch relative to the sequence of a unique molecular barcode;
(iii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 2 bases of mismatch relative to the sequence of a unique molecular barcode; or
(iv) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 3 bases of mismatch relative to the sequence of a unique molecular barcode.
(13B) The method according to (11B) or (12B), wherein, in step (E), the clustering is performed by classifying into the same cluster nucleic acid molecules having a sequence that was sequenced as having a base insertion or deletion (indel) in the molecular barcode portion.
(14B) The method according to (11B) or (12B), wherein, in step (E), the clustering is performed on the nucleic acid molecules that remain after excluding sequences that were sequenced as having a base insertion or deletion (indel) in the molecular barcode portion.
(15B) The method according to (13B) or (14B), further comprising identifying the base insertion or deletion by a difference between the position of each of one or more fixed bases placed in all molecular barcode sequences linked to the nucleic acid molecules and the position of each of the one or more fixed bases in the sequenced molecular barcode portion.
(16B) A method for determining the number of target nucleic acid molecules contained in a specific original sample, from sequence information obtained by sequencing a mixture of a plurality of samples containing target nucleic acid molecules to which an index unique to each sample containing a plurality of nucleic acid molecules and a molecular barcode that is unique to each nucleic acid molecule or is arbitrary have been added, the method comprising:
(e) selecting, from the obtained sequence information, nucleic acid molecules containing the sequence of the target nucleic acid molecule;
(f) clustering the nucleic acid molecules selected in (e) by unique molecular barcode sequence, and then identifying clusters that have a plurality of sequences in the index portion; and
(g) in each cluster identified in (f), identifying the index and molecular barcode pair with the highest detection frequency as a correctly indexed target nucleic acid molecule, and determining the other index and molecular barcode pairs to be mispairs,
wherein the number of distinct unique molecular barcode sequences linked to correctly indexed target nucleic acid molecules (or the number of clusters of correctly indexed target nucleic acid molecules) is the number of target nucleic acid molecules contained in the sample corresponding to that index.
(17B) The method according to (16B), wherein, in (f), the clustering is performed by:
(i) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence identical to the sequence of a unique molecular barcode;
(ii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 1 base of mismatch relative to the sequence of a unique molecular barcode;
(iii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 2 bases of mismatch relative to the sequence of a unique molecular barcode; or
(iv) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 3 bases of mismatch relative to the sequence of a unique molecular barcode.
(18B) The method according to (16B) or (17B), wherein, in (e), the clustering is performed by classifying into the same cluster nucleic acid molecules having a sequence that was sequenced as having a base insertion or deletion (indel) in the molecular barcode portion.
(19B) The method according to (16B) or (17B), wherein, in (e), the clustering is performed on the nucleic acid molecules that remain after excluding sequences that were sequenced as having a base insertion or deletion (indel) in the molecular barcode portion.
(20B) The method according to (18B) or (19B), further comprising identifying the base insertion or deletion by a difference between the position of each of one or more fixed bases placed in all molecular barcode sequences linked to the nucleic acid molecules and the position of each of the one or more fixed bases in the sequenced molecular barcode portion.
(21B) In a digital quantification method of a target nucleic acid molecule using a barcode sequence, based on the information of the obtained nucleic acid sequence, the sequence possessed by the molecular barcode after mutation is grouped together with other sequences having sequence similarity. A method of estimating the number of target nucleic acid molecules based on the obtained number of clusters.
(22B) The method according to (21B) above, wherein the clustering is
(I) In the sequence of the molecular barcode portion, a nucleic acid molecule group having the same sequence as the unique molecular barcode sequence is classified into the same cluster;
(Ii) In the sequence of the molecular barcode part, the nucleic acid molecule group having a sequence having a mismatch of up to 1 base with the sequence of the unique molecular barcode is classified into the same cluster;
(Iii) In the sequence of the molecular barcode portion, a nucleic acid molecule group having a unique sequence of the molecular barcode and a sequence having a mismatch of up to 2 bases is classified into the same cluster; or (iv) the molecular barcode The method is performed by grouping nucleic acid molecules having a sequence having a mismatch of up to 3 bases with a sequence of a unique molecular barcode in a partial sequence into the same cluster.
(23B) The method according to (21B) or (22B) above, wherein the clustering comprises inserting bases (for example, up to 1 base, up to 2 bases, or up to 3 bases) in the sequence of the molecular barcode portion, or A method, which is performed by grouping nucleic acid molecules having sequences sequenced as having an indel into the same cluster.
(24B) The method according to any one of (21B) to (23B) above,
The position of each of one or more fixed bases arranged in all the molecular barcode sequences linked to the nucleic acid molecule and the position of each of the one or more fixed bases in the sequence of the sequenced molecular barcode sequence portion The insertion or deletion (indel) of the base by identifying by comparing
Clustering is performed by classifying into the same cluster nucleic acid molecules having a sequence sequenced as having a base insertion or deletion (indel) in the sequence of the molecular barcode portion, or
A method in which clustering is performed on a group of nucleic acid molecules obtained by excluding a sequence sequenced as having a base insertion or deletion (indel) in a sequence of a molecular barcode portion.
(25B) A method for detecting a base insertion or deletion (indel) in a barcode in a digital quantification method for a target nucleic acid molecule using a barcode sequence, the method comprising detecting the base insertion or deletion (indel) by comparing the position of each of one or more fixed bases arranged in all the molecular barcode sequences linked to the nucleic acid molecules with the position of each of the one or more fixed bases in the sequence of the sequenced molecular barcode portion.
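The fixed-base comparison described in (24B) and (25B) can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the examples of this document: the function name `classify_barcode`, the 0-based position map `fixed`, the `max_shift` parameter and the three result labels are all assumptions introduced here. The idea is that an insertion or deletion upstream of a fixed base shifts that base away from its designed position, whereas a read without an indel shows every fixed base where it was designed to be.

```python
from typing import Dict, Tuple


def classify_barcode(read: str, fixed: Dict[int, str], max_shift: int = 3) -> Tuple[str, int]:
    """Classify a sequenced barcode by the positions of its fixed bases.

    `fixed` maps a designed 0-based position to the expected fixed base
    (a hypothetical layout chosen for this sketch).
    Returns ("ok", 0) when every fixed base is at its designed position,
    ("indel", k) when the fixed bases from the first mismatching one onward
    are all found shifted by k (k > 0: insertion, k < 0: deletion),
    and ("unresolved", 0) otherwise (for example, a substitution at a fixed base).
    """
    positions = sorted(fixed)

    def matches(pos: int, shift: int) -> bool:
        i = pos + shift
        return 0 <= i < len(read) and read[i] == fixed[pos]

    mismatched = [p for p in positions if not matches(p, 0)]
    if not mismatched:
        return ("ok", 0)

    # Only fixed bases at or after the first mismatch are expected to be shifted.
    downstream = [p for p in positions if p >= mismatched[0]]
    for k in range(1, max_shift + 1):
        for shift in (k, -k):
            if all(matches(p, shift) for p in downstream):
                return ("indel", shift)
    return ("unresolved", 0)
```

Reads classified as `"indel"` could then either be kept and clustered with their parent barcode or be excluded before clustering, which corresponds to the two alternatives listed in (24B). A random base that happens to equal a neighbouring fixed base can make a single fixed position ambiguous, which is one reason a design with several fixed bases is easier to check than a design with one.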

本発明によればまた、以下の発明が提供される。
（１Ｃ）複数の核酸分子を含むサンプル毎に固有のインデックス（インデックス配列核酸分子を意味し、各サンプルに固有であれば複数種のインデックス核酸分子を含んでいてもよい）及び各核酸分子に固有のまたは任意の分子バーコード（バーコード配列核酸分子）が付加された目的核酸分子（例えば、ＤＮＡまたはＲＮＡ）を含む複数のサンプルの混合物を用いたシークエンシング（すなわち、マルチプレックスシークエンシング）より得られた配列情報から、目的核酸分子に付加されたインデックスと分子バーコードの正しいペア又はミスペアを決定する方法であって、
（Ａ）核酸分子（例えば、ＤＮＡまたはＲＮＡ）を含む複数のサンプルを別々に取得する工程と｛サンプルの少なくとも１つには目的核酸分子が含まれる｝、
（Ｂ）｛例えば、得られた複数のサンプルそれぞれにおいて、｝サンプルに含まれる核酸分子を増幅する前に、目的核酸分子それぞれに各核酸分子に固有のまたは任意の分子バーコードを連結して、それぞれ異なる分子バーコードが連結した目的核酸分子を得る工程と、
（Ｃ）｛例えば、複数のサンプルを混合する前に、｝複数の目的核酸分子を含むサンプル毎に固有のインデックスを目的核酸分子に付加し、由来するサンプル毎に異なるインデックスが連結した目的核酸分子のライブラリを得る工程と（工程Ｂの後に工程Ｃを行ってもよいし、工程Ｃの後に工程Ｂを行ってもよい；また、工程BまたはCの後で核酸分子を増幅して目的核酸分子の増幅産物を得ることができる）、
（Ｄ）上記（Ｂ）と（Ｃ）の後に得られた核酸分子の増幅産物を含む混合物中で（サンプルを混合するのは工程（Ｃ）の後であり、サンプルを混合した後に工程（Ｂ）を行っても良く、工程（Ｂ）を行った後に全サンプルを混合してもよい。また、分子バーコードが連結した核酸分子の増幅産物を得るのは工程（Ｂ）の後であり、増幅産物を得る前にサンプルを混合してもよく、増幅産物を得た後に当該増幅産物を含むサンプルを混合してもよい）、サンプル毎に固有のインデックス及び各目的核酸分子に固有のまたは任意の分子バーコードが付加された核酸分子をシークエンシングして、１核酸分子毎にインデックス部分の配列と分子バーコード部分の配列と必要に応じてそれに連結した目的核酸分子部分の配列を決定する工程と、
（Ｅ）得られた配列情報から、｛例えば、配列同一性または類似性に基づいて行うことができるが｝特定のインデックスを有する配列若しくはこれと類似する配列、特定の分子バーコードを有する配列若しくはこれと類似する配列、または目的核酸分子を含む配列若しくはこれと類似する配列を選択し、選択された配列により群を作成する工程と、
（Ｆ）上記（Ｅ）で作成された群において、検出頻度の最も高いインデックスと分子バーコードのペアをインデックスと分子バーコードの正しいペアと決定する、および／または、検出頻度の低いインデックスと分子バーコードのペア（例えば、一定の基準値より低い検出頻度のペアであって、一定の基準値とは群において99.5％以下、99％以下、90%以下、80％以下、70％以下、60％以下、50%以下、40%以下、30%以下、20%以下、10%以下、5%以下、1%以下の値が挙げられるがこれらに限定されない。また、例えば２番目以降の検出頻度のペアであってもよい。）の少なくともいずれか１つまたは全てをインデックスと分子バーコードのミスペアと決定する工程と、
を含む、方法。
（２Ｃ）工程（Ｅ）において、特定のインデックスを有する配列を選択してインデックス毎に群を作成し、
工程（Ｆ）において、複数の群に出現した分子バーコードを有する核酸配列に関して、最もリード数が多いバーコードとインデックスのペアを、バーコードとインデックスの正しいペアと決定する、または、検出頻度の最も高いインデックスと分子バーコードのペアをインデックスと分子バーコードの正しいペアと決定する、
上記（１Ｃ）に記載の方法。
（３Ｃ）工程（Ｅ）において、特定の分子バーコードを有する配列を選択して分子バーコード毎に群を作成し、
工程（Ｆ）において、作成された群のうち検出頻度の最も高いインデックスと分子バーコードのペアをインデックスと分子バーコードの正しいペアと決定する、
上記（１Ｃ）に記載の方法。
（４Ｃ）工程（Ｅ）において、目的核酸分子の配列を含む配列を選択して群を作成し、
工程（Ｆ）において、さらに当該群から特定のインデックスを有する配列を選択してサブグループを作成し、複数のサブグループに出現した分子バーコードを有する核酸配列に関して、最もリード数が多いバーコードとインデックスのペアを、バーコードとインデックスの正しいペアと決定する、または、検出頻度の最も高いインデックスと分子バーコードのペアをインデックスと分子バーコードの正しいペアと決定する、
上記（１Ｃ）に記載の方法。
（５Ｃ）工程（Ｅ）において、目的核酸分子の配列を含む配列を選択して群を作成し、
工程（Ｆ）において、さらに当該群から特定の分子バーコードを有する分子を選択してサブグループを作成し、作成された一つのサブグループにおいて検出頻度の最も高いインデックスと分子バーコードのペアをインデックスと分子バーコードの正しいペアと決定する、
上記（１Ｃ）に記載の方法。
（６Ｃ）工程（Ｆ）において、決定された正しいペア以外のインデックスと分子バーコードのペアの少なくともいずれか１つまたは全てを、インデックスと分子バーコードのミスペアと決定する、
上記（２Ｃ）〜（５Ｃ）のいずれかに記載の方法。
（７Ｃ）工程（Ｅ）において特定のインデックスを有する分子を選択してインデックス毎に群を作成し、
工程（Ｆ）において、複数の群に出現した分子バーコードを有する配列に関して、検出頻度の低いインデックスと分子バーコードのペア（例えば、一定の基準値より低い検出頻度のペアであって、一定の基準値とは群において50%以下、40%以下、30%以下、20%以下、10%以下、5%以下、1%以下の値が挙げられるがこれらに限定されない。また、例えば２番目以降の検出頻度のペアであってもよい。）をインデックスの少なくともいずれか１つまたは全てと分子バーコードのミスペアと決定する、
上記（１Ｃ）に記載の方法。
（８Ｃ）工程（Ｅ）において特定の分子バーコードを有する配列を選択して分子バーコード毎に群を作成し、
工程（Ｆ）において作成された群のうち検出頻度の低いインデックスと分子バーコードのペア（例えば、一定の基準値より低い検出頻度のペアであって、一定の基準値とは群において50%以下、40%以下、30%以下、20%以下、10%以下、5%以下、1%以下の値が挙げられるがこれらに限定されない。また、例えば２番目以降の検出頻度のペアであってもよい。）をインデックスと分子バーコードの少なくともいずれか１つまたは全てのミスペアと決定する、
上記（１Ｃ）に記載の方法。
（９Ｃ）工程（Ｅ）において目的核酸分子を含む配列を選択して群を作成し、
工程（Ｆ）においてさらに当該群から特定のインデックスを有する分子を選択してサブグループを作成し、複数のサブグループに出現した分子バーコードを有する核酸分子に関して、検出頻度の低いインデックスと分子バーコードのペア（例えば、一定の基準値より低い検出頻度のペアであって、一定の基準値とは群において50%以下、40%以下、30%以下、20%以下、10%以下、5%以下、1%以下の値が挙げられるがこれらに限定されない。また、例えば２番目以降の検出頻度のペアであってもよい。）の少なくともいずれか１つまたは全てをインデックスと分子バーコードのミスペアと決定する、
上記（１Ｃ）に記載の方法。
（１０Ｃ）工程（Ｅ）において目的核酸分子を含む分子を選択して群を作成し、
工程（Ｆ）においてさらに当該群から特定の分子バーコードを有する分子を選択してサブグループを作成し、作成された一つのサブグループにおいて検出頻度の低いインデックスと分子バーコードのペア（例えば、一定の基準値より低い検出頻度のペアであって、一定の基準値とは群において50%以下、40%以下、30%以下、20%以下、10%以下、5%以下、1%以下の値が挙げられるがこれらに限定されない。また、例えば２番目以降の検出頻度のペアであってもよい。）の少なくともいずれか１つまたは全てをインデックスと分子バーコードのミスペアと決定する、
上記（１Ｃ）に記載の方法。
（１１Ｃ）工程（Ｅ）において、群を作成する工程が、｛好ましくは、分子バーコード部分の配列において｝配列同一性または類似性に基づいて判断される同一配列を有していた｛例えば、工程（Ａ）〜（Ｄ）の工程のいずれかによって配列が変化することがある｝と推定される分子を一群としてクラスタリングすることによって群を作成することによって行われる、
上記（１Ｃ）〜（１０Ｃ）に記載の方法。
（１２Ｃ）工程（Ｅ）において、クラスタリングが、
（i）分子バーコード部分の配列において、固有の分子バーコードの配列と同一の配列を有する核酸分子群｛すなわち、Distance = 0｝を同じクラスターに分類することにより行われる；
（ii）分子バーコード部分の配列において、固有の分子バーコードの配列と１ベースまでのミスマッチを有する配列を有する核酸分子群｛すなわち、Distance = 1｝を同じクラスターに分類することにより行われる；
（iii）分子バーコード部分の配列において、固有の分子バーコードの配列と２ベースまでのミスマッチを有する配列を有する核酸分子群｛すなわち、Distance = 2｝を同じクラスターに分類することにより行われる；または
（iv）分子バーコード部分の配列において、固有の分子バーコードの配列と３ベースまでのミスマッチを有する配列を有する核酸分子群｛すなわち、Distance = 3｝を同じクラスターに分類することにより行われる、
上記（１１Ｃ）に記載の方法。
（１３Ｃ）工程（Ｅ）において、クラスタリングが、
分子バーコード部分の配列において、塩基（例えば、１ベースまで、２ベースまで、または３ベースまで）の挿入または欠失（indel）を有するとしてシークエンスされた配列を有する核酸分子群を同じクラスターに分類することにより行われる、
上記（１１Ｃ）または（１２Ｃ）に記載の方法。
（１４Ｃ）工程（Ｅ）において、クラスタリングが、
分子バーコード部分の配列において、塩基（例えば、１ベースまで、２ベースまで、または３ベースまで）の挿入または欠失（indel）を有するとしてシークエンスされた配列を除外して得られた核酸分子群に対して行われる、
上記（１１Ｃ）または（１２Ｃ）に記載の方法。
（１５Ｃ）前記塩基の挿入または欠失が、核酸分子に連結する全ての分子バーコード配列中に配置された１以上（例えば、１つ、２つ、３つ、４つ、５つ、または６つ以上）の固定塩基それぞれの位置と、配列解読された分子バーコード配列部分の配列における１以上の固定塩基それぞれの位置との、位置の相違により特定することをさらに含む、請求項１３または１４に記載の方法｛例えば、それぞれの固定塩基は、Ａ、Ｔ、ＧおよびＣからなる群から選択されるいずれか１つの塩基となるように設計され得る；または、ＡとＴの組合せ、ＡとＧの組合せ、ＡとＣの組合せ、ＴとＧの組合せ、ＴとＣの組合せ、ＧとＣの組合せ、ＡとＴとＧとの組合せ、ＡとＴとＣとの組合せ、ＡとＧとＣとの組合せ、およびＴとＧとＣとの組合せからなる群から選択されるいずれか１つの組合せに含まれる塩基からランダムに選択される塩基となるように設計され得る｝。
（１６Ｃ）複数の核酸分子を含むサンプル毎に固有のインデックス（インデックス配列核酸分子）及び各核酸分子に固有のまたは任意の分子バーコード（バーコード配列核酸分子）が付加された目的核酸分子（例えば、ＤＮＡまたはＲＮＡ）を含む複数のサンプルの混合物を用いたシークエンシング（すなわち、マルチプレックスシークエンシング）より得られた配列情報から、特定の元々（original）のサンプルに含まれる目的核酸分子の数を決定する方法であって、
（ａ）核酸分子（例えば、ＤＮＡまたはＲＮＡ）を含む複数のサンプルを別々に取得する工程と｛サンプルの少なくとも１つには目的核酸分子が含まれる｝、
（ｂ）サンプルに含まれる核酸分子を増幅する前に、得られた複数のサンプルそれぞれにおいて、目的核酸分子それぞれに任意の分子バーコードを連結して、それぞれ異なる分子バーコードが連結した目的核酸分子を得る工程と、
（ｃ）複数のサンプルを混合する前に、複数の目的核酸分子を含むサンプル毎に固有のインデックスを目的核酸分子に付加し、由来するサンプル毎に異なるインデックスが連結した目的核酸分子のライブラリを得る工程と（工程Ｂと工程Ｃの順序はどちらが先でもよい；また、工程BまたはCの後で核酸分子を増幅して目的核酸分子の増幅産物を得ることができる）、
（ｄ）上記（Ｂ）と（Ｃ）の後得られた核酸分子の増幅産物を含む混合物中で（サンプルを混合するのは工程（Ｃ）の後であり、サンプルを混合した後に工程（Ｂ）を行っても良く、工程（Ｂ）を行った後に全サンプルを混合してもよい。また、分子バーコードが連結した核酸分子の増幅産物を得るのは工程（Ｂ）の後であり、増幅産物を得る前にサンプルを混合してもよく、増幅産物を得た後に当該増幅産物を含むサンプルを混合してもよい）、サンプル毎に固有のインデックス及び各目的核酸分子に固有のまたは任意の分子バーコードが付加された核酸分子をシークエンシングして、１核酸分子毎にインデックス部分の配列と分子バーコード部分の配列とそれに連結した核酸分子部分の配列を同定する工程と、
（ｅ）得られた配列情報から、目的核酸分子の配列を含む核酸分子を選択することと、
（ｆ）上記（ｅ）で選択された核酸分子を固有の分子バーコードの配列毎にクラスタリングし、その後、インデックス核酸分子部分において複数の配列を有するクラスターを特定することと、
（ｇ）上記（ｆ）において特定されたクラスターそれぞれにおいて、検出頻度の最も高いインデックスと分子バーコードのペアを正しくインデックスされた目的核酸分子として特定し、それ以外のインデックスと分子バーコードのペアをミスペアであると決定することと、
を含み｛ここで、ミスペアにおいてインデックスが誤っていると決定することをさらに含んでいてもよい｝、
正しくインデックスされた目的核酸分子に連結した固有の分子バーコードの配列の種類の数（または、正しくインデックスされた目的核酸分子のクラスターの数）が、当該インデックスに対応するサンプルに含まれる目的核酸分子の数である、
方法。
（１７Ｃ）前記（ｆ）において、クラスタリングが、
（i）分子バーコード部分の配列において、固有の分子バーコードの配列と同一の配列を有する核酸分子群を同じクラスターに分類することにより行われる；
（ii）分子バーコード部分の配列において、固有の分子バーコードの配列と１ベースまでのミスマッチを有する配列を有する核酸分子群を同じクラスターに分類することにより行われる；
（iii）分子バーコード部分の配列において、固有の分子バーコードの配列と２ベースまでのミスマッチを有する配列を有する核酸分子群を同じクラスターに分類することにより行われる；または
（iv）分子バーコード部分の配列において、固有の分子バーコードの配列と３ベースまでのミスマッチを有する配列を有する核酸分子群を同じクラスターに分類することにより行われる、
上記（１６Ｃ）に記載の方法。
（１８Ｃ）前記（ｅ）において、クラスタリングが、
分子バーコード部分の配列において、塩基（例えば、１ベースまで、２ベースまで、または、３ベースまで）の挿入または欠失（indel）を有するとしてシークエンスされた配列を有する核酸分子群を同じクラスターに分類することにより行われる、
上記（１６Ｃ）または（１７Ｃ）に記載の方法。
（１９Ｃ）前記（ｅ）において、クラスタリングが、
分子バーコード部分の配列において、塩基（例えば、１ベースまで、２ベースまで、または、３ベースまで）の挿入または欠失（indel）を有するとしてシークエンスされた配列を除外して得られた核酸分子群に対して行われる、
上記（１６Ｃ）または（１７Ｃ）に記載の方法。
（２０Ｃ）前記塩基の挿入または欠失が、核酸分子に連結する全ての分子バーコード配列中に配置された１以上（例えば、１つ、２つ、３つ、４つ、５つ、または６つ以上）の固定塩基それぞれの位置と、配列解読された分子バーコード配列部分の配列における１以上の固定塩基それぞれの位置との、相違により特定することをさらに含む、請求項１８または１９に記載の方法｛例えば、それぞれの固定塩基は、Ａ、Ｔ、ＧおよびＣからなる群から選択されるいずれか１つの塩基となるように設計され得る；または、ＡとＴの組合せ、ＡとＧの組合せ、ＡとＣの組合せ、ＴとＧの組合せ、ＴとＣの組合せ、ＧとＣの組合せ、ＡとＴとＧとの組合せ、ＡとＴとＣとの組合せ、ＡとＧとＣとの組合せ、およびＴとＧとＣとの組合せからなる群から選択されるいずれか１つの組合せに含まれる塩基からランダムに選択される塩基となるように設計され得る｝。

According to the present invention, the following inventions are also provided.
(1C) A method for determining correct pairs or mispairs of an index and a molecular barcode added to target nucleic acid molecules, from sequence information obtained by sequencing (i.e., multiplex sequencing) a mixture of a plurality of samples containing target nucleic acid molecules (for example, DNA or RNA) to which have been added an index unique to each sample containing a plurality of nucleic acid molecules (meaning an index sequence nucleic acid molecule; a plurality of types of index nucleic acid molecules may be included as long as they are unique to each sample) and a molecular barcode (barcode sequence nucleic acid molecule) that is unique to each nucleic acid molecule or arbitrary, the method comprising:
(A) a step of separately obtaining a plurality of samples containing nucleic acid molecules (for example, DNA or RNA) {at least one of the samples contains a target nucleic acid molecule};
(B) a step of {for example, in each of the obtained plurality of samples} linking, before amplifying the nucleic acid molecules contained in the sample, a molecular barcode that is unique to each nucleic acid molecule or arbitrary to each target nucleic acid molecule, to obtain target nucleic acid molecules each linked to a different molecular barcode;
(C) a step of {for example, before mixing the plurality of samples} adding to the target nucleic acid molecules an index unique to each sample containing a plurality of target nucleic acid molecules, to obtain a library of target nucleic acid molecules linked to an index that differs for each sample of origin (step C may be performed after step B, or step B may be performed after step C; in addition, the nucleic acid molecules may be amplified after step B or C to obtain amplification products of the target nucleic acid molecules);
(D) a step of sequencing, in a mixture containing the amplification products of the nucleic acid molecules obtained after (B) and (C) above (the samples are mixed after step (C); step (B) may be performed after the samples are mixed, or all the samples may be mixed after step (B) is performed. In addition, the amplification products of the nucleic acid molecules linked to molecular barcodes are obtained after step (B); the samples may be mixed before the amplification products are obtained, or the samples containing the amplification products may be mixed after the amplification products are obtained), the nucleic acid molecules to which the index unique to each sample and the molecular barcode that is unique to each target nucleic acid molecule or arbitrary have been added, to determine, for each nucleic acid molecule, the sequence of the index portion, the sequence of the molecular barcode portion and, if necessary, the sequence of the target nucleic acid molecule portion linked thereto;
(E) a step of selecting, from the obtained sequence information {for example, this can be done based on sequence identity or similarity}, sequences having a specific index or sequences similar thereto, sequences having a specific molecular barcode or sequences similar thereto, or sequences containing the target nucleic acid molecule or sequences similar thereto, and creating a group from the selected sequences; and
(F) a step of, in the group created in (E) above, determining the index and molecular barcode pair with the highest detection frequency to be the correct pair of index and molecular barcode, and/or determining at least one or all of the index and molecular barcode pairs with a low detection frequency (for example, pairs with a detection frequency lower than a certain reference value, where examples of the reference value include, but are not limited to, 99.5% or less, 99% or less, 90% or less, 80% or less, 70% or less, 60% or less, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, and 1% or less in the group; the pairs may also be, for example, pairs ranked second or lower in detection frequency) to be mispairs of index and molecular barcode.
(2C) In step (E), sequences having a specific index are selected to create a group for each index,
In step (F), for nucleic acid sequences having a molecular barcode that appears in a plurality of groups, the barcode and index pair with the largest number of reads is determined to be the correct pair of barcode and index, or the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode,
The method according to (1C) above.
(3C) In step (E), a sequence having a specific molecular barcode is selected to create a group for each molecular barcode,
In the step (F), the pair of the index and the molecular barcode having the highest detection frequency among the created groups is determined as the correct pair of the index and the molecular barcode.
The method according to (1C) above.
(4C) In step (E), a sequence containing the sequence of the target nucleic acid molecule is selected to form a group,
In step (F), sequences having a specific index are further selected from the group to create subgroups, and for nucleic acid sequences having a molecular barcode that appears in a plurality of subgroups, the barcode and index pair with the largest number of reads is determined to be the correct pair of barcode and index, or the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode,
The method according to (1C) above.
(5C) In step (E), a sequence containing the sequence of the target nucleic acid molecule is selected to form a group,
In step (F), molecules having a specific molecular barcode are further selected from the group to create subgroups, and in one created subgroup the index and molecular barcode pair with the highest detection frequency is determined to be the correct pair of index and molecular barcode,
The method according to (1C) above.
(6C) In step (F), at least one or all of the index and molecular barcode pairs other than the determined correct pair are determined to be mispairs of index and molecular barcode,
The method according to any one of (2C) to (5C) above.
(7C) In step (E), a molecule having a specific index is selected to create a group for each index,
In step (F), for sequences having a molecular barcode that appears in a plurality of groups, at least one or all of the index and molecular barcode pairs with a low detection frequency (for example, pairs with a detection frequency lower than a certain reference value, where examples of the reference value include, but are not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, and 1% or less in the group; the pairs may also be, for example, pairs ranked second or lower in detection frequency) are determined to be mispairs of index and molecular barcode,
The method according to (1C) above.
(8C) In step (E), a sequence having a specific molecular barcode is selected to create a group for each molecular barcode,
In step (F), among the created groups, at least one or all of the index and molecular barcode pairs with a low detection frequency (for example, pairs with a detection frequency lower than a certain reference value, where examples of the reference value include, but are not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, and 1% or less in the group; the pairs may also be, for example, pairs ranked second or lower in detection frequency) are determined to be mispairs of index and molecular barcode,
The method according to (1C) above.
(9C) In step (E), a sequence containing the target nucleic acid molecule is selected to form a group,
In step (F), molecules having a specific index are further selected from the group to create subgroups, and for nucleic acid molecules having a molecular barcode that appears in a plurality of subgroups, at least one or all of the index and molecular barcode pairs with a low detection frequency (for example, pairs with a detection frequency lower than a certain reference value, where examples of the reference value include, but are not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, and 1% or less in the group; the pairs may also be, for example, pairs ranked second or lower in detection frequency) are determined to be mispairs of index and molecular barcode,
The method according to (1C) above.
(10C) In step (E), molecules containing the target nucleic acid molecule are selected to form a group,
In step (F), molecules having a specific molecular barcode are further selected from the group to create subgroups, and in one created subgroup at least one or all of the index and molecular barcode pairs with a low detection frequency (for example, pairs with a detection frequency lower than a certain reference value, where examples of the reference value include, but are not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, and 1% or less in the group; the pairs may also be, for example, pairs ranked second or lower in detection frequency) are determined to be mispairs of index and molecular barcode,
The method according to (1C) above.
(11C) In step (E), the step of creating a group is performed by clustering, as one group, molecules presumed to have had the same sequence {preferably in the sequence of the molecular barcode portion}, as judged on the basis of sequence identity or similarity {for example, a sequence may be altered by any of steps (A) to (D)},
The method according to any one of (1C) to (10C) above.
(12C) In step (E), the clustering is performed
(i) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence identical to the sequence of a unique molecular barcode {i.e., Distance = 0};
(ii) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 1 base mismatch relative to the sequence of a unique molecular barcode {i.e., Distance = 1};
(iii) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 2 base mismatches relative to the sequence of a unique molecular barcode {i.e., Distance = 2}; or
(iv) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 3 base mismatches relative to the sequence of a unique molecular barcode {i.e., Distance = 3},
The method according to (11C) above.
(13C) In step (E), the clustering is performed by classifying into the same cluster a group of nucleic acid molecules having a sequence that was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases) in the sequence of the molecular barcode portion,
The method according to (11C) or (12C) above.
(14C) In step (E), the clustering is performed on a group of nucleic acid molecules obtained by excluding sequences that were sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases) in the sequence of the molecular barcode portion,
The method according to (11C) or (12C) above.
(15C) The method according to (13C) or (14C) above, further comprising identifying the insertion or deletion of the base from a difference between the position of each of one or more (for example, 1, 2, 3, 4, 5, or 6 or more) fixed bases arranged in all the molecular barcode sequences linked to the nucleic acid molecules and the position of each of the one or more fixed bases in the sequence of the sequenced molecular barcode portion {for example, each fixed base can be designed to be any one base selected from the group consisting of A, T, G and C; or can be designed to be a base selected at random from the bases included in any one combination selected from the group consisting of a combination of A and T, a combination of A and G, a combination of A and C, a combination of T and G, a combination of T and C, a combination of G and C, a combination of A, T and G, a combination of A, T and C, a combination of A, G and C, and a combination of T, G and C}.
(16C) A method for determining the number of target nucleic acid molecules contained in a specific original sample, from sequence information obtained by sequencing (i.e., multiplex sequencing) a mixture of a plurality of samples containing target nucleic acid molecules (for example, DNA or RNA) to which have been added an index unique to each sample containing a plurality of nucleic acid molecules (index sequence nucleic acid molecule) and a molecular barcode (barcode sequence nucleic acid molecule) that is unique to each nucleic acid molecule or arbitrary, the method comprising:
(a) a step of separately obtaining a plurality of samples containing nucleic acid molecules (for example, DNA or RNA) {at least one of the samples contains a target nucleic acid molecule};
(b) a step of linking, before amplifying the nucleic acid molecules contained in the samples, an arbitrary molecular barcode to each target nucleic acid molecule in each of the obtained plurality of samples, to obtain target nucleic acid molecules each linked to a different molecular barcode;
(c) a step of adding, before mixing the plurality of samples, an index unique to each sample containing a plurality of target nucleic acid molecules to the target nucleic acid molecules, to obtain a library of target nucleic acid molecules linked to an index that differs for each sample of origin (steps (b) and (c) may be performed in either order; in addition, the nucleic acid molecules may be amplified after step (b) or (c) to obtain amplification products of the target nucleic acid molecules);
(d) a step of sequencing, in a mixture containing the amplification products of the nucleic acid molecules obtained after (b) and (c) above (the samples are mixed after step (c); step (b) may be performed after the samples are mixed, or all the samples may be mixed after step (b) is performed. In addition, the amplification products of the nucleic acid molecules linked to molecular barcodes are obtained after step (b); the samples may be mixed before the amplification products are obtained, or the samples containing the amplification products may be mixed after the amplification products are obtained), the nucleic acid molecules to which the index unique to each sample and the molecular barcode that is unique to each target nucleic acid molecule or arbitrary have been added, to identify, for each nucleic acid molecule, the sequence of the index portion, the sequence of the molecular barcode portion and the sequence of the nucleic acid molecule portion linked thereto;
(e) selecting, from the obtained sequence information, nucleic acid molecules containing the sequence of the target nucleic acid molecule;
(f) clustering the nucleic acid molecules selected in (e) above by unique molecular barcode sequence, and then identifying clusters that have a plurality of sequences in the index nucleic acid molecule portion; and
(g) in each of the clusters identified in (f) above, identifying the index and molecular barcode pair with the highest detection frequency as a correctly indexed target nucleic acid molecule, and determining the other index and molecular barcode pairs to be mispairs {this may further include determining that the index in a mispair is incorrect},
wherein the number of types of unique molecular barcode sequences linked to correctly indexed target nucleic acid molecules (or the number of clusters of correctly indexed target nucleic acid molecules) is the number of target nucleic acid molecules contained in the sample corresponding to the index.
(17C) In (f) above, the clustering is performed
(i) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence identical to the sequence of a unique molecular barcode;
(ii) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 1 base mismatch relative to the sequence of a unique molecular barcode;
(iii) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 2 base mismatches relative to the sequence of a unique molecular barcode; or
(iv) by classifying into the same cluster a group of nucleic acid molecules having, in the molecular barcode portion, a sequence with up to 3 base mismatches relative to the sequence of a unique molecular barcode,
The method according to (16C) above.
(18C) In (e) above, the clustering is performed by classifying into the same cluster a group of nucleic acid molecules having a sequence that was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases) in the sequence of the molecular barcode portion,
The method according to (16C) or (17C) above.
(19C) In (e) above, the clustering is performed on a group of nucleic acid molecules obtained by excluding sequences that were sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases) in the sequence of the molecular barcode portion,
The method according to (16C) or (17C) above.
(20C) The method according to (18C) or (19C) above, further comprising identifying the insertion or deletion of the base from a difference between the position of each of one or more (for example, 1, 2, 3, 4, 5, or 6 or more) fixed bases arranged in all the molecular barcode sequences linked to the nucleic acid molecules and the position of each of the one or more fixed bases in the sequence of the sequenced molecular barcode portion {for example, each fixed base can be designed to be any one base selected from the group consisting of A, T, G and C; or can be designed to be a base selected at random from the bases included in any one combination selected from the group consisting of a combination of A and T, a combination of A and G, a combination of A and C, a combination of T and G, a combination of T and C, a combination of G and C, a combination of A, T and G, a combination of A, T and C, a combination of A, G and C, and a combination of T, G and C}.
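The analysis steps (e) to (g) of (16C), combined with the mismatch-tolerant clustering of (12C) and (17C), can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in the examples: reads are assumed to be already reduced to hypothetical `(index, barcode)` tuples for one target sequence, the greedy clustering around the most frequently read barcode is just one possible way to group barcodes within a given Distance, and ties in read count are broken alphabetically purely to make the sketch deterministic. The names `hamming`, `cluster_barcodes` and `count_molecules` are introduced here for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(barcode_counts: Dict[str, int], max_distance: int) -> Dict[str, str]:
    """Greedily assign each barcode to a cluster centre within `max_distance`.

    Barcodes are visited from most to least frequently read, so an abundant
    barcode becomes the centre and rarer near-identical barcodes join it.
    """
    centres: List[str] = []
    assignment: Dict[str, str] = {}
    for bc, _ in sorted(barcode_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centre in centres:
            if len(centre) == len(bc) and hamming(centre, bc) <= max_distance:
                assignment[bc] = centre
                break
        else:
            centres.append(bc)
            assignment[bc] = bc
    return assignment


def count_molecules(reads: List[Tuple[str, str]], max_distance: int = 1):
    """Return (molecules per index, list of (index, cluster centre) mispairs)."""
    barcode_counts = Counter(bc for _, bc in reads)
    assignment = cluster_barcodes(barcode_counts, max_distance)

    # Count reads per index inside each barcode cluster.
    per_cluster: Dict[str, Counter] = defaultdict(Counter)
    for index, bc in reads:
        per_cluster[assignment[bc]][index] += 1

    molecules: Counter = Counter()
    mispairs: List[Tuple[str, str]] = []
    for centre, index_counts in per_cluster.items():
        ranked = sorted(index_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        molecules[ranked[0][0]] += 1  # most frequent index: correct pair
        mispairs.extend((index, centre) for index, _ in ranked[1:])  # the rest: mispairs
    return dict(molecules), sorted(mispairs)
```

For example, if barcode `ACGTACGT` is read five times with index `A` and once with index `B`, the pair with `A` is taken as correct and the pair with `B` is reported as a mispair, so the molecule is counted once, for the sample carrying index `A`. A frequency threshold such as the reference values listed in (1C) could be used in place of the simple "most frequent wins" rule.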

図１は、核酸分子のデジタル定量法とその有効性について説明する図である。（Ａ）図１のパネルＡでは、デジタルカウントのスキームが示されている。各々の目的核酸分子に分子バーコードを固有に付加する（固有の分子バーコードを付加する）。増幅後、目的核酸部分とバーコード部分の両方をシークエンスする。コピー数は、リード数ではなく、固有のバーコードの数によって決定される。点線の枠は、本実施例において用いた実験デザインを示す。（Ｂ）図１のパネルＢは、正確なデジタルカウントのための第１の要件：それぞれの目的核酸分子は、異なるバーコードによって標識されなければならないことを説明する図である。バーコード配列の多様性を増加させるにつれて、測定される固有のバーコードの数が一定になる場合、その範囲のバーコード配列の多様性は、第１の要件を満たす。（Ｃ）図１のパネルＣでは、正確なデジタルカウントのための第２の要件：目的核酸分子に付加された全てのバーコード配列（少なくとも１つのリード）が検出されなければならないことを説明する図である。シークエンス深度を増加させるにつれて、測定される固有のバーコードの数が一定になる場合、その範囲のシークエンス深度は、第２の要件を満たす。FIG. 1 is a diagram for explaining a digital quantification method for nucleic acid molecules and its effectiveness. (A) Panel A of FIG. 1 shows a digital counting scheme. A unique molecular barcode is added to each target nucleic acid molecule (add unique molecular barcode). After amplification, both the nucleic acid portion of interest and the barcode portion are sequenced. The number of copies is determined by the number of unique barcodes, not the number of reads. The dotted frame shows the experimental design used in this example. (B) Panel B of FIG. 1 illustrates the first requirement for accurate digital counts: each nucleic acid molecule of interest must be labeled with a different barcode. If the number of unique barcodes measured becomes constant as the diversity of barcode sequences increases, then the diversity of barcode sequences in that range meets the first requirement. (C) Panel C of FIG. 1 illustrates the second requirement for accurate digital counting: all barcode sequences (at least one read) added to the nucleic acid molecule of interest must be detected. It is a figure. If the number of unique barcodes measured becomes constant as the sequence depth is increased, then the sequence depth in that range meets the second requirement. 
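The first requirement described for panel B of FIG. 1 (each target molecule must carry a different barcode) can be made concrete with a simple occupancy calculation. The sketch below is an idealized model, not a result from this document: it assumes that each molecule draws its barcode uniformly and independently from all 4^n sequences of n random bases, which real barcode synthesis need not satisfy, and the function name `expected_unique_barcodes` is introduced here for illustration. Under that assumption, N molecules are expected to carry D(1 - (1 - 1/D)^N) distinct barcodes, where D = 4^n.

```python
import math


def expected_unique_barcodes(n_molecules: int, n_random_bases: int) -> float:
    """Expected number of distinct barcodes among `n_molecules` molecules.

    Assumes uniform, independent barcode assignment from 4**n_random_bases
    possible sequences (an idealization; n_random_bases must be >= 1).
    """
    diversity = 4 ** n_random_bases
    # D * (1 - (1 - 1/D)**N), written with log1p/expm1 for numerical stability.
    return diversity * -math.expm1(n_molecules * math.log1p(-1.0 / diversity))


if __name__ == "__main__":
    # The ratio approaches 1 as the barcode gets longer, i.e. the curve plateaus.
    for n in (4, 6, 8, 10, 12):
        ratio = expected_unique_barcodes(1000, n) / 1000
        print(f"{n:2d} random bases: unique/input = {ratio:.4f}")
```

In this model the ratio of unique barcodes to input molecules rises toward 1 as the number of random bases increases, which mirrors the plateau that panel B uses as the criterion for sufficient barcode diversity.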
図２は、正確なデジタル定量のための２つの要件に適用されたランダム塩基バーコードを用いたデジタルカウントの観察された本来的な特徴を示す。（Ａ）図２のパネルＡでは、ランダム塩基の数（塩基長）に対する検出されるクラスター（グレーで示す固有のバーコード）の数の依存性を表す。ＳＴ１の結果が示されている。グレーの線は、固有のバーコードの数を示す。青の線は、クラスタリング（Distance=3）後の観察されたクラスターの数を示す。緑の線は、クラスタリング（Distance=3）および固定塩基マッチフィルタリング（固定塩基数＝６）後のクラスターの数を示す。（Ｂ）図２のパネルＢは、ランダム塩基の数に対するバーコードクラスターの数の依存性を示す。黄色の線は、クラスタリング（Distance=3）と固定塩基マッチフィルタリング（固定塩基数＝６）後のインデックスＡおよびインデックスＢを有するＳＴ１に関するクラスター数を示す。赤の点線は、クラスタリング（Distance=3）と固定塩基マッチフィルタリング（固定塩基数＝６）後のインデックスＡおよびインデックスＢを有する短い鋳型全てに関するバーコードクラスター数を示す。（Ｃ）図２のパネルＣは、シークエンスカバー率に対する検出される分子数の依存性を示す。グレーの線は、ランダム塩基の数に対する観察される固有のバーコード配列の数の依存性を表す。青の線は、クラスタリング（Distance=3）後の観察されたクラスターの数を示す。緑の線は、クラスタリング（Distance=3）および固定塩基マッチフィルタリング（固定塩基数＝６）後のクラスターの数を示す。黄色の線は、クラスタリング（Distance=3）、固定塩基マッチフィルタリング（固定塩基数＝６）、およびミスインデックス（例えば、混入インデックス）の除外後の、インデックスＡを有するＳＴ１に関するバーコードクラスターの数を示す。赤の点線は、クラスタリング（Distance=3）、固定塩基マッチフィルタリング（固定塩基数＝６）、ならびにミスインデックス（例えば、混入インデックス）および誤同定除外後の、インデックスＡを有するＳＴ１に関するバーコードクラスターの数を示す。エラーバーは、標準偏差を示す（ｎ＝８）。FIG. 2 shows the observed intrinsic features of digital counts with random base barcodes applied to two requirements for accurate digital quantification. (A) Panel A of FIG. 2 shows the dependence of the number of detected clusters (unique bar code shown in gray) on the number of random bases (base length). The results of ST1 are shown. Gray lines indicate the number of unique barcodes. The blue line indicates the number of observed clusters after clustering (Distance = 3). The green line indicates the number of clusters after clustering (Distance = 3) and fixed base match filtering (fixed base number = 6). (B) Panel B of FIG. 2 shows the dependence of the number of barcode clusters on the number of random bases. The yellow line indicates the number of clusters for ST1 having index A and index B after clustering (Distance = 3) and fixed base match filtering (fixed base number = 6). 
The red dotted line shows the number of barcode clusters for all short templates with index A and index B after clustering (Distance = 3) and fixed base match filtering (fixed base number = 6). (C) Panel C of FIG. 2 shows the dependence of the number of detected molecules on the sequence coverage. Gray lines represent the dependence of the number of unique barcode sequences observed on the number of random bases. The blue line indicates the number of observed clusters after clustering (Distance = 3). The green line indicates the number of clusters after clustering (Distance = 3) and fixed base match filtering (fixed base number = 6). The yellow line indicates the number of barcode clusters for ST1 with index A after clustering (Distance = 3), fixed base match filtering (fixed base number = 6), and exclusion of miss index (eg contaminated index). Show. The red dotted line indicates the bar code cluster of ST1 with index A after clustering (Distance = 3), fixed base match filtering (fixed base number = 6), and misindex (eg, contamination index) and misidentification exclusion. Indicates a number. Error bars indicate standard deviation (n = 8). 図３は、Distanceと固定塩基を用いた解析の結果を示す。図３では、インデックスＡ（丸で示される）とインデックスＢ（三角で示される）を有するＳＴ１の結果が示されている。ランダム塩基バーコードの長さは２４であった。（Ａ）図３のパネルＡは、クラスター数に対する異なるDistanceでのクラスタリングの影響を示す。（Ｂ）図３のパネルＢは、Distance=3での固定塩基の位置の依存性を示す。アスタリスクはフィルタリングなしを示す。１つの固定塩基のみを用いてフィルタリングがなされた。（Ｃ）図３のパネルＣで、Distance=3での固定塩基の数の依存性を示す。アスタリスクはフィルタリングなしを示す。シークエンスプライマー部位から最も遠い塩基を用いてフィルタリングがなされた。FIG. 3 shows the results of analysis using Distance and fixed base. In FIG. 3, the results of ST1 with index A (indicated by a circle) and index B (indicated by a triangle) are shown. The random base barcode length was 24. (A) Panel A of FIG. 3 shows the effect of clustering at different Distances on the number of clusters. (B) Panel B of FIG. 3 shows the dependence of the positions of fixed bases at Distance = 3. Asterisk indicates no filtering. Filtering was done using only one fixed base. 
(C) Panel C of FIG. 3 shows the dependence on the number of fixed bases at Distance = 3. The asterisk indicates no filtering. Filtering was done using the bases furthest from the sequence primer site. 図４は、各鋳型の絶対カウントを示す。（Ａ）図４のパネルＡは、インデックスＡに関して、カバー率の関数として決定されたクラスター数を示す。Distance=2、固定塩基数＝4、ランダム塩基数＝20。また、ミスインデックス（例えば、混入インデックス）の影響や誤同定の影響が除外された。各鋳型配列の開始コピー数を括弧内に示す。（Ｂ）図４のパネルＢは、ランダム塩基の数に対する検出される分子数の依存性を示す。Distance=2、固定塩基数＝4。全リードの10%をランダムにサンプリングした（短い鋳型に対するカバー率は13.4〜20.3であり、長い鋳型に対するカバー率は、12.6〜20.9であった）。（Ｃ）図４のパネルＣは、インプット（すなわち、ＰＣＲ増幅前の分子数、ｘ軸）とアウトプット（すなわち、デジタルカウントの結果、ｙ軸）との相関を示す。アウトプット数は、大きなシンボルで示される12.6〜20.9のカバー率で図４Ａと図１１から決定された。グレーの線は、対数目盛で傾き１を有する回帰直線を示す。丸および三角はそれぞれ、インデックスＡおよびインデックスＢに対応する。直線回帰のピアソンの積率相関係数ｒと決定係数Ｒ^２が示されている。エラーバーは、標準偏差を示す（ｎ＝８）。FIG. 4 shows the absolute counts for each template. (A) Panel A of FIG. 4 shows the number of clusters determined for index A as a function of coverage. Distance = 2, fixed base number = 4, random base number = 20. In addition, the effects of misindexing (e.g., contaminating indexes) and of misidentification were excluded. The starting copy number of each template sequence is shown in parentheses. (B) Panel B of FIG. 4 shows the dependence of the number of detected molecules on the number of random bases. Distance = 2, fixed base number = 4. 10% of all reads were randomly sampled (coverage for the short templates was 13.4 to 20.3, and coverage for the long templates was 12.6 to 20.9). (C) Panel C of FIG. 4 shows the correlation between the input (i.e., the number of molecules before PCR amplification, x-axis) and the output (i.e., the result of digital counting, y-axis). The output numbers were determined from FIG. 4A and FIG. 11 at the coverage of 12.6 to 20.9 indicated by large symbols. The gray line shows a regression line with a slope of 1 on a logarithmic scale. Circles and triangles correspond to index A and index B, respectively. The Pearson product-moment correlation coefficient r and the coefficient of determination R^2 of the linear regression are shown. 
Error bars indicate standard deviation (n = 8). 図５は、デジタルカウントのためのランダム塩基の必要な数を示す。（Ａ）図５のパネルＡでは、ｘ軸は、測定しようとする分子のインプット数を示し、ｙ軸は、図４のパネルＢおよび図５のパネルＢにおける各々の曲線が０．９５の相対的クラスター数に達するときのランダム塩基の数を示す。（Ｂ）図５のパネルＢは、ランダム塩基の数に対するバーコードクラスターの数の依存性を示す。クラスタリング（Distance=2）、固定塩基マッチフィルタリング（固定塩基数＝4）後のインデックスＡおよびＢを有する全ての鋳型、全リードの10%をランダムにサンプリング（例えば、カバー率は12.6〜20.9）。色は、図４のパネルＡにおけるプロットで表されるサンプルに対応する。エラーバーは、標準偏差を表す（ｎ＝８）。FIG. 5 shows the required number of random bases for digital counting. (A) In panel A of FIG. 5, the x-axis shows the number of inputs of the molecule to be measured, and the y-axis shows relative values of 0.95 for each curve in panel B of FIG. 4 and panel B of FIG. The number of random bases when reaching the number of target clusters is shown. (B) Panel B of FIG. 5 shows the dependence of the number of barcode clusters on the number of random bases. Clustering (Distance = 2), all templates having indexes A and B after fixed base match filtering (fixed base number = 4), 10% of all reads were randomly sampled (for example, the coverage rate is 12.6 to 20.9). The color corresponds to the sample represented by the plot in panel A of FIG. Error bars represent standard deviation (n = 8). 図６Ａは、増幅、インデックス付加、混合およびシークエンスを含む従来の方法を示す。図６Ａでは、バーコード配列が用いられず、増幅された配列にサンプル毎に固有のインデックスが付加され、混合してシークエンスされる。インデックスを増幅の前に付加してもよい。FIG. 6A shows a conventional method involving amplification, indexing, mixing and sequencing. In FIG. 6A, the barcode sequence is not used and a unique index is added to the amplified sequence for each sample, mixed and sequenced. The index may be added before amplification. 図６Ｂは、増幅、インデックス付加、混合およびシークエンスを含む従来の方法を示す。ここで、従来法では、ミスインデックス付加が生じ得るが、生じたミスインデックスを同定できない。FIG. 6B shows a conventional method involving amplification, indexing, mixing and sequencing. Here, in the conventional method, misindex addition may occur, but the misindex that occurred cannot be identified. 図６Ｃは、分子バーコードの使用法を示す。配列１を有する目的核酸配列それぞれに対して固有のバーコード配列が標識され、各分子が固有に標識される。FIG. 6C illustrates the use of molecular barcodes. 
A unique barcode sequence is labeled for each nucleic acid sequence of interest having sequence 1, and each molecule is uniquely labeled. 図６Ｄは、分子バーコード付加、増幅、インデックス付加、混合およびシークエンスを含む分子バーコードの使用法を示す。核酸分子に固有のバーコード付加とサンプル毎に固有のインデックスが付加され、複数のサンプルを混合してシークエンスするスキームを示す。インデックスを、分子バーコードの付加の後、増幅の前に付加してもよい。FIG. 6D shows the use of molecular barcodes including molecular barcode addition, amplification, indexing, mixing and sequencing. A scheme for adding a unique barcode to a nucleic acid molecule and a unique index for each sample, and mixing and sequencing a plurality of samples is shown. The index may be added after the addition of the molecular barcode and before the amplification. 図６Ｅは、分子バーコード付加、増幅、インデックス付加、混合およびシークエンスを含む分子バーコードの使用法を示す。本発明の第１の実施形態におけるミスインデックスの同定方法の一例のスキームを示す。ミスインデックスが生じ得るが、本発明の第１の実施形態では、生じたミスインデックスを同定できる。同一分子からの増幅産物は同一インデックスを有しており、ミスインデックスされた分子の数は、正しくインデックス付加された分子の数よりも小さいと仮定する。本発明の第１の実施形態では、同一のバーコード配列に２種類のインデックスが付加されている場合には、リード数において最頻のインデックスを正しいインデックスと決定する。FIG. 6E shows the use of molecular barcodes including molecular barcode addition, amplification, indexing, mixing and sequencing. The scheme of an example of the identification method of the misindex in the 1st Embodiment of this invention is shown. Misindexes may occur, but in the first embodiment of the invention, the misindexes that occur can be identified. Amplification products from the same molecule have the same index, and it is assumed that the number of misindexed molecules is less than the number of correctly indexed molecules. In the first embodiment of the present invention, when two types of indexes are added to the same barcode array, the most frequent index in the number of reads is determined as the correct index. 図７Ａは、複数のサンプルに含まれる目的核酸分子に対してバーコードを付加するスキームを示す。FIG. 7A shows a scheme for adding a barcode to a nucleic acid molecule of interest contained in multiple samples. 図７Ｂは、インデックス付加および増幅のスキームを示し、他のインデックスがコンタミネーションしたインデックスにより部分的なスイッチングが生じた場合を示す。FIG. 
7B shows an indexing and amplification scheme, showing the case where partial switching occurs due to indexes contaminated with other indexes. 図７Ｃは、バーコード数のカウント、同一バーコードの確認、エラー（インデックスとバーコードとのミスペア）の同定を示す。同一バーコード配列に対して異なるインデックスＡおよびＢが付加されているときに、リード数（コピー数）の少ない分子をミスインデックスとして同定するスキームを示す。FIG. 7C shows counting of the number of barcodes, confirmation of the same barcode, and identification of an error (mispair between index and barcode). A scheme for identifying a molecule with a small read number (copy number) as a misindex when different indexes A and B are added to the same barcode sequence is shown. 図８は、図２の補足的図面であり、ランダム塩基を有するバーコードを用いたデジタルカウントの観察された本来的な特徴を示す。（Ａ）図８のパネルＡでは、ＳＴ１、ＳＴ２、ＬＴ１およびＬＴ２に関して、ランダム塩基の数（塩基長）に対する検出されるクラスターの数の依存性を表す。グレーの線は、ランダム塩基の数に対する観察される固有のバーコード配列の数の依存性を表す。青の線は、クラスタリング（Distance=3）後の観察されたクラスターの数を示す。緑の線は、クラスタリング（Distance=3）および固定塩基マッチフィルタリング（固定塩基数＝６）後のクラスターの数を示す。（Ｂ）図８のパネルＢは、ランダム塩基の数に対するバーコードクラスターの数の依存性を示す。黄色の線は、クラスタリング（Distance=3）、固定塩基マッチフィルタリング（ＳＴ２に対しては固定塩基数＝6、ＬＴ２に対しては固定塩基数＝12）、ミスインデックス（例えば、混入ミスインデックス）の除外後の、インデックスＡおよびＢを有するＳＴ２（上パネル）とＬＴ２（下パネル）についてのクラスター数を示す。濃黄色の線は、クラスタリング（Distance=3）、固定塩基マッチフィルタリング（固定塩基数＝１２）、ミスインデックス（例えば、混入インデックス）の除外後の、インデックスＡおよびインデックスＢを有するＬＴ１（下パネル）についてのクラスター数を示す。赤の点線は、クラスタリング（Distance=3）、固定塩基マッチフィルタリング（固定塩基数＝６）、ミスインデックス（例えば、混入インデックス）と誤同定の除外後の、インデックスＡおよびインデックスＢを有する全ての長い鋳型についてのバーコードクラスター数を示す。（Ｃ）図８のパネルＣは、ＳＴ１、ＳＴ２、ＬＴ１およびＬＴ２に関してのシークエンスカバー率に対する検出される分子数の依存性を示す。図８パネルＣでは、線の色は、濃黄色以外、上記図８のパネルＡおよびＢと同様である。エラーバーは、標準偏差を示す（ｎ＝８）。FIG. 8 is a supplementary drawing of FIG. 2 and shows the observed intrinsic features of digital counting using barcodes with random bases. (A) Panel A of FIG. 8 shows the dependence of the number of detected clusters on the number of random bases (base length) for ST1, ST2, LT1 and LT2. Gray lines represent the dependence of the number of unique barcode sequences observed on the number of random bases. The blue line indicates the number of observed clusters after clustering (Distance = 3). 
The green line indicates the number of clusters after clustering (Distance = 3) and fixed base match filtering (fixed base number = 6). (B) Panel B of FIG. 8 shows the dependence of the number of barcode clusters on the number of random bases. Yellow lines indicate clustering (Distance = 3), fixed base match filtering (fixed base number = 6 for ST2, fixed base number = 12 for LT2), and misindex (for example, misindex of contamination). The number of clusters for ST2 (top panel) and LT2 (bottom panel) with indices A and B after exclusion is shown. The dark yellow line is LT1 with index A and index B (lower panel) after clustering (Distance = 3), fixed base match filtering (fixed base number = 12), and exclusion of miss index (for example, contamination index). Shows the number of clusters for. Red dotted lines are all long lines with index A and index B after clustering (Distance = 3), fixed base match filtering (fixed base number = 6), miss index (eg contaminated index) and misidentification exclusion. The number of barcode clusters for the template is shown. (C) Panel C of FIG. 8 shows the dependence of the number of detected molecules on the sequence coverage for ST1, ST2, LT1 and LT2. In FIG. 8 panel C, the color of the lines is the same as in panels A and B of FIG. 8 above, except for dark yellow. Error bars indicate standard deviation (n = 8). 図９は、図３の補足的図面であり、ＳＴ２、ＬＴ１およびＬＴ２に関するDistanceと固定塩基を用いた解析結果を示す。（Ａ）図９のパネルＡは、図３のパネルＡと同じであるが、ＳＴ２、ＬＴ１およびＬＴ２に関する、様々なDistanceパラメータを用いたクラスタリングのクラスター数に対する影響を示す。（Ｂ）図９のパネルＢは、図３のパネルＢと同じであるが、ＳＴ２、ＬＴ１およびＬＴ２に関する固定塩基の位置の依存性を示す。アスタリスクはフィルタリングなしを示す。（Ｃ）図９のパネルＣは、図３のパネルＣと同じであるが、ＳＴ２、ＬＴ１およびＬＴ２に関する固定塩基の数の依存性を示す。FIG. 9 is a supplementary drawing of FIG. 3 and shows the analysis results of ST2, LT1 and LT2 using Distance and a fixed base. (A) Panel A of FIG. 9 is the same as panel A of FIG. 3, but shows the effect of clustering with various Distance parameters on the number of clusters for ST2, LT1 and LT2. (B) Panel B of FIG. 
9 is the same as panel B of FIG. 3, but shows the dependence of fixed base position on ST2, LT1 and LT2. Asterisk indicates no filtering. (C) Panel C of FIG. 9 is the same as panel C of FIG. 3, but shows the dependence of the number of fixed bases on ST2, LT1 and LT2. 図１０は、インデックスＡ（パネルＡ参照）およびインデックスＢ（パネルＢ参照）を有するＳＴ１に関する各クラスターにおけるリード数のヒストグラムを示す。色は、図２Ｃのプロットのサンプルの色と対応する。FIG. 10 shows a histogram of the number of reads in each cluster for ST1 with index A (see panel A) and index B (see panel B). The colors correspond to the sample colors in the plot of Figure 2C. 図１１は、図４の補足的図面であり、各鋳型の絶対カウントを示す。図４のパネルＡと同じであるが、インデックスＢ（三角で示される）に関して、カバー率に対する決定されたクラスター数を示す。エラーバーは、標準偏差を示す（ｎ＝８）。FIG. 11 is a supplementary drawing of FIG. 4, showing absolute counts for each mold. Same as panel A in FIG. 4, but for index B (indicated by triangles), the determined number of clusters versus coverage is shown. Error bars indicate standard deviation (n = 8). 図１２は、図５の補足的図面であり、デジタルカウントのためのランダム塩基の必要な数の推定を示す。本プロットは、ランダム塩基数３８（本実施例で最長）までの対数目盛における直線回帰（R²＝0.971）であることを除いては、図５Ａと同じである。この事例における上記回帰曲線の数式は、ｙ＝２．２＊ｌｏｇ（ｘ）＋５．５であった。FIG. 12 is a supplementary drawing of FIG. 5, showing an estimation of the required number of random bases for digital counting. This plot is the same as FIG. 5A except that it is a linear regression (R ² = 0.971) on a logarithmic scale up to 38 random bases (the longest in this example). The equation for the regression curve above in this case was y = 2.2 * log (x) +5.5. 図１３は、バーコードの設計と分子のインプット数を示す。ＬＴ１〜６の配列における大文字はＰＣＲ増幅プライマーの結合箇所である。バーコード（barcode）はランダム塩基および固定塩基を含むランダム領域を示し、標的（target）は目的核酸を示す。ＬＴ１〜６はＰＡＧＥ精製物であり、ＳＴ１〜５の５’末端はアミンで修飾されたものであった。ランダム塩基の間の固定塩基は、より低い増幅効率を有し得る長いホモポリマーバーコードの回避の助けとなる。Ｎは、Ａ、Ｔ、Ｇ、またはＣのいずれかであることを示す。FIG. 13 shows the barcode design and the number of molecular inputs. Uppercase letters in the sequences of LT1 to 6 are binding sites of PCR amplification primers. 
The barcode indicates a random region containing random bases and fixed bases, and the target indicates a target nucleic acid. LT1-6 were PAGE-purified products, and the 5'ends of ST1-5 were amine-modified. Fixed bases between random bases help avoid long homopolymer barcodes that may have lower amplification efficiencies. N indicates either A, T, G, or C. 図１４は、ライブラリーの調製のためのプライマー配列を示す。下線部は、インデックス配列を示す（インデックスＡは、Ｒｖｐｒｉｍｅｒ１、インデックスＢは、Ｒｖｐｒｉｍｅｒ２に含まれる）。全てのプライマーは、ＰＡＧＥ精製物であった。Figure 14 shows the primer sequences for library preparation. The underlined portion indicates the index array (the index A is included in Rv primer1, and the index B is included in Rv primer2). All primers were PAGE purified products. 図１５は、各プロセスにおけるリードの数を示す。＊この画分は、ミスインデックス（例えば、混入インデックス）の除外におけるリード数よりも大きくなり得る（実施例参照）。FIG. 15 shows the number of leads in each process. * This fraction can be larger than the number of reads in the exclusion of missed indexes (eg mixed indexes) (see examples).

Detailed description of the invention

As used herein, a "molecular barcode" is a tag having a unique sequence that is added to nucleic acid molecules on a molecule-by-molecule basis. It is also called a "primer ID" or a "unique molecular identifier (UMI)". When molecular barcodes having a unique sequence that differs for each molecule are added to nucleic acid molecules, the number of molecules of that nucleic acid contained in the sample before it was subjected to processing such as amplification can be determined digitally (or qualitatively) based on the number of kinds of barcodes added. This way of determining nucleic acid molecule numbers rapidly attracted attention with the development of next-generation sequencer platforms, which enable the analysis of a large number of nucleic acid sequences in a single run, and various methods have been developed for digitally determining the number of nucleic acid molecules using molecular barcodes. Because it digitally counts the number of molecules as the number of kinds of barcodes (sometimes referred to as "the number of unique barcodes"), this method is sometimes called the "digital counting method", the "digital quantification method", or the like. The digital counting method can digitally and accurately determine the absolute number of molecules in a sample even in the presence of noise and bias in the measurement system. The most widely used application of the digital counting method is RNA-Seq using molecular barcodes, i.e., digital RNA-Seq (dRNA-Seq) or quantitative RNA-Seq. Because dRNA-Seq works well even with very small amounts of sample, it is often used for single-cell gene expression analysis.
The digital counting method is also used in many applications on next-generation sequencer platforms, which can acquire large amounts of sequence data. In addition to RNA-Seq, such applications include, for example, individual-nucleotide resolution UV cross-linking and immunoprecipitation (iCLIP), antibody repertoire analysis, gene analysis of bacterial 16S rRNA, and chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode and single ligation (ChIP-nexus).
In this digital counting method, using a sufficiently large number of kinds of molecular barcodes relative to the total number of nucleic acid molecules present in the sample substantially limits the possibility that the same barcode is added to more than one nucleic acid molecule present in the original sample, which allows the number of kinds of barcode sequences to be matched to the number of nucleic acid molecules that were present in the sample. In this way, nucleic acid molecules present in a sample can be quantified by using molecular barcodes containing nucleotide sequences of sufficient diversity. Molecular barcodes can be obtained, for example, as a population of nucleic acids having random bases. Because what matters for determining the number of molecules is the number of kinds of barcode sequences, the barcodes may be synthesized with random sequences (so that the sequences are diverse and a human need not know their content). Alternatively, the molecular barcodes may be a population of nucleic acids of known sequence designed to provide sufficient diversity. In the present specification, a molecular barcode may be referred to simply as a barcode, and the sequence of a molecular barcode may be referred to as a barcode sequence. As used herein, the number of unique barcode sequences is a number representing the degree of diversity of the barcode sequences. When n different barcode sequences are detected, the number of unique barcode sequences is n (where n is a natural number). As used herein, the number of random bases means the base length of the random bases. As used herein, random bases means consecutive bases having a random sequence. Random bases may consist of two, three, or four kinds of bases.
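As a rough illustration of what "a sufficiently large number of kinds of molecular barcodes" implies, the following sketch estimates how many distinct barcodes are expected when a given number of molecules is labeled. It assumes an idealized model that is not stated in the specification: every molecule receives a barcode drawn uniformly and independently from all 4^L sequences of L random bases. The function name and parameters are hypothetical.

```python
import math


def expected_distinct_barcodes(n_molecules, n_random_bases, alphabet_size=4):
    """Expected number of distinct barcodes among n_molecules labeled molecules.

    Assumption (idealized): each molecule draws its barcode uniformly and
    independently from a pool of alphabet_size ** n_random_bases sequences.
    """
    if n_random_bases < 1:
        raise ValueError("n_random_bases must be at least 1")
    pool = alphabet_size ** n_random_bases
    # P(a given barcode is never drawn) = (1 - 1/pool) ** n_molecules.
    # log1p/expm1 keep the result accurate when 1/pool is tiny.
    log_p_unused = n_molecules * math.log1p(-1.0 / pool)
    return pool * -math.expm1(log_p_unused)
```

Under this model, when the pool is much larger than the number of molecules the expected number of distinct barcodes is almost equal to the number of molecules, whereas a short barcode saturates at the pool size and undercounts.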

As used herein, an "index" is a nucleic acid that serves as a unique label added to nucleic acid molecules for each sample from which they are derived. For example, an index having a different nucleotide sequence can be added for each sample. By adding the same index to all nucleic acid molecules derived from a given sample, the sample from which each individual nucleic acid molecule is derived can be identified from the sequence of the added index when multiple samples are mixed and sequenced together. Because the capacity of a single sequencing run on a next-generation sequencer platform is large, multiple samples can be mixed and sequenced in a single run, and indexes are useful, for example, in such cases. The index may be added before, during, or after processing (e.g., amplification) of the nucleic acid molecules.

As used herein, "template", "target nucleic acid", "target nucleic acid molecule", "nucleic acid of interest", and "nucleic acid molecule of interest" mean a nucleic acid molecule (e.g., DNA or RNA) to be quantified in the digital quantification method, and are used interchangeably. As used herein, the sequence that a target nucleic acid molecule originally has (i.e., the sequence before a barcode or index is added for analysis) is referred to as the target nucleic acid sequence.

As used herein, "nucleic acid" means a macromolecule having a nucleic acid sequence. Nucleic acids include deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). Ribonucleic acids include messenger RNA (mRNA) and non-coding RNAs such as microRNA, transfer RNA (tRNA), and ribosomal RNA (rRNA).

As used herein, "sequencing depth" refers to the total amount or total number of molecules sequenced. For example, a higher sequencing depth (i.e., more sequence information is obtained) may increase the likelihood of detecting sequences that are present only at low abundance in the sample. As used herein, "coverage" means the average number of reads per cluster (reads/cluster), where each cluster is obtained by clustering reads regarded as derived from the same nucleic acid molecule.
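The definition of coverage above amounts to a simple average. A minimal sketch, assuming the read count of each cluster is already available as a list (the function name is hypothetical):

```python
def coverage(reads_per_cluster):
    """Average number of reads per cluster (reads/cluster)."""
    if not reads_per_cluster:
        raise ValueError("at least one cluster is required")
    return sum(reads_per_cluster) / len(reads_per_cluster)
```

For example, three clusters containing 10, 20, and 30 reads give a coverage of 20 reads per cluster.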

As used herein, "unique to each molecule" means that at least some of the molecules contained in the system differ from one another. "Unique to each molecule" can mean that all of the molecules, substantially all of the molecules, or most of the molecules contained in the system (e.g., 50% or more, 60% or more, 70% or more, 80% or more, 90% or more, 95% or more, 96% or more, 97% or more, 98% or more, or 99% or more) differ from one another.

The conventional procedure for the digital quantification of nucleic acids is described below (see panel A of FIG. 1).
DNA containing diverse exogenous sequences (molecular barcodes) is uniquely added to each nucleic acid (target nucleic acid molecule), such as an RNA molecule or a DNA (e.g., complementary DNA or cDNA) molecule; that is, a molecular barcode having a different sequence is added to each nucleic acid molecule (see, e.g., FIG. 6C). A nucleic acid to which a molecular barcode having a sequence unique to each molecule has been added in this way may be referred to as a "barcoded nucleic acid". The barcoded target nucleic acid molecules (cDNA obtained from the RNA when the starting nucleic acid is RNA) are amplified (see, e.g., FIG. 6D). The target nucleic acid sequence and the barcode sequence of the barcoded and amplified nucleic acids are sequenced in tandem (see, e.g., FIG. 6D). As has been proposed theoretically, for each target nucleic acid, the number of unique barcodes attached to the target nucleic acid sequence is counted, rather than the number of amplified molecules (the so-called "read number"), and the absolute copy number of the original (pre-amplification) target nucleic acid molecules can thereby be determined. Because this digital quantification method is concerned with the number of kinds of barcode sequences, the barcode sequences need only be added to the target nucleic acid molecules so that each nucleic acid molecule has a unique sequence; what the specific sequences are does not matter. Barcodes whose specific sequences are known may also be used in the digital quantification method.
Next-generation sequencer platforms have developed to the point where a large number of base sequences can be read in a single sequencing run. As a result, measuring a single sample does not use up the sequencing capacity, and there is a growing need to sequence multiple samples simultaneously in a single run. To distinguish which sample each nucleic acid is derived from while sequencing multiple samples in a single run, a unique index can be added for each sample. According to the present invention, the index added to the target nucleic acid molecules may have any sequence as long as it is unique to each sample; what the specific sequence is does not matter. Indexes whose specific sequences are known may also be used in the digital quantification method.
According to the present invention, the index may be added to the amplified target nucleic acid molecules after amplification, or to the target nucleic acid molecules before amplification. The index may be added after amplification has been carried out for each sample. For example, the index can be added to each amplification product by adapter ligation. Alternatively, the index may be added while the target nucleic acid molecules are being amplified. For example, the index can be added during amplification of the nucleic acid molecules by including it in a primer sequence.
In the present invention, when the index is added to the target nucleic acid molecules before amplification, it may be added before, at the same time as, or after the addition of the barcode sequence. The index, the barcode sequence, and the target nucleic acid molecule may be linked in any order. The index may be provided linked to the barcode sequence. When target nucleic acid molecules contained in a particular sample are subjected to digital quantification using molecular barcodes, the target nucleic acid molecules derived from that sample can be identified using the index as an indicator, the number of kinds of barcode sequences attached to the target nucleic acid sequence (the number of unique barcodes) is counted, and the absolute copy number of the original (pre-amplification) target nucleic acid molecules is determined (see, e.g., FIG. 6D).
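The counting step of the conventional procedure can be sketched as follows. This is a minimal illustration, assuming each read has already been parsed into an (index, target, barcode) triple; the data layout and function name are hypothetical and no error correction is applied. The point is that distinct barcodes, not reads, are counted for each sample and target.

```python
from collections import defaultdict


def digital_counts(reads):
    """Count distinct barcodes per (index, target) pair.

    reads: iterable of (index, target, barcode) tuples, one per read.
    Returns a dict mapping (index, target) to the number of unique barcodes,
    i.e. the estimated number of original molecules.
    """
    barcodes = defaultdict(set)
    for index, target, barcode in reads:
        barcodes[(index, target)].add(barcode)
    return {key: len(found) for key, found in barcodes.items()}
```

Two reads carrying the same barcode contribute a single count, so amplification bias in the read numbers does not affect the result.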

According to the present invention, it has been found that, in a method for digitally quantifying target nucleic acid molecules using indexes and barcodes, when multiple samples are mixed for quantification, a problem can arise in which an index becomes attached to nucleic acids derived from a different, unintended sample (see FIG. 6E and FIG. 7B). This problem can arise whenever indexes are used, and is referred to as index switching, index hopping, misindexing, and the like. The existence of the index switching problem has already been pointed out (Sinha, R. et al. Index switching causes "spreading-of-signal" among multiplexed samples in Illumina HiSeq 4000 DNA sequencing. bioRxiv, 10.1101/125724 (2017)), but no effective solution has been reported to date.
According to the present invention, it has also been found that, when counting the number of kinds of barcode sequences, a problem can arise in which sequences that should be judged identical are recognized as different sequences because of mutations (e.g., insertions, substitutions, and deletions) occurring within the barcode sequences. These problems can arise whether or not indexes are used.

The present invention provides a solution to each of these problems.
In a digital quantification method that uses sample-specific indexes to distinguish samples, it can be assumed that, for target nucleic acid molecules to which a barcode and an index have been added, multiple kinds of indexes are never attached to the same barcode (because a unique barcode is added to each nucleic acid molecule). Accordingly, in the present invention, when multiple indexes are found in a cluster of nucleic acid molecules bearing the same barcode, it can be determined that misindexing has occurred (see, e.g., FIG. 6E and FIG. 7C). When multiple indexes are found in a cluster of nucleic acid molecules bearing the same barcode, the abundances of the index sequences are compared, and the most abundant index sequence is determined to be the correctly attached index (see, e.g., FIG. 6E and FIG. 7C). Misindexing can thereby be dealt with (for example, by excluding sequences other than the most abundant index sequence in a cluster). Naturally, this method can be carried out independently of the sequence of the target nucleic acid molecule. Accordingly, this method may or may not include reading the target nucleic acid sequence. This method corresponds to the first embodiment described below.
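The majority rule described above can be sketched as follows. This is a minimal illustration, assuming reads have already been reduced to (barcode, index) pairs with identical barcodes grouped exactly; the function name and data layout are hypothetical. How ties are resolved is not specified in the text, so the first-seen index is kept here as an arbitrary choice.

```python
from collections import Counter, defaultdict


def assign_indexes(reads):
    """Identify the correct index for each barcode by read-count majority.

    reads: iterable of (barcode, index) pairs, one per read.
    Returns (correct, misindexed): correct maps each barcode to its most
    frequent index; misindexed maps each barcode to the other indexes
    observed with it, which are treated as misindexes.
    """
    per_barcode = defaultdict(Counter)
    for barcode, index in reads:
        per_barcode[barcode][index] += 1

    correct, misindexed = {}, {}
    for barcode, counts in per_barcode.items():
        # most_common() sorts by descending count; ties keep first-seen order
        # (an arbitrary tie-break, not taken from the specification).
        ranked = counts.most_common()
        correct[barcode] = ranked[0][0]
        misindexed[barcode] = [index for index, _ in ranked[1:]]
    return correct, misindexed
```

Reads whose index appears in the `misindexed` list for their barcode can then be excluded before counting clusters for each sample.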

In the digital quantification method, determining whether index sequences and barcode sequences are the same or different affects the accuracy of quantification. For example, whether or not an index is added, if barcode sequences are recognized as different sequences because of base mutations within the sequence (e.g., insertions, substitutions, and deletions), the determination of the number of molecules becomes inaccurate in digital quantification, which uses the number of kinds of sequences to determine the original number of molecules before amplification or the like. To address this, in the present invention, for base substitutions within barcodes, sequences lying within a certain distance (Distance) of one another are clustered into one cluster and the number of molecules is determined based on the number of clusters, which deals with the problem of originally identical sequences being recognized as different sequences because of base substitutions. Here, "distance (Distance)" means the number of bases that differ between two given barcode sequences. For example, if one barcode sequence is exactly identical to another except for one base change at any one position, the distance (Distance) between the two barcode sequences is 1. Likewise, if they are exactly identical except for two base changes at any two positions, the distance (Distance) between the two barcode sequences is 2. Likewise, if one barcode sequence is exactly identical to another except for three base changes at any three positions, the distance (Distance) between the two barcode sequences is 3. It is considered that the greater the diversity of the barcode sequences, the higher the accuracy of the method of the first embodiment. The value of the distance (Distance) may be determined as appropriate in accordance with the present disclosure and is not limited, but is, for example, 1 to 10, preferably 1 to 5, more preferably 1 to 3, and still more preferably 3. Naturally, this method can be carried out independently of the sequence of the target nucleic acid molecule. Accordingly, this method may or may not include reading the target nucleic acid sequence. This method corresponds to the second embodiment described below. In a system in which indexes are added, the method can likewise be used to determine whether indexes are the same or different.
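The distance defined above is the number of mismatched positions between two equal-length sequences (the Hamming distance). The sketch below computes it and then groups barcodes with a simple greedy procedure. The greedy strategy (most-read barcodes become cluster seeds, and each remaining barcode joins the first seed within the threshold) is an assumption made for illustration and is not taken from the specification, which only states that sequences within a certain Distance are placed in one cluster. All names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(read_counts, max_distance):
    """Greedily cluster barcodes lying within max_distance of a seed.

    read_counts: dict mapping barcode sequence -> number of reads.
    Returns a dict mapping each seed barcode to the list of its members.
    Assumption: barcodes with more reads are more likely error-free, so they
    are visited first and act as seeds.
    """
    clusters = {}
    ordered = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for barcode, _ in ordered:
        for seed in clusters:
            if hamming(barcode, seed) <= max_distance:
                clusters[seed].append(barcode)
                break
        else:
            # No existing seed is close enough: start a new cluster.
            clusters[barcode] = [barcode]
    return clusters
```

The number of keys in the returned dictionary is the cluster count used as the estimate of the number of original molecules; a larger `max_distance` merges more substitution variants but also raises the chance of merging barcodes from different molecules.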

Similarly, insertions and deletions of bases within a barcode sequence (collectively referred to as "indels") can be detected, with or without an added index, by making the bases at fixed positions of the barcode fixed bases (that is, by making the base at each predetermined position in the barcode sequence a specified or prescribed base) and using the absence of a fixed base from its predetermined position as an indicator that an indel has occurred. This method is sometimes referred to herein as "fixed base match filtering". That is, when a sequenced barcode sequence contains a base different from the original base at any of the fixed base positions, it is determined that a base insertion or deletion has occurred in the barcode sequence. The number of fixed bases in the barcode sequence may be chosen as appropriate in accordance with the present disclosure and is not limited, but is, for example, 1 to 15, preferably 2 to 12, more preferably 3 to 10, and still more preferably 4 to 6. Naturally, this method can be carried out irrespective of the sequence of the target nucleic acid molecule. The method therefore may or may not include decoding the target nucleic acid sequence. This method corresponds to the third embodiment described below. In a system in which an index is added, it can likewise be used to determine whether indexes are the same or different.
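Fixed base match filtering as described above can be sketched as follows. The barcode layout (a 12-base barcode with fixed bases at 0-based positions 2, 5, 8, and 11) and the names `FIXED_BASES` and `passes_fixed_base_filter` are assumptions made purely for illustration.

```python
# Hypothetical barcode design: 0-based position -> prescribed fixed base.
FIXED_BASES = {2: "A", 5: "T", 8: "G", 11: "C"}


def passes_fixed_base_filter(barcode: str, fixed=FIXED_BASES) -> bool:
    """Return True only if every fixed base is found at its predetermined position."""
    return all(
        pos < len(barcode) and barcode[pos] == base
        for pos, base in fixed.items()
    )


intact = "CGATTTGCGAAC"          # all four fixed bases in place
with_deletion = "CGATTGCGAAC"    # same barcode with the base at position 3 deleted

print(passes_fixed_base_filter(intact))         # True
print(passes_fixed_base_filter(with_deletion))  # False
```

Under this sketch a substitution that happens to fall on a fixed base also fails the filter, so the filter alone does not distinguish that case from an indel.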

The first, second, and third embodiments are each described below. These embodiments can also be carried out in combination, and the present invention encompasses such possible combinations of embodiments. The embodiments below include non-limiting examples of combined implementations.

First embodiment of the present invention
According to the first embodiment of the present invention, there is provided
a method for determining correct pairs or mispairs of an index and a molecular barcode added to a target nucleic acid molecule, from sequence information obtained by sequencing a mixture of a plurality of samples (multiplex sequencing), each sample containing target nucleic acid molecules to which an index unique to that sample and a molecular barcode that is unique to each nucleic acid molecule or arbitrary have been added, the method comprising:
(E) selecting, from the obtained sequence information, sequences having a specific index or sequences similar thereto, sequences having a specific molecular barcode or sequences similar thereto, or sequences containing the target nucleic acid molecule or sequences similar thereto, and creating a group from the selected sequences; and
(F) in the group created in (E), determining the index-molecular barcode pair with the highest detection frequency to be the correct pair of index and molecular barcode, and/or determining at least one or all of the index-molecular barcode pairs with a low detection frequency to be mispairs of index and molecular barcode. A pair with a low detection frequency is, for example, a pair whose detection frequency is below a certain reference value, where the reference value includes, but is not limited to, 99.5% or less, 99% or less, 90% or less, 80% or less, 70% or less, 60% or less, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, or 1% or less within the group. It may also be, for example, a pair ranked second or lower in detection frequency.

In the first embodiment of the present invention, the method of the present invention may further comprise:
(A) separately obtaining a plurality of samples containing nucleic acid molecules (for example, DNA or RNA) {at least one of the samples contains the target nucleic acid molecule};
(B) {for example, in each of the obtained samples,} before amplifying the nucleic acid molecules contained in the sample, linking to each target nucleic acid molecule a molecular barcode that is unique to each nucleic acid molecule or arbitrary, thereby obtaining target nucleic acid molecules each linked to a different molecular barcode;
(C) {for example, before mixing the plurality of samples,} adding to the target nucleic acid molecules an index unique to each sample containing a plurality of target nucleic acid molecules, thereby obtaining a library of target nucleic acid molecules linked to an index that differs according to the sample of origin (step (C) may be performed after step (B), or step (B) may be performed after step (C); the nucleic acid molecules can be amplified after step (B) or (C) to obtain amplification products of the target nucleic acid molecules); and
(D) in a mixture containing the amplification products of the nucleic acid molecules obtained after (B) and (C), sequencing the nucleic acid molecules to which an index unique to each sample and a molecular barcode that is unique to each target nucleic acid molecule or arbitrary have been added, and determining, for each nucleic acid molecule, the sequence of the index portion, the sequence of the molecular barcode portion and, if necessary, the sequence of the target nucleic acid molecule portion linked thereto. (The samples are mixed after step (C); step (B) may be performed after the samples are mixed, or all samples may be mixed after step (B). The amplification products of the nucleic acid molecules linked to molecular barcodes are obtained after step (B); the samples may be mixed before the amplification products are obtained, or samples containing the amplification products may be mixed after they are obtained.)

In the first embodiment of the present invention, the index may have any sequence as long as it has a base sequence unique to each sample. The index may have a predetermined sequence (for example, so that the sample of origin can be established by referring to the sequence), or its sequence may be unknown (for example, the sample of origin cannot be established by referring to the sequence, but a difference in sequence shows that the molecules derive from different samples).

In the first embodiment of the present invention, the molecular barcodes can be prepared so as to have sufficient diversity relative to the number of nucleic acid molecules in the sample. A molecular barcode may have any base sequence as long as it has sufficient diversity relative to the number of nucleic acid molecules in the sample. For purposes such as saving the effort of designing sequences, the sequence of the molecular barcode can be a randomly determined sequence. For example, the molecular barcode may achieve the above sufficient diversity by containing a plurality of randomly determined bases (that is, random bases). To ensure the diversity of the molecular barcodes, the base sequence of the molecular barcode can be made longer. When random bases are used in digital quantification of target nucleic acids having a given diversity, the number of random bases required in the base sequence of the molecular barcode may be determined experimentally on the basis of a graph such as that illustrated in FIG. 12. Without limiting the present invention, it can be understood from the Examples that, for example, setting the number of random bases in the base sequence of the molecular barcode to 38 or more ensures diversity sufficient to digitally quantify up to 10^15 molecules. When the four bases are arranged at random over a length of 38 bases, the diversity of the molecular barcode theoretically reaches 4^38 (that is, about 7.56 × 10^22). To ensure sequence diversity, the number of random bases in the molecular barcode can be, for example, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, or 20 or more. Alternatively, the number of random bases may be 25 or more, 30 or more, 35 or more, or 40 or more.
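The theoretical diversity quoted above follows directly from the number of random bases, since each random position can take any of four bases. The short check below reproduces the figure for 38 random bases; the function name `barcode_diversity` is hypothetical.

```python
def barcode_diversity(n_random_bases: int, alphabet_size: int = 4) -> int:
    """Theoretical number of distinct barcodes with n random positions."""
    return alphabet_size ** n_random_bases


diversity = barcode_diversity(38)
print(diversity)            # exact integer value of 4**38
print(f"{diversity:.2e}")   # 7.56e+22
```

This is only the theoretical upper bound on the number of distinct barcodes; the number of random bases actually required for a given experiment is, as stated above, to be determined experimentally.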

In the first embodiment of the present invention, "a plurality of samples" means 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, or 10 or more samples. The number must be one that the indexes can distinguish, but there is no particular upper limit.

In the first embodiment of the present invention, in step (E), sequences having a specific index, sequences having a specific molecular barcode, and/or sequences containing the target nucleic acid molecule can be selected on the basis of sequence identity, and groups can be formed from the selected sequences. Selecting sequences having a specific molecular barcode and forming a group for each molecular barcode yields a number of groups corresponding to the number of molecular barcode types. Selecting sequences having a specific index and forming a group for each index yields a number of groups corresponding to the number of indexes (for example, the number of samples when a different index is added to each sample). Selecting sequences having a specific target nucleic acid and forming a group from them yields a group of nucleic acids containing the target nucleic acid.

In the first embodiment of the present invention, step (E) may be carried out by creating groups through clustering, as one group, molecules presumed {preferably on the basis of the sequence of the molecular barcode portion} to have originally had the same sequence, as judged by sequence identity or similarity {for example, the sequence may be changed by any of steps (A) to (D)}.
In the first embodiment of the present invention, step (E) can also be carried out, for example, in combination with the second embodiment. Details are given in the description of the second embodiment.
In the first embodiment of the present invention, step (E) can furthermore be carried out, for example, in combination with the second and third embodiments. Details are given in the description of the third embodiment.

In the first embodiment of the present invention, in step (F), for each of the groups created in step (E), the index-molecular barcode pair with the highest detection frequency can be determined to be the correct pair of index and molecular barcode. Alternatively, at least one or all of the index-molecular barcode pairs with a low detection frequency can be determined to be mispairs of index and molecular barcode. Both determinations can also be made together, so that the pair with the highest detection frequency is determined to be the correct pair and at least one or all of the pairs with a low detection frequency are determined to be mispairs. Nucleic acid molecules determined to be mispairs can be excluded from the count of the number of molecules. The determination of correct pairs and the determination of mispairs can each be carried out irrespective of the sequence of the target nucleic acid molecule. For example, the target nucleic acid molecule may be selected first and the correct pairs and mispairs determined afterwards, or the correct pairs and mispairs may be determined first and the target nucleic acid molecule selected afterwards.
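The determination in step (F) can be illustrated with a minimal sketch for the case in which groups are created per molecular barcode. The input format (a list of `(index, barcode)` tuples, one per read), the function name `classify_pairs`, and the parameter `max_mispair_fraction` are assumptions for illustration only. Ties for the highest frequency are resolved arbitrarily in this sketch.

```python
from collections import Counter, defaultdict


def classify_pairs(reads, max_mispair_fraction=None):
    """Split observed (index, barcode) pairs into correct pairs and mispairs.

    reads: iterable of (index, barcode) tuples, one per sequenced read.
    max_mispair_fraction: if None, every pair ranked second or lower within
        a barcode group is called a mispair; otherwise only lower-ranked pairs
        whose share of the group's reads is at or below this value are.
    """
    per_barcode = defaultdict(Counter)
    for index, barcode in reads:
        per_barcode[barcode][index] += 1

    correct, mispairs = set(), set()
    for barcode, counts in per_barcode.items():
        total = sum(counts.values())
        ranked = counts.most_common()
        correct.add((ranked[0][0], barcode))  # highest detection frequency
        for index, n in ranked[1:]:
            if max_mispair_fraction is None or n / total <= max_mispair_fraction:
                mispairs.add((index, barcode))
    return correct, mispairs


reads = [("S1", "AAAC")] * 98 + [("S2", "AAAC")] * 2 + [("S2", "GGTT")] * 50
correct, mispairs = classify_pairs(reads)
print(sorted(correct))   # [('S1', 'AAAC'), ('S2', 'GGTT')]
print(sorted(mispairs))  # [('S2', 'AAAC')]
```

Reads belonging to the pairs returned as mispairs would then be left out of the molecule count.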

For example, in one aspect, when sequences having a specific molecular barcode are selected in step (E) to create a group for each molecular barcode:
(i) in step (F), the index-molecular barcode pair with the highest detection frequency in the created group can be determined to be the correct pair of index and molecular barcode; or
(ii) in step (F), at least one or all of the index-molecular barcode pairs with a low detection frequency in the created group can be determined to be mispairs of index and molecular barcode. A pair with a low detection frequency is, for example, a pair whose detection frequency is below a certain reference value, where the reference value includes, but is not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, or 1% or less within the group. It may also be, for example, a pair ranked second or lower in detection frequency.

For example, in one aspect, when sequences having a specific index are selected in step (E) to create a group for each index:
(iii) in step (F), for nucleic acid sequences having a molecular barcode that appears in more than one group, the barcode-index pair with the largest number of reads can be determined to be the correct pair of barcode and index, or the index-molecular barcode pair with the highest detection frequency can be determined to be the correct pair of index and molecular barcode; or
(iv) in step (F), for sequences having a molecular barcode that appears in more than one group, at least one or all of the index-molecular barcode pairs with a low detection frequency can be determined to be mispairs of index and molecular barcode. A pair with a low detection frequency is, for example, a pair whose detection frequency is below a certain reference value, where the reference value includes, but is not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, or 1% or less within the group. It may also be, for example, a pair ranked second or lower in detection frequency.

For example, in one aspect, when sequences containing the target nucleic acid molecule are selected in step (E) to create a group:
(v) in step (F), sequences having a specific index can be further selected from the group to create subgroups, and, for nucleic acid sequences having a molecular barcode that appears in more than one subgroup, the barcode-index pair with the largest number of reads can be determined to be the correct pair of barcode and index, or the index-molecular barcode pair with the highest detection frequency can be determined to be the correct pair of index and molecular barcode;
(vi) in step (F), molecules having a specific molecular barcode can be further selected from the group to create subgroups, and, within one created subgroup, the index-molecular barcode pair with the highest detection frequency can be determined to be the correct pair of index and molecular barcode;
(vii) in step (F), molecules having a specific index can be further selected from the group to create subgroups, and, for nucleic acid molecules having a molecular barcode that appears in more than one subgroup, at least one or all of the index-molecular barcode pairs with a low detection frequency can be determined to be mispairs of index and molecular barcode; or
(viii) in step (F), molecules having a specific molecular barcode can be further selected from the group to create subgroups, and, within one created subgroup, at least one or all of the index-molecular barcode pairs with a low detection frequency can be determined to be mispairs of index and molecular barcode.
In (vii) and (viii), a pair with a low detection frequency is, for example, a pair whose detection frequency is below a certain reference value, where the reference value includes, but is not limited to, 50% or less, 40% or less, 30% or less, 20% or less, 10% or less, 5% or less, or 1% or less within the group. It may also be, for example, a pair ranked second or lower in detection frequency.

In this way, in the first embodiment of the present invention, correct pairs of barcode sequence and index sequence can be determined, and/or mispairs can be determined. As shown in the Examples described later, not counting mispairs can improve the accuracy of digital quantification of the target nucleic acid molecule.

Second embodiment of the present invention
In digital quantification of nucleic acid molecules using barcode sequences, it has become clear that mutations (insertions, substitutions, or deletions) arise within the barcode sequence during analysis and that these mutations affect the accuracy of quantification. The second embodiment of the present invention relates to digital quantification of a target nucleic acid molecule using barcode sequences in which, from the information on the obtained nucleic acid sequences, the sequence of a mutated molecular barcode is classified into one group together with other sequences having sequence similarity to it (clustering). This is intended to minimize the effect of mutations arising within the barcode sequence during analysis. The second embodiment is based on the observation that, for example, under conditions in which sequences similar to a given molecular barcode are unlikely to be present, similar sequences are likely to have arisen from the same sequence by mutation (insertion, substitution, or deletion). The Examples indeed suggest that this clustering improves the accuracy of digital quantification.
More specifically, for example, the step of creating groups may consist of creating groups by clustering, as one group, molecules presumed {preferably on the basis of the sequence of the molecular barcode portion} to have had the same sequence, as judged by sequence identity or similarity {for example, when steps (A) to (D) are carried out, the sequence may be mutated by any of these steps}. Accordingly, "sequences having similarity to a sequence having a specific index" include sequences having the specific index and sequences having similarity to a sequence having the specific index.

In the second embodiment of the present invention, for example, in digital quantification of a target nucleic acid molecule using barcode sequences, the obtained nucleic acid sequences can be divided into groups (clustered) by index, barcode, and/or target nucleic acid molecule on the basis of sequence similarity. In one aspect of the second embodiment of the present invention, for example, the clustering is performed by:
(i) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence identical to that of a unique molecular barcode (that is, Distance = 0);
(ii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 1 base mismatch relative to that of a unique molecular barcode (that is, Distance = 1);
(iii) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 2 base mismatches relative to that of a unique molecular barcode (that is, Distance = 2); or
(iv) classifying into the same cluster nucleic acid molecules whose molecular barcode portion has a sequence with up to 3 base mismatches relative to that of a unique molecular barcode (that is, Distance = 3).
This corrects the artificial increase in the number of nucleic acid sequence types caused by mutations of 0 to 3 bases that can arise in digital quantification.
When combined with the first embodiment, this aspect of the second embodiment can be carried out in step (E) above.
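The distance-based clustering in (i) to (iv) can be sketched as below. The disclosure does not prescribe a particular clustering algorithm in this passage, so the greedy strategy used here (barcodes are visited from most to least frequent, and each joins the first existing cluster whose representative lies within the chosen distance) is an assumption, as are the names `hamming` and `cluster_barcodes`.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(barcodes, max_distance=1):
    """Greedy clustering of barcode reads by substitution distance.

    Returns a dict mapping each cluster representative to its member barcodes.
    The number of keys is the estimated number of original molecules.
    """
    counts = Counter(barcodes)
    clusters = {}
    for barcode, _ in counts.most_common():  # most frequent barcodes first
        for center in clusters:
            if len(center) == len(barcode) and hamming(center, barcode) <= max_distance:
                clusters[center].append(barcode)
                break
        else:
            clusters[barcode] = [barcode]  # no close cluster: start a new one
    return clusters


reads = ["ACGTAC"] * 10 + ["ACGTAA"] + ["TTGCAT"] * 8 + ["TTGCAA"]
print(len(cluster_barcodes(reads, max_distance=0)))  # 4 distinct sequences
print(len(cluster_barcodes(reads, max_distance=1)))  # 2 clusters
```

In this toy input the two single-read barcodes each differ by one base from an abundant barcode, so counting clusters at Distance = 1 gives two molecules rather than four.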

In the second embodiment of the present invention, for example, in digital quantification of a target nucleic acid molecule using barcode sequences, the obtained nucleic acid sequences can be divided into groups (clustered) by index, barcode, and/or target nucleic acid molecule on the basis of sequence similarity. In one aspect of the second embodiment of the present invention, the clustering is performed, for example, by classifying into the same cluster nucleic acid molecules whose molecular barcode portion was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases).
When combined with the first embodiment, this aspect of the second embodiment can be carried out in step (E) above.

In the second embodiment of the present invention, for example, in digital quantification of a target nucleic acid molecule using barcode sequences, the obtained nucleic acid sequences can be divided into groups (clustered) by index, barcode, and/or target nucleic acid molecule on the basis of sequence similarity. In one aspect of the second embodiment of the present invention, the clustering is performed, for example, on the nucleic acid molecules that remain after excluding sequences whose molecular barcode portion was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases).
When combined with the first embodiment, this aspect of the second embodiment can be carried out in step (E) above.

Also, for example, in one aspect of the second embodiment, nucleic acid sequences can be selected according to whether they are similar to the sequence of a specific barcode, and a group can be created from the selected sequences. Here, "similar" means that the sequences differ by 1, 2, 3, or more bases (for example, through insertion, deletion, or substitution) but match at all other bases. The proportion of matching bases between similar base sequences can be, for example, 50% or more, 55% or more, 60% or more, 65% or more, 70% or more, 75% or more, 80% or more, 85% or more, 90% or more, 95% or more, 96% or more, 97% or more, 98% or more, or 99% or more.

本発明の第３の実施形態
In digital quantification of target nucleic acid molecules using barcode sequences, insertions or deletions (indels) may occur in the nucleic acid sequences obtained. In the third embodiment of the present invention, an indel that may arise in a nucleic acid sequence (particularly a barcode sequence) can be detected by checking whether some (one or more) or all of the one or more (for example, one, two, three, four, five, or six or more) fixed bases placed in every barcode sequence linked to the nucleic acid molecules have changed, at their original positions, to a base other than the designated fixed base. The third embodiment may further include identifying such an indel by comparing the position of each of the one or more fixed bases placed in every barcode sequence with the position of each of the one or more fixed bases in the sequenced barcode portion, that is, by comparing their relative positions {for example, each fixed base can usually be designed to be any one base selected from the group consisting of A, T, G, and C; or each fixed base can be designed to be a base selected from the bases included in any one combination selected from the group consisting of A and T; A and G; A and C; T and G; T and C; G and C; A, T, and G; A, T, and C; A, G, and C; and T, G, and C}.
An indel can thereby be detected using, as an indicator, the presence of one or more fixed bases at positions shifted from their designated positions and, preferably as a further indicator, the presence of another base at the position where a fixed base should be. For example, if one or more, for example two or more, fixed bases are each found at positions shifted from their designated positions by the same number of bases, it can be determined that an indel has been detected. When an indel is detected, the nucleic acid molecules whose sequences were read as having the indel may be classified into the same cluster as the sequences without the indel, or they may be excluded (for example, they may be excluded from the obtained sequence information, or the nucleic acid molecules may be clustered after they have been excluded). In this aspect, when two or more fixed bases are present, one or more other bases are preferably interposed between the fixed bases.
Here, a "fixed base" means a base that is common to a plurality of barcode sequences and is located at a designated position from an end of the barcode sequence (the 5' end, the 3' end, or both the 5' and 3' ends) (the common base may be a base determined by a design shared among the plurality of barcode sequences, as described above).
When combined with the first embodiment, this aspect of the third embodiment can be carried out in step (E) above. When combined with the second embodiment, this aspect of the third embodiment can be carried out in the detection of indels.
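The fixed-base check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the inventors' software: the barcode layout `FIXED` (0-based positions mapped to expected bases), the function name `classify_barcode`, and the search window `max_shift` are all hypothetical.

```python
# Hypothetical barcode design: expected fixed bases at 0-based positions.
# A real design (length, positions, bases) would come from the barcode in use.
FIXED = {4: "A", 9: "T", 14: "G"}


def classify_barcode(read, fixed=FIXED, max_shift=2):
    """Classify a sequenced barcode region by where its fixed bases appear.

    Returns ("ok", 0) if every fixed base is at its designed position,
    ("indel", shift) if every fixed base is displaced by the same non-zero
    shift (negative: bases lost upstream, positive: bases gained upstream),
    and ("unknown", None) otherwise.
    """
    def all_match(shift):
        for pos, base in fixed.items():
            i = pos + shift
            if i < 0 or i >= len(read) or read[i] != base:
                return False
        return True

    if all_match(0):
        return ("ok", 0)
    # Try the smallest displacements first, in both directions.
    for magnitude in range(1, max_shift + 1):
        for shift in (-magnitude, magnitude):
            if all_match(shift):
                return ("indel", shift)
    return ("unknown", None)
```

A random base can match a fixed base by chance, so a sketch like this is more reliable with several fixed bases separated by random bases, as the text prefers. Reads classified as `"indel"` could then either be merged into the cluster of the corresponding intact barcode or be dropped, which are the two options described above.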

The first embodiment of the present invention may also be a method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of target nucleic acid molecules, to which a molecular barcode and an index have been added, to sequencing to obtain sequence information;
(II) selecting, from the sequence information obtained in (I), sequences having a specific index or sequences similar thereto, and/or sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences; and
(III) in the group created in (II), determining the index and molecular barcode pair with the highest detection frequency to be the correct index and molecular barcode pair.
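Step (III) amounts to a majority vote among the indexes observed with a given molecular barcode. The following Python sketch assumes that reads have already been reduced to `(index, barcode)` tuples and that grouping is by exact barcode match. The function name `call_correct_pairs` is hypothetical, and because the text does not say how ties are handled, tied barcodes are simply left uncalled here.

```python
from collections import Counter, defaultdict


def call_correct_pairs(reads):
    """Pick the most frequently observed index for each barcode.

    reads: iterable of (index, barcode) tuples, one per sequenced read.
    Returns (correct, mispairs): `correct` maps barcode -> index, and
    `mispairs` is the set of (index, barcode) pairs judged to be mispairs.
    Barcodes whose two most frequent indexes tie are left out of both
    (an assumption made for this sketch).
    """
    by_barcode = defaultdict(Counter)
    for index, barcode in reads:
        by_barcode[barcode][index] += 1

    correct, mispairs = {}, set()
    for barcode, counts in by_barcode.items():
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # tie: no single pair has the highest frequency
        correct[barcode] = ranked[0][0]
        mispairs.update((idx, barcode) for idx, _ in ranked[1:])
    return correct, mispairs
```

The pairs returned in `mispairs` are the ones that could be excluded from downstream analysis.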

Further, the second embodiment of the present invention may be a method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules, to which molecular barcodes have been added, to sequencing to obtain sequence information; and
(II) selecting, from the sequence information obtained in (I), sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences.

Further, the third embodiment of the present invention may be a method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules, to which molecular barcodes having a fixed base at a specific position have been added, to sequencing to obtain sequence information; and
(IIa) excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position.

In each of the first, second, and third embodiments, the target nucleic acid molecules to which at least a molecular barcode has been added may have been subjected to amplification before step (I). Here, "target nucleic acid molecules to which at least a molecular barcode has been added" means that, as long as a molecular barcode has been added, an index may or may not have been added as well.

In each of the first, second, and third embodiments, the molecular barcode can be added to the target nucleic acid molecule by a well-known method, for example when the target nucleic acid molecule is amplified (for example, by polymerase chain reaction) using a primer that contains the molecular barcode sequence.
In each of the first, second, and third embodiments, the index may have been added to the amplification product of the target nucleic acid molecule to which the molecular barcode has been added. Well-known methods for adding an index to an amplification product include, for example, adapter ligation using an adapter that has the index sequence.
In each of the first, second, and third embodiments, the index may be added to the target nucleic acid molecule together with the molecular barcode. Well-known methods for adding an index and a molecular barcode to a target nucleic acid molecule include, for example, amplifying the target nucleic acid molecule (for example, by polymerase chain reaction) using a primer that contains the index and molecular barcode sequences.

The method of the first embodiment can be carried out in combination with the second embodiment. For example, in each of the first and second embodiments, a sequence similar to the sequence having the specific molecular barcode in step (II) may be a sequence whose molecular barcode portion contains no more than a predetermined number of mismatched bases relative to the sequence having the specific molecular barcode. Here, the predetermined number of bases may be an integer in the range of 1 to 10, 1 to 9, 1 to 8, 1 to 7, 1 to 6, 1 to 5, 1 to 4, 1 to 3, or 1 to 2, or may be 0, 1, 2, or 3. In a sequence whose molecular barcode portion contains no more than the predetermined number of mismatched bases, the bases other than the mismatched bases match the sequence of the specific molecular barcode exactly.

The method of the first embodiment can be carried out in combination with the third embodiment. The method of the second embodiment can likewise be carried out in combination with the third embodiment.
For example, in each of the first and second embodiments, the molecular barcode may have a fixed base at a specific position.
In each of the first and second embodiments, a sequence similar to the sequence having the specific molecular barcode in step (II) may be selected on the basis that it contains the fixed base at the specific position and/or that the position of the fixed base is shifted from the specific position.
Each of the first and second embodiments may further include excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position. For example, in this embodiment, whether the molecular barcodes are clustered at Distance = 0 or at Distance = 1 or more, the method may further include excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position. In this case, the exclusion may take place before, after, or during clustering.
In each of the first and second embodiments, sequences having a molecular barcode that does not contain the fixed base at the specific position may be excluded from the sequence information of step (I), from the group created in step (II), or from the analysis.
Alternatively, in each of the first, second, and third embodiments, sequence information consisting of sequences that contain the fixed base at the specific position may be obtained in step (I) or after step (I). Alternatively, in the first embodiment, a group consisting of sequences that contain the fixed base at the specific position may be obtained in step (II) or after step (II). That is, instead of step (IIa), the nucleic acid analysis method of the third embodiment may include step (IIb): obtaining, in step (I) or after step (I), sequence information consisting of sequences that contain the fixed base at the specific position in the molecular barcode portion; or it may include step (II): selecting, from the sequence information obtained in (I), sequences having a specific molecular barcode or sequences similar thereto and creating a group from the selected sequences, together with step (IIc): obtaining, in step (II) or after step (II), a group consisting of sequences that contain the fixed base at the specific position in the molecular barcode portion. The sequence information or group consisting of sequences that contain the fixed base at the specific position in the molecular barcode portion may consist of sequences that contain fixed bases at all of the specific positions. When the number of fixed bases is n {where n is a natural number}, the sequence information or group may consist of sequences that contain n, or n − m {where m may be 1, 2, 3, or a natural number in the range of 1 to n − 1}, fixed bases at the specific positions.
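The selection of sequences that keep n, or n − m, fixed bases at their specific positions can be illustrated with a short Python filter. This is a sketch under assumptions: the design map `fixed`, the function names, and the reading of "n − m" as "at least n − m fixed bases in place" are choices made for this example, not statements from the text.

```python
def count_fixed_bases(barcode, fixed):
    """Count how many designed fixed bases sit at their designed positions.

    fixed: dict mapping 0-based position -> expected base (hypothetical design).
    """
    return sum(
        1 for pos, base in fixed.items()
        if pos < len(barcode) and barcode[pos] == base
    )


def keep_barcodes(barcodes, fixed, m=0):
    """Keep barcodes with at least n - m fixed bases in place, n = len(fixed).

    m = 0 demands every fixed base; m >= 1 tolerates up to m missing ones.
    """
    n = len(fixed)
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m <= n - 1")
    return [bc for bc in barcodes if count_fixed_bases(bc, fixed) >= n - m]
```

With `m=0` the filter corresponds to excluding every sequence that lacks a fixed base at a specific position, as in step (IIa). A larger `m` keeps reads in which a fixed base was hit by a substitution error, at the cost of letting more erroneous barcodes through.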

In each of the first, second, and third embodiments, in step (III), index and molecular barcode pairs other than the determined correct pair may be determined to be index and molecular barcode mispairs, or the determined mispairs may be excluded from the analysis.

In each of the first, second, and third embodiments, the nucleic acid analysis method may further include a step of determining the number of target nucleic acid molecules contained in the sample from which the target nucleic acid molecules are derived, based on the number of groups created from sequences having a specific molecular barcode or sequences similar thereto.

Those skilled in the art will understand that the first, second, and third embodiments of the present invention can be freely combined with one another. For example, the first embodiment can be combined with the second embodiment, and the first embodiment can also be combined with the third embodiment. The first embodiment may be combined with both the second and third embodiments. Furthermore, the second embodiment can be combined with the third embodiment.

Fourth Embodiment of the Present Invention
The fourth embodiment of the present invention relates to a method for digital quantification of target nucleic acid molecules using barcode sequences, the method including carrying out an embodiment selected from the group consisting of the first embodiment, the second embodiment, and the third embodiment of the present invention, and combinations thereof.

The fourth embodiment of the present invention may be a method for digital quantification of target nucleic acid molecules using barcode sequences, the method comprising:
(e) selecting, from the obtained sequence information, nucleic acid molecules that contain the sequence of the target nucleic acid molecule;
(f) clustering the nucleic acid molecules selected in (e) by unique molecular barcode sequence, and then identifying clusters that have a plurality of sequences in the index portion; and
(g) in each of the clusters identified in (f), identifying the index and molecular barcode pair with the highest detection frequency as a correctly indexed target nucleic acid molecule, and determining the other index and molecular barcode pairs to be mispairs
{the method may further include determining that the index is incorrect in a mispair},
wherein the number of target nucleic acid molecules contained in the sample corresponding to an index is determined based on the number of distinct unique molecular barcode sequences linked to correctly indexed target nucleic acid molecules (or the number of clusters of correctly indexed target nucleic acid molecules).
In one aspect, in step (g), the number of distinct unique molecular barcode sequences linked to correctly indexed target nucleic acid molecules (or the number of clusters of correctly indexed target nucleic acid molecules) may be taken as the number of target nucleic acid molecules contained in the sample corresponding to the index. In principle, the accuracy of quantification is expected to increase as the number of reads increases.
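Steps (e) to (g) and the final count can be put together in a few lines of Python. This is a simplified sketch: reads are assumed to arrive as `(index, barcode, target_name)` tuples, clustering is by exact barcode match only, ties between indexes are broken by whichever index was seen first, and the function name `count_molecules` is hypothetical.

```python
from collections import Counter, defaultdict


def count_molecules(reads, target):
    """Estimate molecules of `target` per sample index from barcoded reads.

    reads: iterable of (index, barcode, target_name) tuples.
    Returns a dict mapping index -> number of distinct barcodes assigned to it.
    """
    # (e) keep only reads that contain the target sequence of interest
    selected = [(idx, bc) for idx, bc, name in reads if name == target]

    # (f) cluster by exact barcode sequence, tallying the indexes per cluster
    clusters = defaultdict(Counter)
    for idx, bc in selected:
        clusters[bc][idx] += 1

    # (g) the most frequent index in each cluster is taken as correct; any
    # other index seen with that barcode is a mispair and is not counted
    molecules = Counter()
    for counts in clusters.values():
        best_index, _ = counts.most_common(1)[0]
        molecules[best_index] += 1
    return dict(molecules)
```

Each barcode cluster contributes exactly one molecule to exactly one index, so a barcode that leaks into another sample's reads does not inflate that sample's count.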

The fourth embodiment of the present invention may further include:
(a) separately obtaining a plurality of samples containing nucleic acid molecules (for example, DNA or RNA) {at least one of the samples contains the target nucleic acid molecule};
(b) before amplifying the nucleic acid molecules contained in the samples, linking an arbitrary molecular barcode to each target nucleic acid molecule in each of the obtained samples, to obtain target nucleic acid molecules each linked to a different molecular barcode;
(c) before mixing the plurality of samples, adding to the target nucleic acid molecules an index unique to each sample containing a plurality of target nucleic acid molecules, to obtain a library of target nucleic acid molecules linked to an index that differs for each sample of origin (steps (b) and (c) may be performed in either order; the nucleic acid molecules can also be amplified after step (b) or (c) to obtain amplification products of the target nucleic acid molecules); and
(d) in a mixture containing the amplification products of the nucleic acid molecules obtained after (b) and (c) (the samples are mixed after step (c); step (b) may be performed after the samples are mixed, or all the samples may be mixed after step (b) has been performed. The amplification products of the nucleic acid molecules linked to molecular barcodes are obtained after step (b); the samples may be mixed before the amplification products are obtained, or samples containing the amplification products may be mixed after the amplification products have been obtained), sequencing the nucleic acid molecules to which an index unique to each sample and a molecular barcode unique to, or arbitrary for, each target nucleic acid molecule have been added, to identify, for each nucleic acid molecule, the sequence of the index portion, the sequence of the molecular barcode portion, and the sequence of the nucleic acid molecule portion linked thereto.

In the fourth embodiment of the present invention, for example, as described for the second embodiment, the clustering in (f) may be performed by:
(i) classifying into the same cluster the nucleic acid molecules whose molecular barcode portion has a sequence identical to the unique molecular barcode sequence;
(ii) classifying into the same cluster the nucleic acid molecules whose molecular barcode portion has a sequence with up to 1 base of mismatch relative to the unique molecular barcode sequence;
(iii) classifying into the same cluster the nucleic acid molecules whose molecular barcode portion has a sequence with up to 2 bases of mismatch relative to the unique molecular barcode sequence; or
(iv) classifying into the same cluster the nucleic acid molecules whose molecular barcode portion has a sequence with up to 3 bases of mismatch relative to the unique molecular barcode sequence.
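Options (i) to (iv) differ only in the mismatch threshold, which the following Python sketch exposes as `max_mismatches` (0, 1, 2, or 3). The greedy strategy used here, in which the most abundant barcode seeds a cluster and less abundant neighbours are absorbed into it, is one simple way to implement such clustering and is an assumption of this example; the text does not describe the algorithm of the in-house software.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(barcodes, max_mismatches=1):
    """Greedy clustering of barcodes by mismatch count.

    barcodes: iterable of barcode strings, one per read.
    Returns a dict mapping each cluster's seed barcode -> total read count.
    Barcodes are visited from most to least abundant; each joins the first
    seed of the same length within `max_mismatches`, otherwise it founds
    a new cluster.
    """
    counts = Counter(barcodes)
    clusters = {}
    # Sort by read count (descending), then by sequence for reproducibility.
    for bc, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for seed in clusters:
            if len(seed) == len(bc) and hamming(seed, bc) <= max_mismatches:
                clusters[seed] += n
                break
        else:
            clusters[bc] = n
    return clusters
```

The number of keys in the returned dict is the number of clusters, which is the quantity used as the molecule count. Raising the threshold merges more error-derived barcodes but also risks merging genuinely different barcodes when the barcode set is not diverse enough.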

In the fourth embodiment of the present invention, for example, as described for the second embodiment, the clustering in (e) may be performed by classifying into the same cluster the nucleic acid molecules whose molecular barcode portion was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases). In doing so, the molecular barcode having fixed bases described in the third embodiment may be used.

In the fourth embodiment of the present invention, for example, as described for the second embodiment, the clustering in (e) may be performed on the nucleic acid molecules that remain after excluding sequences whose molecular barcode portion was sequenced as having an insertion or deletion (indel) of bases (for example, up to 1 base, up to 2 bases, or up to 3 bases). In doing so, the molecular barcode having fixed bases described in the third embodiment may be used.

In this way, errors in nucleic acid sequences that may arise in digital quantification can be corrected, and the accuracy of digital quantification can be improved. That is, according to the present invention, accurate digital quantification becomes possible by using a number of molecular barcodes that is sufficiently large compared with the number of original target nucleic acid molecules in the sample, so that each target nucleic acid molecule is labeled with a molecular barcode of a different sequence, and by obtaining a number of reads that is sufficiently large compared with the number of original target nucleic acid molecules, so that all the molecular barcodes added to the target nucleic acid molecules are detected.
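How many barcodes count as "sufficiently large" can be estimated with a standard occupancy calculation. The Python sketch below assumes, as a simplification that the text does not state, that every molecule receives one of 4^r equally likely barcodes, where r is the number of random bases; the function names are hypothetical.

```python
import math


def _diversity(n_random_bases):
    """Number of possible barcodes for a given count of random bases."""
    if n_random_bases < 1:
        raise ValueError("need at least one random base")
    return 4 ** n_random_bases


def expected_distinct_barcodes(n_molecules, n_random_bases):
    """Expected number of distinct barcodes among `n_molecules` molecules,
    each tagged uniformly at random (an assumption of this sketch)."""
    d = _diversity(n_random_bases)
    # D * (1 - (1 - 1/D)**N), written with log1p/expm1 for numerical stability
    return -d * math.expm1(n_molecules * math.log1p(-1.0 / d))


def prob_all_unique(n_molecules, n_random_bases):
    """Probability that all molecules receive different barcodes."""
    d = _diversity(n_random_bases)
    if n_molecules > d:
        return 0.0
    log_p = sum(math.log1p(-k / d) for k in range(n_molecules))
    return math.exp(log_p)
```

If the expected number of distinct barcodes falls noticeably below the number of molecules, the barcode count will underestimate the molecule count, so a longer random region would be needed for that dynamic range.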

現代のビッグデータ時代の生物学において、システムワイドな測定における生物分子の正確な定量が必要とされている。なぜなら、分析の質は最初の生データに高度に依存するからである。このため、ＤＮＡタグ（「プライマーＩＤ（primer ID）」^１、「ＵＭＩ（unique molecular identifier）」、または「分子バーコード（molecular barcode）」と称する）を使用した核酸分子のデジタル定量がこれまでに開発されている。この技術は、ＲＮＡシークエンスによる遺伝子発現解析（ＲＮＡ−Ｓｅｑ）^２−７、ｉＣＬＩＰ（individual-nucleotide resolution UV cross-linking and immunoprecipitation）^８、抗体レパトワ解析^９、細菌１６ＳｒＲＮＡ遺伝子解析^{１０，１１}、およびＣｈＩＰ−ｎｅｘｕｓ（chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode and single ligation）^１２のような次世代シークエンスプラットフォームにおける多くの応用のために使用されている。これらの方法により、測定系におけるノイズおよび／またはバイアスの存在下であっても、所定のサンプル中の分子の絶対数をデジタル的に正確に決定することが可能になる。分子バーコードを使用するＲＮＡ−Ｓｅｑ、すなわち、デジタルＲＮＡ−Ｓｅｑ（ｄＲＮＡ−Ｓｅｑ）^３または定量的ＲＮＡ−Ｓｅｑ^１３は、デジタルカウントの最も広く使用される応用の１つである。ｄＲＮＡ−Ｓｅｑは、小さなサンプルサイズについてさえも良好に機能するので、単一細胞遺伝子発現解析にしばしば使用されている。このような測定において、検出限界は重要である。なぜなら、単一細胞は多くの低コピーＲＮＡを有することが示されており^{１３，１４}、そして検出限界は、多くの潜在的に未検出の低コピーＲＮＡが存在することを示し、これが生物学的現象のその後の解釈に影響を及ぼし得るからである。それゆえ、使用されるバーコードシステムが核酸定量の検出限界を決定するので、絶対的かつデジタルの定量のためのバーコードの有効性の調査は重大である。さらに、高コピー数種をカウントするバーコードの能力の同時の有効性もまた重要である。なぜなら、例えば、ランダム塩基バーコードが、数千個のウイルスＲＮＡを標識するために^１、そして高スループット単一細胞ＲＮＡ−Ｓｅｑの研究（ここで、バーコードは一回のシークエンスランにおいて個々の細胞を区別するために使用される）において数千の細胞を同定するために使用され得るからである^７。
核酸分子のデジタル定量の一般的な手順は以下のとおりである（図１のパネルＡ参照）。（ｉ）各々のＲＮＡ（または相補的ＤＮＡ若しくはｃＤＮＡ）またはＤＮＡを、多様な配列を含む外部から加えたＤＮＡ（分子バーコード）によって固有にタグ化する^１−３。（ｉｉ）バーコード付加されたＤＮＡまたはｃＤＮＡ（ＲＮＡから出発する場合ＲＮＡから生成される）を増幅する。（ｉｉｉ）バーコード付加され増幅された（ｃ）ＤＮＡの目的核酸配列およびバーコード配列の両方をタンデムにシークエンスする。（ｉｖ）理論的に提唱されているように^１５、増幅前の元の目的核酸（すなわち、増幅前ＲＮＡまたは（ｃ）ＤＮＡ）の絶対的コピー数を与えるために、各々の目的核酸（または遺伝子）について、増幅された分子の数（いわゆる「リード数」）ではなく固有のバーコードの数が定量される。このスキームによって、システムの測定の間の種々の工程において（例えば、増幅、シークエンス、および／または分析から）生成されるノイズおよび／またはバイアスの影響を除外することができる。デジタルカウントシステムが適切に機能することを確実にするために、各々の目的核酸分子が固有にタグ化されることが保証され（またはほぼ保証され）、固有の分子バーコードの測定される数が所定の目的核酸分子の数と等しくなるように多様なバーコード配列を使用しなければならない^{１６，１７}（下記の第１の要件）。また、正確なカウントのために十分なシークエンス深度が必要であると経験的に考えられている^{１８，１９}（下記の第２の要件）。
デジタルカウントスキームにおいて、代表的には以下の２つのタイプのバーコード設計が使用されている：配列限定バーコード（各々のバーコード配列は個別に設計される）および非配列限定バーコード（「ランダム塩基」バーコードと称することがある）。配列限定バーコードが以前に使用されたときに、正確な定量のために必要とされるバーコード配列の多様性が理論計算によって概算され^１６、そしてバーコード付加された分子の絶対的定量のためのこの技術のキャパシティが実験的に確認された^３，１６。しかし、配列限定バーコードの使用には以下のような不利益が存在する：高いダイナミックレンジの測定のためには多くの異なる個別に設計されたバーコード配列を調製しなければならず、これは費用対効果が良くない。カウントのダイナミックレンジを増加させながらコストを最小化するために、ランダム（または擬似ランダム）塩基バーコードが代わりに使用されている^{２，４−９，１１，１２，１８，２０}。この場合でも同様に、バーコードセットの配列多様性が十分であると決定すべきである^{１７，１８}。しかし、単に、配列限定バーコードとは異なり、シークエンスおよび／または増幅エラーに起因するバーコードにおける配列変化（これらのエラーの１つから新たに生成されるバーコード配列が偽陽性になり得る）^２１という理由で、この調査はささいなことではない。すなわち、エラーはサンプル中の分子数の過大評価を引き起こし得る（配列限定バーコードの場合、全ての使用されるバーコード配列は既知であり、このことは全ての未使用のバーコード配列もまた既知であって、エラーから生じる配列を同定しそして除外することができることを意味することに留意のこと）。この問題は、類似のバーコード配列は同じ元のバーコード配列を起源とするエラーを通じて生じるという合理的な仮定に基づいてコンピューター解析を使用してエラーを除外することによってアプローチされる。さらに、Sudberyらは最近、制限されたダイナミックレンジ（１００分子まで）についてのエラーのモデリングによるコンピューター解析に基づいてランダム塩基ＵＭＩ（分子バーコード）の有効性を示した^２２。しかし、正確なデジタルカウントのためのランダム塩基分子バーコードの有効性は、特に定量的な意味^７，２０および高いダイナミックレンジで、理論的モデルには存在しない影響を明白に含み得る実験に基づいては、明確に示されたわけではない。
ここで本発明者らは、特定のバーコード設計を使用するときに、および、コンピューター解析の後に、ランダム塩基分子バーコードを、バーコード付加されたＤＮＡ分子の絶対数のデジタル定量のために利用することができることを実験的に示す。様々な応用において変動し得るバーコード付加および／または逆転写のような他の影響を除外することによってバーコード自体の有効性を調査するために、本発明者らはバーコード配列を含むＤＮＡ分子を合成し、そして増幅分子についてのシークエンスによってそれを定量した（図１のパネルＡの点線の枠参照）。正確なデジタルカウントのために、本発明者らは上記２つの要件を定量的に調査した；（ｉ）所定の分子の数と比較して十分に多いバーコード配列のセットを使用すること（上記の通り）（図１のパネルＢ）、および（ｉｉ）所定の分子の数に比較して十分なシークエンス深度が達成されること（図１のパネルＣ）。次いで、本発明者らは、分子のインプット数および測定される分子のアウトプット数の両方が、２つの要件を満たすモデル測定システムを通じて一貫していることを実験的に示す。これら２つの要件を満たすために、すなわち、デジタルカウントシステムが機能することを確実にするために、本発明者らは、エラー検出のためにランダムバーコード配列内に固定塩基を導入し、インハウス開発したソフトウエアを使用したバーコード配列クラスタリングを実施し、そして分子バーコードからの情報を利用して、異なってインデックス付加されたサンプル間のクロスコンタミネーションおよびマッピングプロセスにおける目的核酸配列（鋳型）の誤同定を同定および除外した。本結果は、任意の所定のサンプル中のバーコード付加された核酸分子の正確な定量が、適切なバーコード設計（最小の必要とされるバーコード長を含む）および十分なシークエンス深度を通じて、高いダイナミックレンジで（１から１０^４超、潜在的には１０^１５分子まで）達成され得ることを示す。
以下、本実施例では、「ランダム」という用語を用いるが、この用語は、本実施例では、配列を設計することなく配列に莫大な多様性を確保するために実験者が無作為に合成したことを意味する。
［方法］
ライブラリーの調製
ランダム塩基を含む一本鎖ＤＮＡ鋳型をIntegrated DNA Technologies, Inc., Coralville, IA, USAから購入した（図１３参照）。各鋳型の濃度は、提供された仕様シート（Integrated DNA Technologies, Inc.）に記載された吸収係数を用い分光光度計（NanoDrop 1000; Thermo Fisher Scientific Inc., MA, USA）を使用して２６０ｎｍでの吸収により測定した。鋳型ＤＮＡは、0.1%（v/v）TWEEN20（Sigma-Aldrich, St. Louis, MO, USA）溶液中で50μMで-30℃で保存した。増幅用のDNA鋳型の濃度を調節するために、全ての鋳型は、水（蒸留水、脱イオン、滅菌、NIPPON GENE CO., LTD., Toyama, Japan）と0.1%TWEEN20で希釈し、下記の最終コピー数になるようPCRチューブ中で混合した。増幅は、25μLサンプル中で0.3μMの各プライマー（図１４参照）を用い、MightyAmp (TAKARA BIO INC., Shiga, Japan)を用いてPCRにより実施した。2本のチューブを50μMの鋳型ストックから独立して調製し、プライマーの一つの中に設計されたインデックスによって区別した（図１４参照）。熱サイクル（ProFlex PCR system; Themo Fisher Scientific Inc.）は、以下のように実施した：９８℃で２分の１サイクル；９８℃で１０秒、６０℃で１０秒、および６８℃で１分の４サイクル；９８℃で１０秒、６０℃で２秒、および６８℃で１分の１９サイクル；６８℃で５分の１サイクル；その後４℃でインキュベート。次いで、増幅産物を２回カラム精製した（DNA Clean & Concentrator^TM-5; Zymo Research Corp, CA, USA）し、増幅産物の長さ分布を2100 Bioanalyzer (Agilent Technologies, Inc., CA, USA)を用いて確認した。濃度をreal-time PCR system (7500; Themo Fisher Scientific Inc.)を使用してqPCR kit (KK4602; KAPA Biosystems, Inc., MA, USA)によって決定した。Biology in the modern big data era requires accurate quantification of biomolecules in system-wide measurements. Because the quality of the analysis is highly dependent on the initial raw data. For this reason, digital quantification of nucleic acid molecules using DNA tags (referred to as “primer ID” ¹ , “UMI (unique molecular identifier)” or “molecular barcode”) has hitherto been performed. Being developed. This technique includes gene expression analysis by RNA sequence (RNA-Seq) ^2-7 , iCLIP (individual-nucleotide resolution UV cross-linking and immunoprecipitation) ⁸ , antibody repertoire analysis ⁹ , bacterial 16S rRNA gene analysis ¹⁰ , ¹¹ , and ChIP. -Nexus (chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode and single ligation) ¹² has been used for many applications in next-generation sequencing platforms. 
These methods allow digitally accurate determination of the absolute number of molecules in a given sample, even in the presence of noise and / or bias in the measurement system. RNA-Seq using a molecular barcode, namely digital RNA-Seq (dRNA-Seq) ³ or quantitative RNA-Seq ¹³ is one of the most widely used applications of digital counting. dRNA-Seq works well even for small sample sizes and is often used for single cell gene expression analysis. Detection limits are important in such measurements. Because single cells have been shown to have many low copy RNAs ^13,14 , and the limit of detection indicates that there are many potentially undetected low copy RNAs, which are biological This can affect the subsequent interpretation of the phenomenon. Therefore, the investigation of the effectiveness of barcodes for absolute and digital quantitation is critical because the barcode system used determines the detection limit of nucleic acid quantification. Moreover, the simultaneous effectiveness of the bar code's ability to count high copy numbers is also important. Because, for example, random base barcodes have been used to label thousands of viral RNAs ¹ , and high-throughput single-cell RNA-Seq studies (where barcodes are used to label individual cells in a single sequencing run). Used to identify thousands of cells in ⁷⁾ .
The general procedure for digital quantification of nucleic acid molecules is as follows (see FIG. 1, panel A). (I) Uniquely tag each RNA (or complementary DNA or cDNA) or DNA with exogenously added DNA (molecular barcode) containing diverse sequences ^1-3 . (Ii) Amplify barcoded DNA or cDNA (generated from RNA if starting from RNA). (Iii) Sequence both the nucleic acid sequence of interest and the barcode sequence of the barcoded and amplified (c) DNA in tandem. (Iv) As has been theoretically proposed ¹⁵ , each nucleic acid of interest (or gene) to give the absolute copy number of the original nucleic acid of interest before amplification (ie, pre-amplification RNA or (c) DNA) ), The number of unique barcodes is quantified rather than the number of amplified molecules (the so-called “read number”). This scheme allows to exclude the effects of noise and / or bias generated at various steps during the measurement of the system (eg from amplification, sequencing, and / or analysis). To ensure that the digital counting system functions properly, each nucleic acid molecule of interest is guaranteed (or nearly guaranteed) to be uniquely tagged, and the measured number of unique molecular barcodes is Diverse barcode sequences must be used to equal the number of nucleic acid molecules of interest ^16,17 (first requirement below). In addition, it is empirically considered that a sufficient sequence depth is necessary for accurate counting ^18,19 (second requirement below).
Two types of barcode design are typically used in digital counting schemes: sequence-restricted barcodes (each barcode sequence is designed individually) and sequence-unrestricted barcodes (sometimes referred to as "random base" barcodes). For sequence-restricted barcodes, the diversity of barcode sequences required for accurate quantification was previously estimated by theoretical calculation^16, and the capacity of the technique for absolute quantification of barcoded molecules has been confirmed experimentally^3,16. However, sequence-restricted barcodes have the following disadvantage: measuring a high dynamic range requires preparing many different individually designed barcode sequences, which is not cost effective. Random (or pseudorandom) base barcodes have been used instead to reduce cost while increasing the dynamic range of counting^2,4-9,11,12,18,20. In this case as well, it must be established that the sequence diversity of the barcode set is sufficient^17,18. Unlike with sequence-restricted barcodes, however, this is not trivial, because a change in the barcode sequence caused by a sequencing and/or amplification error yields a new barcode sequence that can be counted as a false positive^21. That is, errors can cause the number of molecules in the sample to be overestimated. (With sequence-restricted barcodes, all barcode sequences in use are known, so all unused sequences are also known, and sequences resulting from errors can be identified and excluded.) This problem has been approached by using computational analysis to exclude errors, based on the reasonable assumption that similar barcode sequences arise through errors from the same original barcode sequence. Furthermore, Sudbery et al. recently demonstrated the efficacy of random base UMIs (molecular barcodes) by computational analysis with error modeling over a limited dynamic range (up to 100 molecules)^22. However, the effectiveness of random base molecular barcodes for accurate digital counting has not been shown explicitly by experiment, which can include effects that are absent from theoretical models, particularly in a quantitative sense^7,20 and over a high dynamic range.
Here, we show experimentally that random base molecular barcodes can be used for digital quantification of the absolute number of barcoded DNA molecules when a particular barcode design is used together with computational analysis. To examine the effectiveness of the barcode itself while excluding other effects that may vary between applications, such as barcode addition and/or reverse transcription, we synthesized DNA molecules containing barcode sequences and quantified them by sequencing the amplified molecules (see FIG. 1, panel A, dashed box). For accurate digital counting, we quantitatively examined the two requirements above: (i) a set of barcode sequences that is sufficiently large relative to the given number of molecules (FIG. 1, panel B), and (ii) a sequencing depth that is sufficient relative to the given number of molecules (FIG. 1, panel C). We then show experimentally that the number of input molecules and the number of measured output molecules agree in a model measurement system that meets both requirements. To meet these requirements, that is, to ensure that the digital counting system works, we introduced fixed bases within the random barcode sequence for error detection, clustered barcode sequences using software we developed, and used the information from the molecular barcodes to identify and exclude cross-contamination between differently indexed samples and misidentification of the target nucleic acid sequence (template) in the mapping process. The results show that accurate quantification of barcoded nucleic acid molecules in any given sample can be achieved over a high dynamic range (from 1 to more than 10^4, and potentially up to 10^15 molecules) through appropriate barcode design (including the minimum required barcode length) and sufficient sequencing depth.
Hereinafter, the term "random" as used in this example means that the sequences were synthesized randomly, without being individually designed by the experimenter, in order to secure a large diversity of sequences.
[Method]
Library preparation
Single-stranded DNA templates containing random bases were purchased from Integrated DNA Technologies, Inc. (Coralville, IA, USA) (see FIG. 13). The concentration of each template was measured by absorbance at 260 nm using a spectrophotometer (NanoDrop 1000; Thermo Fisher Scientific Inc., MA, USA) and the extinction coefficient given in the supplied specification sheet (Integrated DNA Technologies, Inc.). Template DNA was stored at −30 °C at 50 μM in 0.1% (v/v) TWEEN20 (Sigma-Aldrich, St. Louis, MO, USA) solution. To control the concentration of DNA template for amplification, all templates were diluted with water (distilled, deionized, sterilized; NIPPON GENE CO., LTD., Toyama, Japan) and 0.1% TWEEN20, and mixed in PCR tubes to the final copy numbers. Amplification was carried out by PCR with MightyAmp (TAKARA BIO INC., Shiga, Japan) using 0.3 μM of each primer (see FIG. 14) in a 25 μL sample. Two tubes were prepared independently from the 50 μM template stock and distinguished by the index designed into one of the primers (see FIG. 14). Thermal cycling (ProFlex PCR System; Thermo Fisher Scientific Inc.) was performed as follows: 98 °C for 2 min, 1 cycle; 98 °C for 10 sec, 60 °C for 10 sec, and 68 °C for 1 min, 4 cycles; 98 °C for 10 sec, 60 °C for 2 sec, and 68 °C for 1 min, 19 cycles; 68 °C for 5 min, 1 cycle; followed by incubation at 4 °C. The amplification product was then column purified twice (DNA Clean & Concentrator™-5; Zymo Research Corp, CA, USA), and its length distribution was confirmed using a 2100 Bioanalyzer (Agilent Technologies, Inc., CA, USA). The concentration was determined with a qPCR kit (KK4602; KAPA Biosystems, Inc., MA, USA) on a real-time PCR system (7500; Thermo Fisher Scientific Inc.).

Sequencing
The two indexed samples (CGCTCATT: index A; GAGATTCC: index B) were sequenced in a single run on a MiSeq sequencer (Illumina, Inc.) using a 150-cycle kit v3 (Read 1: 100 cycles; Read 2: 50 cycles; Index 1: 8 cycles). Read 2 was not used in the analysis because the sequence in Read 2 is part of the sequence in Read 1. The raw sequencing data used for the analysis were deposited in the GEO database under accession GSE94895.

Analysis
Read 1 sequences were sorted by index (A and B), and a fastq file for each index was generated by the MiSeq. In some cases, 100%, 32%, 10%, 3.2%, 1%, 0.32%, and 0.1% of the reads were randomly sampled. The MiSeq fastq files were filtered by sequence length (≥34 bp and ≤39 bp for short templates, and ≥90 bp for long templates). Reads were aligned to the target nucleic acid sequences separately for long templates (LT) and short templates (ST) using Bowtie2 v.2.2.9^27, with the target nucleic acid sequences of the 11 templates as references (see "target" in FIG. 13). By default, only uniquely mapped reads were used in the subsequent analysis. The barcode region, which is the first 50 bp from the 5' end for long templates and the first 30 bp from the 5' end for short templates (see "barcode" in FIG. 13), was extracted from the mapped reads. The fixed bases in the barcode region (up to 6 bases for short templates and up to 12 bases for long templates; see "barcode" in FIG. 13) were used for filtering: barcodes with a mismatch at one or more fixed bases were excluded. The in-house software Nucleotide Sequence Clusterizer was then used to cluster the barcodes at Distance = 0, 1, 2, or 3. The number of clusters was taken to be the number of molecules before amplification. When index cross-contamination was taken into account, reads with indexes A and B were merged before clustering. In the latter case, multiply mapped reads were also used in the subsequent analysis. After clustering, if a cluster contained reads with more than one index, the minority reads were excluded. If the numbers of index A reads and index B reads were equal, a coefficient of 0.5 was assigned to each of index A and index B. Similarly, when misalignment was also taken into account, all reads mapped to a template with index A or index B were merged before clustering. When one read mapped to multiple templates, a coefficient of 1/(number of different templates) was assigned to each template. After clustering, if a cluster contained reads mapped to more than one target nucleic acid and/or carrying more than one index, the minority reads were excluded. If the numbers of reads for the different templates and/or indexes were equal, a coefficient of 1/(number of different templates and/or indexes) was assigned to each of the tied target nucleic acids and/or indexes. The number of reads at each step is shown in FIG. 15.
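The minority-exclusion and tie rules described above can be illustrated with a minimal Python sketch. It is not the software used in this example; the function names and the toy input are hypothetical, and each cluster is assumed to be given simply as a list with one label (an index, or an index and template pair) per read.

```python
from collections import Counter

def assign_cluster(read_labels):
    """Return {label: weight} for one barcode cluster.

    read_labels holds one label per read in the cluster, for example the
    index ("A" or "B") or an (index, template) pair. Hypothetical helper.
    """
    counts = Counter(read_labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    # Minority labels get no weight; tied labels share the cluster equally,
    # i.e. each gets 1 / (number of tied labels).
    return {label: 1.0 / len(winners) for label in winners}

def count_molecules(clusters):
    """Sum the per-cluster weights to get a molecule count per label."""
    totals = Counter()
    for read_labels in clusters:
        for label, weight in assign_cluster(read_labels).items():
            totals[label] += weight
    return dict(totals)

# Toy input: three clusters, the last one tied between index A and index B.
clusters = [["A", "A", "A", "B"], ["B", "B"], ["A", "B"]]
print(count_molecules(clusters))  # {'A': 1.5, 'B': 1.5}
```

In this toy input the first cluster is counted for index A, the second for index B, and the tied third cluster contributes 0.5 to each.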

Nucleotide Sequence Clusterizer
For clustering, we wrote in-house software named "Nucleotide Sequence Clusterizer" in C. The tool clusters DNA sequences using specified nucleotide positions in each sequence. It performs bounded single-link clustering: initially, each sequence is in its own cluster. If any two sequences differ from each other by no more than D mismatches, their clusters are merged, where D is the configurable "Distance" parameter. This process continues until no more clusters can be merged, at which point Nucleotide Sequence Clusterizer reports the number of clusters and the sequences in each cluster. Nucleotide Sequence Clusterizer is available on request.
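Bounded single-link clustering as described above can be sketched in a few lines of Python. This is an illustration only and not the Nucleotide Sequence Clusterizer itself (which is written in C); it assumes equal-length barcodes, counts mismatches as Hamming distance, and uses a simple all-pairs comparison with a union-find structure, which would be too slow for millions of reads. The names `hamming` and `cluster_barcodes` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster_barcodes(barcodes, distance):
    """Single-link clustering: merge clusters whenever any two sequences
    differ by no more than `distance` mismatches."""
    seqs = sorted(set(barcodes))
    parent = list(range(len(seqs)))  # union-find forest, one node per sequence

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= distance:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())

barcodes = ["AAAA", "AAAT", "AATT", "GGGG"]
print(len(cluster_barcodes(barcodes, 0)))  # 4
print(len(cluster_barcodes(barcodes, 1)))  # 2
```

With Distance = 1, AAAA and AATT end up in the same cluster even though they differ at two positions, because each is within one mismatch of AAAT. This chaining is the defining property of single-link clustering.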

In this example, we examined whether a digital counting system for nucleic acids using random base barcodes can accurately measure the absolute number of DNA molecules in a sample. As shown in FIG. 13, two broad classes of template DNA were designed: six long templates (LT1 to LT6) and five short templates (ST1 to ST5).

As shown in FIG. 13, the nucleic acid molecules LT1 to LT6 were designed to have, from the 5' end toward the 3' end:
sequence of SEQ ID NO: 1 - barcode sequence - target nucleic acid sequence - sequence of SEQ ID NO: 2
The barcode sequences and target nucleic acid sequences of LT1 to LT6 are shown in SEQ ID NOs: 5 to 16.
Likewise, as shown in FIG. 13, the nucleic acid molecules ST1 to ST5 were designed to have, from the 5' end toward the 3' end:
sequence of SEQ ID NO: 3 - barcode sequence - target nucleic acid sequence - sequence of SEQ ID NO: 4
The barcode sequences and target nucleic acid sequences of ST1 to ST5 are shown in SEQ ID NOs: 17 to 26.

All of these template DNAs contain a random base barcode, shown as the molecular barcode group in FIG. 1, panel A. The long templates have a 50-base target nucleic acid sequence downstream of a 50-base barcode consisting of 38 random bases and 12 fixed bases, and the short templates have an 8-base target nucleic acid sequence downstream of a 30-base barcode consisting of 24 random bases and 6 fixed bases (see FIG. 13). All templates also contain common sequences at both the 5' and 3' ends for PCR amplification (see FIGS. 13 and 14). In this example, two identical model measurement samples were prepared, each containing 40,000, 40,000, 4,000, 300, 100, and 20 copies of LT1, LT2, LT3, LT4, LT5, and LT6, respectively, 20,000 copies of ST1 and ST2, and 4,000 copies of ST3, ST4, and ST5. The templates in these two samples, distinguished by two different indexes (index A and index B), were amplified, and the amplification products were sequenced on a MiSeq, yielding 11,992,843 reads for index A and 15,373,718 reads for index B (see FIG. 15).

In this example, the index A and index B sequences were added to the templates by including them in the reverse primers for amplification (see FIG. 14).
Sequence of the reverse primer for amplification with index A (Rv primer in FIG. 14):
CAAGCAGAAGACGGCATACGAGATAATGAGCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (SEQ ID NO: 28)
Sequence of the reverse primer for amplification with index B (Rv primer2 in FIG. 14):
CAAGCAGAAGACGGCATACGAGATGGAATCTCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (SEQ ID NO: 29)
In the nucleic acid sequence of SEQ ID NO: 28, the underlined portion (AATGAGCG) corresponds to the nucleic acid sequence of index A, and in the nucleic acid sequence of SEQ ID NO: 29, the underlined portion (GGAATCTC) corresponds to the nucleic acid sequence of index B.
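The index sequences listed in the sequencing section (CGCTCATT for index A and GAGATTCC for index B) appear in the reverse primers as their reverse complements. The short Python check below illustrates this relationship using only the sequences given in this example; the variable and function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Reverse primers as given in SEQ ID NOs: 28 and 29.
RV_PRIMER_A = "CAAGCAGAAGACGGCATACGAGATAATGAGCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"
RV_PRIMER_B = "CAAGCAGAAGACGGCATACGAGATGGAATCTCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

# Index sequences as listed for the sequencing run.
INDEX_READS = {"A": "CGCTCATT", "B": "GAGATTCC"}

for name, primer in (("A", RV_PRIMER_A), ("B", RV_PRIMER_B)):
    embedded = reverse_complement(INDEX_READS[name])
    # Report the 8-base index as it appears in the primer and whether it is found.
    print(name, embedded, embedded in primer)
```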

All reads were then sorted by index (A and B) by the MiSeq, and the sorted reads were mapped to a reference consisting of the target nucleic acid sequences. The number of molecules for each index and template was quantified digitally by counting the number of unique barcodes (or the number of barcode clusters) instead of the number of sequenced reads (that is, the number of amplified molecules).
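The difference between counting reads and counting unique barcodes can be shown with a small Python sketch. The input format, one (index, template, barcode) tuple per mapped read, and the function name are assumptions made for illustration; no error correction is applied here.

```python
from collections import defaultdict

def digital_counts(reads):
    """Return {(index, template): (read_count, unique_barcode_count)}.

    reads: iterable of (index, template, barcode) tuples, one per read.
    """
    read_counts = defaultdict(int)
    barcodes = defaultdict(set)
    for index, template, barcode in reads:
        key = (index, template)
        read_counts[key] += 1
        barcodes[key].add(barcode)
    return {key: (read_counts[key], len(barcodes[key])) for key in read_counts}

# Toy data: five reads for (A, ST1) that come from only two barcoded molecules.
reads = [
    ("A", "ST1", "ACGT"), ("A", "ST1", "ACGT"), ("A", "ST1", "ACGT"),
    ("A", "ST1", "TTGA"), ("A", "ST1", "TTGA"),
    ("B", "ST1", "CCAT"),
]
print(digital_counts(reads))
# {('A', 'ST1'): (5, 2), ('B', 'ST1'): (1, 1)}
```

The read count (5) reflects amplification, whereas the unique barcode count (2) estimates the number of molecules before amplification.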

Next, we examined the two requirements for accurate digital quantification in the presence of errors: how many random bases are needed in the barcode to count a given number of molecules in a sample, and how many reads per molecule (defined as "coverage") are needed (FIGS. 2 and 8). For the first requirement, the number of random bases in each template was varied computationally (4 to 38 bases for LT and 4 to 24 bases for ST), and the number of unique barcodes was determined for each sorted index and template (FIG. 2, panel A and FIG. 8, panel A; gray lines). The number of unique barcodes increased dramatically as the number of random bases in the barcode increased, which suggests that a certain minimum number of random bases is needed to quantify a given number of molecules. We expected a plateau at 20,000, because even if the number of possible barcode sequences is increased artificially (by increasing the barcode length), the measured number of original target nucleic acid molecules should not exceed the original copy number of 20,000. However, the expected plateau was not observed in the region of larger random base numbers; the number of unique barcodes increased monotonically with the number of random bases. For the second requirement, the sequencing coverage was varied computationally by randomly excluding a portion of the reads, and the number of unique barcodes was determined from the remaining reads for each index and template (FIG. 2, panel C and FIG. 8, panel C; gray lines). If the digital counting scheme is working, then once coverage reaches a sufficient level, the number of unique barcodes identified should not depend on coverage (sequencing depth), so a plateau should appear in these plots. We again expected the plateau at 20,000, because increasing the sequencing depth (that is, the number of times each barcode is read) should not increase the measured number of original target nucleic acid molecules beyond the original copy number of 20,000. However, this plateau was not observed either; the number of unique barcodes increased monotonically with coverage. This suggests that the digital counting system needs improvement under these conditions.
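Why a minimum number of random bases is needed, and why a plateau at the input copy number would be expected in an error-free system, can be illustrated with a standard occupancy calculation. The sketch below is an idealization that is not part of this example: it assumes every molecule receives a barcode drawn uniformly and independently from all 4^n possible sequences and that no errors occur. The function name is hypothetical.

```python
import math

def expected_unique_barcodes(n_molecules, n_random_bases):
    """Expected number of distinct barcodes among n_molecules tagged molecules.

    Idealized model: m = 4**n_random_bases equally likely barcodes, no errors.
    Computes m * (1 - (1 - 1/m) ** n_molecules); log1p/expm1 keep the result
    accurate when m is very large.
    """
    m = 4 ** n_random_bases
    return -m * math.expm1(n_molecules * math.log1p(-1.0 / m))

for n in (4, 8, 12, 24):
    print(n, round(expected_unique_barcodes(20_000, n), 1))
```

Under these assumptions, 4 random bases can distinguish at most 256 molecules, whereas with 12 or more random bases the expected count for 20,000 input molecules is already within a few tens of 20,000 and stays there as the barcode grows longer. An observed count that keeps rising with barcode length therefore points to barcode sequences created by errors.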

The absence of a plateau in these figures (FIG. 2, panels A and C, and FIG. 8, panels A and C) can be explained by base changes, such as substitution errors and insertion-deletion (indel) errors, in the finally sequenced barcode output relative to the actual barcode input. To exclude substitution errors (probably due to sequencing errors and/or polymerase amplification errors), barcode sequences were clustered using the in-house software Nucleotide Sequence Clusterizer. The clustering procedure uses a parameter called "Distance", which is the number of bases that differ between two given barcode sequences. For example, if one barcode sequence is identical to another except for two base changes at any two positions, the Distance between the two is 2. Therefore, after clustering with Distance = 2, every barcode sequence in a given cluster is within Distance = 2 of at least one other barcode sequence in that cluster (a given member is not necessarily within Distance = 2 of all other sequences in the cluster). In essence, the original analysis, which counted unique barcodes without clustering, corresponds to clustering with Distance = 0. The number of barcode clusters was determined at Distance = 0, 1, 2, or 3 (see FIG. 3, panel A and FIG. 9, panel A). If the barcodes are sufficiently diverse for the given number of molecules, the number of barcode clusters is expected to approach a constant value as Distance increases, and this tendency was indeed observed. To observe the effect of clustering on the two requirements for accurate digital quantification, the number of barcode clusters determined with the largest Distance used (Distance = 3) was plotted as a function of the number of random bases (FIG. 2, panel A and FIG. 8, panel A; blue lines) and of coverage (FIG. 2, panel C and FIG. 8, panel C; blue lines). Both plots showed a more plateau-like curve, but the number of barcode clusters still increased monotonically, particularly with increasing coverage.

Next, we attempted to exclude the effect of insertion-deletion (indel) errors by excluding sequenced reads that contain a mismatched base at a fixed-base position of the barcode sequence (see FIG. 13). A mismatched base at any of the fixed-base positions in the barcode output indicates an insertion or deletion of a base elsewhere in the barcode sequence, which shifts the remaining bases out of the "reading frame" defined by the fixed-base positions. To examine the effect of this process on the digital counting system, we examined the dependence on the position of the fixed base within the barcode sequence. The number of barcode clusters was determined when a single fixed base was used for the exclusion procedure (see FIG. 3, panel B and FIG. 9, panel B). The number of barcode clusters decreased as the fixed base was located farther from the sequencing primer site. This is reasonable, because a fixed-base mismatch can detect indel-type sequence changes that occur between the sequencing start site and the position of the fixed base. We also analyzed how the number of barcode clusters depends on the number of fixed bases used; here, the fixed bases located farthest from the sequencing primer site were used (see FIG. 3, panel C and FIG. 9, panel C). The number of barcode clusters decreased significantly while the number of fixed bases used was small, and became almost constant as the number of fixed bases used increased. To observe the effect of mismatch exclusion on the two requirements for accurate digital quantification, the number of barcode clusters was plotted as a function of the number of random bases (FIG. 2, panel A and FIG. 8, panel A; green lines) and of coverage (FIG. 2, panel C and FIG. 8, panel C; green lines). The mismatch exclusion was performed with the largest number of fixed bases used (6 for short templates and 12 for long templates). As a result, both plots (FIG. 2, panels A and C; FIG. 8, panels A and C) showed plateau-like curves, indicating that excluding indel-type errors using fixed bases made the digital quantification more accurate.
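The fixed-base filter can be sketched as follows. The positions and bases in `FIXED` are hypothetical values chosen for a 10-base toy barcode, not the design of FIG. 13, and the function name is likewise an assumption.

```python
def passes_fixed_bases(barcode, fixed):
    """True if the barcode carries the expected base at every fixed position.

    fixed maps a 0-based position to the expected base (hypothetical values).
    """
    return all(pos < len(barcode) and barcode[pos] == base
               for pos, base in fixed.items())

FIXED = {3: "A", 7: "C"}  # hypothetical fixed bases for a 10-base toy barcode

reads = [
    "GTTAGCGCTT",  # intact barcode
    "GTAGCGCTTG",  # one base deleted near the 5' end: later bases shift left
    "GTTAGGGCTT",  # substitution at a random-base position
]
print([passes_fixed_bases(r, FIXED) for r in reads])  # [True, False, True]
```

The deletion is detected because it shifts a different base into a fixed position, whereas the substitution at a random position passes the filter and is left to the clustering step. An indel can escape this filter when the shifted base happens to match the expected fixed base, which is one reason why using more fixed bases helps.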

As a separate problem, we found that cross-contamination between samples occurs. This is thought to cause the slight increase in the number of observed clusters in the plateau-like phase of the green lines in FIG. 2, panel C and FIG. 8, panel C. The two samples were each labeled with a different index (index A and index B) by the amplification primers during PCR, amplified by PCR in two separate tubes, and then sequenced together. When barcodes were clustered using both index A and index B, a small fraction of barcode clusters was found to contain both indexes. This has also been reported by Jaitin et al.^5 Because the barcoded templates for PCR amplification were drawn at random from the original template pool, this could in principle have occurred without cross-contamination. However, even for the short templates there are theoretically 2.8 × 10^14 (= 4^24) possible barcode sequences, so the probability that original templates with exactly the same barcode were added to both amplification tubes is considered very small. Therefore, one or more of the following is thought to have occurred: a PCR primer containing a particular index contaminated a tube, the index sequence contained an error, and/or index switching occurred during sequencing (Sinha, R. et al., bioRxiv, 10.1101/125724 (2017)). To remove this effect, all reads sorted into the two indexes were first pooled for each template, and clustering was performed on the pooled reads. When multiple indexes (here, two) were found within one barcode cluster, the cluster was counted as belonging to the index with the largest number of sequenced reads. With this process, the number of clusters finally showed a plateau as a function of coverage (see the yellow lines in FIG. 2, panel C and FIG. 8, panel C). Importantly, whereas the blue lines in FIG. 2, panel C and FIG. 8, panel C showed a slight increase in the number of clusters as coverage increased, the number of clusters after this process, which excludes the effect of index switching, showed a plateau even as coverage increased.
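The statement that identical barcodes are unlikely to be present in both tubes can be made concrete with a rough estimate. The sketch below assumes, as an idealization not stated in this example, that barcodes are drawn uniformly and independently from all 4^24 sequences; under that assumption the expected number of cross-tube pairs sharing a barcode is the product of the two molecule counts divided by the number of possible barcodes. The function name is hypothetical.

```python
def expected_shared_barcodes(n_tube_a, n_tube_b, n_random_bases):
    """Expected number of (tube A molecule, tube B molecule) pairs that carry
    the same barcode, assuming uniform, independent random barcodes."""
    return n_tube_a * n_tube_b / 4 ** n_random_bases

# 20,000 copies of one short template in each tube, 24 random bases.
print(expected_shared_barcodes(20_000, 20_000, 24))  # about 1.4e-06
```

An expectation on the order of one in a million coincidences per template is far too small to account for an observable fraction of clusters containing both indexes, which is consistent with attributing those clusters to contamination, index errors, or index switching.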

Because the same barcode sequence is not expected to be used for both indexes, the number of clusters was plotted for the sum of indexes A and B to check the first requirement for accurate digital quantification (see the yellow lines in FIG. 2, panel B and FIG. 8, panel B). A plateau was still present, so the number of random bases used was within the acceptable range for accurate digital quantification.

The above example shows that "index switching", which can occur when multiple samples are pooled and analyzed, affects the accuracy of barcode clustering, and that the process of excluding index switching (mis-indexing) improves accuracy and enables a digital quantification system whose accuracy is not affected by coverage.

Having found cross-contamination between samples, we next examined misidentification in the process of mapping reads to the reference, following the same procedure as for the index problem. All reads that were sorted into the two indexes and mapped to any template were pooled, and clustering was performed on the pooled reads. When multiple templates and/or multiple indexes were found within one barcode cluster, the cluster was counted for the template and index with the largest number of sequenced reads. However, this process produced no significant difference in the number of clusters as a function of coverage (see the red dotted lines in FIG. 2, panel C and FIG. 8, panel C). This suggests that misidentification does not occur very often in this system. Because the same barcode sequence is not expected to be used for both indexes and all templates, the number of clusters was plotted for the sum over index A, index B, and all templates to check the first requirement for accurate digital quantification (see the red dotted lines in FIG. 2, panel B and FIG. 8, panel B). A plateau was still present, so the number of random bases used was within the acceptable range for accurate digital counting even when template misidentification is accounted for. Although the effect of misidentification was small in this example, this process becomes important in analyses that use a larger number of references (for example, RNA-Seq), where misidentification can occur more frequently.

To better understand what happens in the above analysis, a coverage histogram was generated for each process (FIG. 10). The histogram for the count of unique barcodes (without any of the above processing) had a large peak consisting mainly of low-read clusters. These low-read clusters artificially increase the output copy number of the target nucleic acid sequence measured by this digital counting method (through artificially generated barcode sequences that are absent from the original sample and arise from sequencing errors, indel errors, and the like), and they must be excluded for the system to quantify more accurately (the two requirements above). This peak decreased dramatically after the first two processing steps, which suggests that barcode sequences generated mainly by sequencing errors were excluded by these steps.

With the four specific templates (ST1, ST2, LT1 and LT2), the barcode design and computational analysis described above were shown to satisfy the two requirements for accurate digital counting (see panels A to C of FIG. 2 and panels A to C of FIG. 8). Next, the parameters were optimized and the analyses were applied to all templates, which covered a wide range of copy numbers from 20 to 40,000. Because the number of determined clusters was already approaching a constant value at Distance = 2 (see panel A of FIG. 3 and panel A of FIG. 9), Distance = 2 was used in the subsequent analyses. As for the number of fixed bases, the number of determined clusters approached a constant value when four fixed bases were used (see panel C of FIG. 3 and panel C of FIG. 9), so the number of fixed bases was set to 4 (for all templates, barcodes in which the 16th, 21st, 24th and 28th positions from the left are fixed bases (FIG. 13) were used). Index cross-contamination and template misidentification were also taken into account. Using all of the above quantitative analyses and insights, it is considered that the target nucleic acid molecules can be quantified accurately with the digital counting scheme of the present invention. Under these conditions, the two requirements were examined for all templates and the dynamic range of this digital counting system was determined (see panels A and B of FIG. 4 and FIG. 11). For the coverage dependence, 20 random bases were used for clustering (panel A of FIG. 4 and FIG. 11); for the dependence on the number of random bases, 10% of the original total reads were used in the analysis (panel B of FIG. 4). These settings were chosen because the initial analysis of the four original templates indicated that both parameters should still work (panels A and C of FIG. 2 and panels A and C of FIG. 8). When less than 100% of the reads were used for analysis, reads were selected at random and the process was repeated eight times to obtain the mean and standard deviation (FIGS. 4A to 4C and FIG. 11). As shown in FIGS. 4A to 4C and FIG. 11, a plateau was present for all templates in the plots as a function of the number of random bases and of coverage. This suggests that the selected parameters enable accurate digital quantification of templates over a wide range of copy numbers.
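To make the two parameters concrete, the following sketch checks the fixed bases at positions 16, 21, 24 and 28 and groups barcodes whose substitution distance is at most 2. It is an illustration under stated assumptions only: the identities of the fixed bases are not given in this passage (they are in FIG. 13), so the `fixed` mapping is hypothetical, and the greedy grouping shown here is a simple stand-in, not necessarily the clustering procedure used in the example.

```python
from collections import Counter

# 1-based fixed-base positions taken from the text.
FIXED_POSITIONS = (16, 21, 24, 28)


def has_fixed_bases(barcode, fixed):
    """True if every fixed base is present at its expected position.

    `fixed` maps a 1-based position to the expected base. The base
    identities are hypothetical; this passage does not list them.
    """
    return all(
        len(barcode) >= pos and barcode[pos - 1] == base
        for pos, base in fixed.items()
    )


def hamming(a, b):
    """Number of substituted positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(barcodes, max_distance=2):
    """Group barcodes around the most abundant sequences (assumed scheme).

    Barcodes are visited from most to least frequent; each one joins the
    first existing center within `max_distance` substitutions, otherwise
    it becomes a new center. Returns {center_barcode: total_reads}.
    """
    centers = {}
    for bc, n in Counter(barcodes).most_common():
        for center in centers:
            if len(center) == len(bc) and hamming(center, bc) <= max_distance:
                centers[center] += n
                break
        else:
            centers[bc] = n
    return centers


if __name__ == "__main__":
    fixed = {pos: "G" for pos in FIXED_POSITIONS}  # hypothetical bases
    bases = ["A"] * 30
    for pos in FIXED_POSITIONS:
        bases[pos - 1] = "G"
    intact = "".join(bases)
    deleted = intact[1:]  # a one-base deletion shifts every fixed base
    print(has_fixed_bases(intact, fixed), has_fixed_bases(deleted, fixed))
    print(greedy_cluster(["AAAAAA"] * 5 + ["AAAAAT"] + ["CCCCCC"] * 3))
```

A barcode with an insertion or deletion upstream of a fixed position fails the check because the fixed bases move out of register, which is the property the hybrid barcode design relies on.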

The number of barcodes determined in panel A of FIG. 4 and in FIG. 11 at a coverage of 12.6 to 20.9 (when 10% of the reads were sampled) corresponded to the number of templates contained in the sample tube before PCR amplification. Using these values, the input number of molecules determined by optical density was compared with the output number of molecules determined by the digital counting method of the present invention (see panel C of FIG. 4). The input and output numbers of molecules were highly correlated (Pearson's product-moment correlation coefficient r = 0.990). The output/input ratio ranged from 0.32 to 0.45 for the long templates (LT) and from 0.41 to 0.57 for the short templates (ST), which can be explained by experimental error (for example, statistical error in the up to seven template dilution steps performed in preparation for PCR amplification). This suggests that the digital counting scheme based on the parameters presented in this example can quantify the absolute copy number of nucleic acid molecules before PCR amplification.
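For readers who want to reproduce this kind of comparison on their own data, the two statistics mentioned above can be computed as in the sketch below. The function names are introduced here for illustration, and the numbers in the usage section are made-up placeholders, not the measurements behind panel C of FIG. 4. Whether the reported correlation was computed on raw or log-transformed counts is not stated in this passage, so the sketch uses raw values.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def output_input_ratios(inputs, outputs):
    """Per-template ratio of counted (output) to expected (input) molecules."""
    return [o / i for i, o in zip(inputs, outputs)]


if __name__ == "__main__":
    # Placeholder values only; replace with measured input and output counts.
    inputs = [20, 200, 2000, 20000]
    outputs = [9, 84, 900, 8200]
    print(pearson_r(inputs, outputs))
    print(output_input_ratios(inputs, outputs))
```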

Based on these results, the number of random bases required to count the absolute number of molecules in the presence of errors can be presented (see panel A of FIG. 5). The x-axis shows the input number of molecules to be measured, and the y-axis shows the number of random bases at which each curve in panel B of FIG. 4 and panel B of FIG. 5 reaches a relative cluster number of 0.95. Panel B of FIG. 5 shows the dependence of the relative cluster number on the number of random bases, as in panel B of FIG. 4, except that the misidentification exclusion process (which had no significant effect on the number of clusters) was not applied to each template. These data were included in panel A of FIG. 5 to provide more data in the lower range of molecule numbers. The results show, for example, that at least 16 random bases are required to quantify about 10^5 molecules with an accuracy above 95%.
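As a point of comparison for the 0.95 threshold, the sketch below computes an idealized, error-free lower bound: if each of N molecules receives one of 4^n equally likely barcodes, the expected number of distinct barcodes is D(1 - (1 - 1/D)^N) with D = 4^n. This model is an assumption introduced here and is not the analysis in the example. It ignores sequencing, indel and clustering effects, so it yields fewer bases than the experimentally derived requirement quoted above.

```python
import math


def expected_distinct(n_molecules, n_random_bases):
    """Expected number of distinct barcodes under uniform random labelling.

    Idealized model (assumption): every molecule draws one of 4**n barcodes
    independently and uniformly, with no sequencing or indel errors.
    """
    diversity = 4 ** n_random_bases
    # D * (1 - (1 - 1/D)**N), written with log1p/expm1 for numerical accuracy.
    return diversity * -math.expm1(n_molecules * math.log1p(-1.0 / diversity))


def min_random_bases(n_molecules, threshold=0.95, max_bases=38):
    """Smallest n for which expected distinct barcodes / N >= threshold."""
    for n in range(1, max_bases + 1):
        if expected_distinct(n_molecules, n) / n_molecules >= threshold:
            return n
    raise ValueError("threshold not reached within max_bases")


if __name__ == "__main__":
    for molecules in (10**3, 10**5, 10**7):
        print(molecules, min_random_bases(molecules))
```

The gap between this bound and the measured requirement reflects the extra barcode length needed to keep error-derived barcodes separable from genuine ones.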

Experimentally, up to 84,420 molecules (the total number of input LT molecules) were accurately quantified using 20 random bases (panel B of FIG. 4). This number is considered sufficient, for example, to count the number of RNA molecules for individual genes in transcriptome analysis. In practice, the measurable number of molecules is limited by the capacity of the MiSeq sequencer. A simple linear regression on the experimentally measured data for the number of random bases required as a function of the number of molecules (see FIG. 12) suggests that, with up to 38 random bases, about 10^15 molecules can be quantified with the measurement system of the present invention. This dynamic range far exceeds the current capacity of commercially available deep sequencers. The bottleneck for quantitative analysis with a wide dynamic range is therefore no longer the barcode design but the sequencing throughput.

As described above, this example demonstrated that the number of molecules in a given sample can be quantified by digital counting using a designed hybrid molecular barcode containing random bases and fixed bases, provided that an appropriate barcode design, sufficient sequencing depth and an analysis method with appropriate parameters are used. This makes it possible to measure the number of nucleic acid molecules over a wide dynamic range and at low cost. Based on these results, the numbers of random and fixed bases needed to count a given number of barcoded molecules in the presence of errors can be suggested (panel A of FIG. 5 and FIG. 12). This example also demonstrated quantitatively a further functional advantage of molecular barcodes: they were used to identify sample cross-contamination (caused by physical carry-over of primers, errors in the index and/or index switching during sequencing) and misidentification of the target nucleic acid sequence in the alignment process. Indeed, as noted above, the former can address a serious problem reported for next-generation sequencer platforms^{23,24}. Although the impact of errors may depend on the library preparation and/or the sequencing platform, the effectiveness of random-base barcodes has been shown in a general application, and the strategy presented here for validating barcode use is applicable to various platforms. Furthermore, since the effectiveness of random-base barcodes for barcoded molecules has been shown, a person skilled in the art can evaluate the effect or effectiveness of barcoding, which may differ from application to application. The present invention can be used widely in digital counting methods for nucleic acid quantification using molecular barcodes, not only for counting molecules in gene expression analysis, iCLIP^8, antibody repertoire analysis^9, bacterial 16S rRNA gene analysis^{10,11} and ChIP-nexus^12, but also for cells^{9,25,26}, viruses^1 and other applications that use barcodes. Recently, these applications can be performed with commercially available instruments such as the Single Cell Sequencing Solution (Illumina, Inc., CA, USA and Bio-Rad Laboratories, Inc., CA, USA) and the Chromium Single Cell 3' Solution (10x Genomics, Inc., CA, USA). We believe that systems biology will be advanced by the large amounts of quantitative data obtained experimentally.

Contents of Sequence Listing
SEQ ID NO: 1: base sequence of the 5' region of LT1 to LT6
SEQ ID NO: 2: base sequence of the 3' region of LT1 to LT6
SEQ ID NO: 3: base sequence of the 5' region of ST1 to ST5
SEQ ID NO: 4: base sequence of the 3' region of ST1 to ST5
SEQ ID NO: 5: barcode sequence of LT1
SEQ ID NO: 6: target nucleic acid sequence of LT1
SEQ ID NO: 7: barcode sequence of LT2
SEQ ID NO: 8: target nucleic acid sequence of LT2
SEQ ID NO: 9: barcode sequence of LT3
SEQ ID NO: 10: target nucleic acid sequence of LT3
SEQ ID NO: 11: barcode sequence of LT4
SEQ ID NO: 12: target nucleic acid sequence of LT4
SEQ ID NO: 13: barcode sequence of LT5
SEQ ID NO: 14: target nucleic acid sequence of LT5
SEQ ID NO: 15: barcode sequence of LT6
SEQ ID NO: 16: target nucleic acid sequence of LT6
SEQ ID NO: 17: barcode sequence of ST1
SEQ ID NO: 18: target nucleic acid sequence of ST1
SEQ ID NO: 19: barcode sequence of ST2
SEQ ID NO: 20: target nucleic acid sequence of ST2
SEQ ID NO: 21: barcode sequence of ST3
SEQ ID NO: 22: target nucleic acid sequence of ST3
SEQ ID NO: 23: barcode sequence of ST4
SEQ ID NO: 24: target nucleic acid sequence of ST4
SEQ ID NO: 25: barcode sequence of ST5
SEQ ID NO: 26: target nucleic acid sequence of ST5
SEQ ID NO: 27: sequence of the forward primer for amplification
SEQ ID NO: 28: sequence of the reverse primer for amplification (for index A)
SEQ ID NO: 29: sequence of the reverse primer for amplification (for index B)

References

Claims

A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of target nucleic acid molecules, each having a molecular barcode and an index, to sequencing to obtain sequence information;
(II) selecting, from the sequence information obtained in (I), sequences having a specific index or sequences similar thereto and/or sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences; and
(III) determining, within the group created in (II), the pair of index and molecular barcode with the highest detection frequency to be the correct pair of index and molecular barcode.

The method according to claim 1, wherein the target nucleic acid molecules, to which at least the molecular barcode has been added, have been subjected to amplification before step (I).

The method according to claim 1 or 2, wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is a sequence that, in the molecular barcode portion, contains no more than a predetermined number of mismatched bases relative to the sequence having the specific molecular barcode.

The method according to any one of claims 1 to 3, wherein the molecular barcode has a fixed base at a specific position.

The method according to claim 4, wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is selected on the basis of whether it contains the fixed base at the specific position and/or whether the position of the fixed base is shifted from the specific position.

The method according to claim 4, further comprising excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position.

The method according to any one of claims 1 to 5, wherein, in step (III), pairs of index and molecular barcode other than the determined correct pair are determined to be incorrect pairs of index and molecular barcode and are excluded.

The method according to any one of claims 1 to 7, further comprising determining the number of target nucleic acid molecules contained in the sample from which the target nucleic acid molecules are derived, on the basis of the number of groups created from sequences having a specific molecular barcode or sequences similar thereto.

A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules having a molecular barcode to sequencing to obtain sequence information; and
(II) selecting, from the sequence information obtained in (I), sequences having a specific molecular barcode or sequences similar thereto, and creating a group from the selected sequences.

The method according to claim 9, wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is a sequence that, in the molecular barcode portion, contains no more than a predetermined number of mismatched bases relative to the sequence having the specific molecular barcode.

The method according to claim 9 or 10, wherein the molecular barcode has a fixed base at a specific position.

The method according to claim 11, wherein a sequence similar to the sequence having the specific molecular barcode in step (II) is selected on the basis of whether it contains the fixed base at the specific position and/or whether the position of the fixed base is shifted from the specific position.

The method according to claim 11, further comprising excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position.

The method according to any one of claims 9 to 13, further comprising determining the number of target nucleic acid molecules contained in the sample from which the target nucleic acid molecules are derived, on the basis of the number of groups created from sequences having a specific molecular barcode or sequences similar thereto.

The method according to any one of claims 9 to 14, wherein the nucleic acid molecule of interest to which at least the molecular barcode has been added has been subjected to amplification before step (I).

A method for analyzing nucleic acids, comprising:
(I) subjecting a mixture of a plurality of nucleic acid molecules having a molecular barcode with a fixed base at a specific position to sequencing to obtain sequence information;
(IIa) excluding from the analysis sequences having a molecular barcode that does not contain the fixed base at the specific position;
(IIb) in step (I) or after step (I), obtaining sequence information consisting of sequences that contain the fixed base at the specific position; or
(IIc) further performing step (II) of selecting, from the sequence information obtained in (I), sequences having a specific molecular barcode or sequences similar thereto and creating a group from the selected sequences, and, in step (II) or after step (II), obtaining a group consisting of sequences that contain the fixed base at the specific position.