HK40112829A

HK40112829A - Cancer detection and classification using methylome analysis

Info

Publication number: HK40112829A
Application number: HK42024100992.7A
Authority: HK
Inventors: Daniel Diniz De Carvalho; Scott Victor Bratman; Rajat Singhania; Ankur Ravinarayana Chakravarthy; Shu Yi Shen
Original assignee: University Health Network; Sinai Health System
Priority date: 2017-07-12
Filing date: 2024-12-17
Publication date: 2025-01-28

Description

Cancer detection and classification using methylome analysis

This application is a divisional application of Chinese patent application No. 201880059089.5, filed on July 11, 2018 and entitled "Cancer Detection and Classification Using Methylome Analysis" (the corresponding PCT application, No. PCT/CA2018/000141, was filed on July 11, 2018).

Related Applications

This application claims priority to U.S. Provisional Patent Application No. 62/531527, filed July 12, 2017, which is incorporated herein by reference.

Technical Field

The invention relates to cancer detection and classification, and more specifically to the use of methylome analysis for cancer detection and classification.

Background

The use of circulating cell-free DNA (cfDNA) as a source of biomarkers is rapidly gaining traction in oncology [1]. DNA methylation mapping of cfDNA as a biomarker could have a significant impact on the field of liquid biopsy, as it can identify the tissue of origin in a minimally invasive manner [2], classify cancer types and subtypes, and stratify cancer patients [3]. Furthermore, genome-wide DNA methylation mapping of cfDNA can overcome a critical sensitivity problem in detecting circulating tumor DNA (ctDNA) in patients with early-stage cancer who have no radiographic evidence of disease. Existing ctDNA detection methods are based on sequencing mutations and have limited sensitivity, in part because only a limited number of recurrent mutations are available to distinguish tumor cfDNA from normal circulating cfDNA [4, 5]. Genome-wide DNA methylation mapping, on the other hand, exploits the large number of epigenetic alterations that can be used to distinguish ctDNA from normal circulating cfDNA. For example, certain tumor types (e.g., ependymoma) may have extensive DNA methylation aberrations without any significant recurrent somatic mutations [6].

Certain methods for capturing cell-free methylated DNA are described in WO 2017/190215, which is incorporated herein by reference.

Summary of the Invention

In one aspect, a method is provided for detecting the presence of DNA from cancer cells in a subject, comprising: providing a sample of cell-free DNA from the subject; subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA; adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, and then optionally denaturing the sample; capturing the cell-free methylated DNA using a binding agent selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancer individuals; and identifying the presence of DNA from cancer cells if there is a statistically significant similarity between the captured cell-free methylated DNA and one or more of the cell-free methylated DNA sequences from cancer individuals.

In one aspect, a method is provided for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving sequencing data of cell-free methylated DNA from a subject sample; comparing the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancer individuals; identifying the presence of DNA from cancer cells if there is a statistically significant similarity between the captured cell-free methylated DNA and one or more of the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison.

In one aspect, a computer-implemented method is provided for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample; comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancer individuals; identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between the captured cell-free methylated DNA and one or more of the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison.

In one aspect, a computer program product is provided for use in conjunction with a general-purpose computer having a processor and a memory connected to the processor, the computer program product comprising a computer-readable storage medium having a computer program mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the methods described herein.

In one aspect, a computer-readable medium is provided having stored thereon a data structure for storing the computer program product described herein.

In one aspect, a device is provided for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive sequencing data of cell-free methylated DNA from a subject sample; compare the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancer individuals; identify the presence of DNA from cancer cells if there is a statistically significant similarity between the captured cell-free methylated DNA and one or more of the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identify the cancer cell tissue of origin and the cancer subtype based on the comparison.

In one aspect, a method is provided for detecting the presence of DNA from cancer cells and determining, from two or more possible organs, the location of the cancer from which the cancer cells arise, the method comprising: providing a sample of cell-free DNA from a subject; capturing cell-free methylated DNA from the sample using a binding agent selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequence patterns of the captured cell-free methylated DNA with the DNA sequence patterns of two or more populations of control individuals, each of the two or more populations having localized cancer in a different organ; and determining which organ the cancer cells arise from based on a statistically significant similarity between the methylation pattern of the cell-free DNA and that of one of the two or more populations.

Brief Description of the Drawings

These and other features of the preferred embodiments of the invention will become more apparent in the following detailed description, in which reference is made to the accompanying drawings, wherein:

Figures 1A-1G show that methylome analysis of cfDNA is a highly sensitive method for enriching and detecting ctDNA in low amounts of input DNA. A) In silico simulation of the probability of detecting at least one epimutation as a function of ctDNA concentration (columns), number of DMRs investigated (rows), and sequencing depth (x-axis). B) Genome-wide Pearson correlations between DNA methylation signals for 1 to 100 ng of input DNA from the HCT116 cell line, fragmented to mimic plasma cfDNA. Each concentration has two biological replicates. C) DNA methylation profiles (green tracks) obtained by cfMeDIP-seq from different concentrations of HCT116 input DNA, plus RRBS (reduced representation bisulfite sequencing) HCT116 data obtained from ENCODE (ENCSR000DFS) and WGBS (whole-genome bisulfite sequencing) HCT116 data obtained from GEO (GSM1465024). For the heatmap (RRBS track), yellow indicates methylated, blue indicates unmethylated, and gray indicates no coverage. D-E) The CRC cell line HCT116 was serially diluted into the multiple myeloma (MM) cell line MM1.S. cfMeDIP-seq was performed on pure HCT116 DNA (100% CRC), pure MM1.S DNA (100% MM), and 10%, 1%, 0.1%, 0.01%, and 0.001% CRC DNA diluted into MM DNA. All DNA was fragmented to mimic plasma cfDNA. We observed an almost perfect linear correlation (r² = 0.99, p < 0.0001) between the observed and expected values of (D) the number of DMRs and (E) the DNA methylation signal within those DMRs (in RPKM). F) In the same dilution series, known somatic mutations are detectable only down to an allele fraction of 1/100 by ultra-deep (>10,000X) targeted sequencing, above the background sequencer and polymerase error rates. Shown is the fraction of reads containing each base or insertion/deletion at each mutation site in the CRC cell line. G) Percentage of ctDNA (human) in total cfDNA (human + mouse) in the plasma of mice bearing patient-derived xenografts (PDX) from two colorectal cancer patients.

Figures 2A-2D show that methylome analysis of plasma cfDNA allows tumor classification. A) Schematic of the approach used to build the machine learning classifier for cancer classification. B) Heatmap of the DMRs included in the multi-class elastic net machine learning classifier. The classifier was trained on plasma DNA samples from healthy donors (n=24), lung cancer (n=25), breast cancer (n=25), colorectal cancer (n=23), acute myeloid leukemia (AML) (n=28), and glioblastoma multiforme (GBM) (n=71). Hierarchical clustering method: Ward. C) 2D visualization by tSNE (t-distributed stochastic neighbor embedding) of the cancer type-associated DMRs identified in 10% or 25% of the models. D) Performance metrics of the plasma cfDNA methylation-based multi-cancer classifier. The area under the receiver operating characteristic curve (auROC) for each cancer type and for healthy donors is shown on the Y-axis after generation of the 50-fold elastic net machine learning classifiers.

Figures 3A-3B show validation of the multi-cancer classifier on independent cohorts. A) ROC curves for independent validation of the multi-cancer classifier on lung cancer (LUC) (n=55 LUC vs. n=97 others), AML (n=35 AML vs. n=117 others), and healthy donor (n=62 healthy donors vs. n=90 others) cohorts. B) ROC curves for independent validation of the multi-cancer classifier on early-stage LUC (n=32 stage I-II LUC vs. n=97 others) and late-stage LUC (n=23 stage III-IV LUC vs. n=97 others).

Figures 4A-4E show that methylome analysis of plasma cfDNA allows tumor subtype classification. A) 2D visualization by tSNE (t-distributed stochastic neighbor embedding) of cancer subtype-associated DMRs. The breast cancer subtypes demonstrate the ability to distinguish patients whose tumors have different gene expression patterns and transcription factor activity (ER status) and different tumor copy number aberrations (HER2 status). The AML subtypes demonstrate the ability to distinguish patients whose tumors have different rearrangements (FLT3 status). The glioblastoma multiforme (GBM) subtypes demonstrate the ability to distinguish patients whose tumors have different point mutations (IDH gene mutation status). The lung cancer subtypes demonstrate the ability to distinguish patients whose tumors have different histologies (adenocarcinoma vs. squamous cell carcinoma vs. small cell carcinoma), which has prognostic and therapeutic implications. B) Heatmap showing that the top DMRs accurately distinguish the three breast cancer subtypes in breast cancer plasma samples. C) Heatmap showing that the top DMRs accurately distinguish FLT3-ITD status in plasma samples from AML patients. D) Heatmap showing that the top DMRs accurately distinguish IDH gene mutation status in plasma samples from GBM patients. E) Heatmap showing that the top DMRs accurately distinguish the three lung cancer histologies in lung cancer plasma samples.

Figure 5 shows a suitably configured computer device, together with associated communication networks, devices, software, and firmware, that provides a platform for implementing one or more of the embodiments described herein.

Figures 6A-6C show sequencing saturation analysis and quality control. A) Results of the saturation analysis performed with the Bioconductor package MEDIPS on cfMeDIP-seq data from each replicate at each input concentration of HCT116 DNA fragmented to mimic plasma cfDNA. B) Four starting DNA concentrations from the HCT116 cell line (100, 10, 5, and 1 ng) were tested in duplicate. The specificity of the reaction was calculated using methylated and unmethylated spiked-in Arabidopsis thaliana DNA. The fold enrichment ratio was calculated using genomic regions of the fragmented HCT116 DNA (primers for the methylated testis-specific H2B, TSH2B0, and for an unmethylated human DNA region (GAPDH promoter)). The horizontal dashed line indicates a fold enrichment ratio threshold of 25. Error bars represent ±1 s.e.m. C) CpG enrichment scores of the sequenced samples show strong enrichment of CpGs within the genomic regions from the immunoprecipitated samples compared with the input control. The CpG enrichment score is obtained by dividing the relative frequency of CpGs in the regions by the relative frequency of CpGs in the human genome. Error bars represent ±1 s.e.m.
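The CpG enrichment score described for Figure 6C can be illustrated with a short sketch. It assumes that "relative frequency of CpGs" means the number of CpG dinucleotides divided by the number of bases; the function names and the toy sequences in the usage note are hypothetical and are not taken from the MEDIPS package.

```python
def cpg_relative_frequency(sequences):
    """Number of CpG dinucleotides divided by total bases (assumed definition)."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        raise ValueError("no sequence supplied")
    cpg_count = sum(seq.upper().count("CG") for seq in sequences)
    return cpg_count / total_bases


def cpg_enrichment_score(region_sequences, genome_sequences):
    """Relative CpG frequency in the covered regions over that of the reference."""
    return (cpg_relative_frequency(region_sequences)
            / cpg_relative_frequency(genome_sequences))
```

For example, `cpg_enrichment_score(["ACGCGT"], ["ACGTTTTTTTTT"])` compares a region with 2 CpGs in 6 bases against a reference with 1 CpG in 12 bases; a score above 1 indicates that the captured regions are richer in CpGs than the reference.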

Detailed Description

In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it is understood that the invention may be practiced without these specific details.

DNA methylation profiles are cell-type specific and are disrupted in cancer. Using a robust and sensitive method designed for methylome analysis of small quantities of circulating cell-free DNA (cfDNA), we identified thousands of differentially methylated regions (DMRs) that distinguish multiple tumor types from each other and from healthy individuals. Methylome analysis of cfDNA is highly sensitive and suitable for detecting circulating tumor DNA (ctDNA) in early-stage patients. Machine learning-derived classifiers using cfDNA methylomes were able to correctly classify, under cross-validation, 196 plasma samples from patients with five cancer types and from healthy donors. In independent validation, using the same DMRs identified in plasma cfDNA, the classifier was able to correctly classify AML, lung cancer, and healthy donors, as well as early-stage and late-stage lung cancer. Methylome analysis of cfDNA can therefore be used for non-invasive early detection of ctDNA and for robust classification of cancer types.

In one aspect, a method is provided for detecting the presence of DNA from cancer cells in a subject, comprising: providing a sample of cell-free DNA from the subject; subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA; adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, and then optionally denaturing the sample; capturing the cell-free methylated DNA using a binding agent selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancer individuals; and identifying the presence of DNA from cancer cells if there is a statistically significant similarity between the captured cell-free methylated DNA and one or more of the cell-free methylated DNA sequences from cancer individuals.

Applicant's commonly owned applications, U.S. Provisional Patent Application No. 62/331,070, filed May 3, 2016, and International Patent Application No. PCT/CA2017/000108, filed May 3, 2017, describe methods of capturing cell-free methylated DNA and are incorporated herein by reference.

Traditionally, cancers have been classified by tissue of origin, for example colorectal cancer, breast cancer, lung cancer, and so on. In the modern practice of clinical oncology, it is increasingly important to be able to distinguish cancer subtypes on various molecular, developmental, and functional bases. Treatment decisions often depend on the precise subtype of the cancer, and clinicians may need to identify the subtype before starting therapy. Examples of cancer subtypes that may influence treatment decisions include, but are not limited to, stage (e.g., early-stage lung cancer treated with surgery vs. late-stage lung cancer treated with chemotherapy), histology (e.g., small cell carcinoma vs. adenocarcinoma vs. squamous cell carcinoma in lung cancer), gene expression pattern or transcription factor activity (e.g., ER status in breast cancer), copy number aberrations (e.g., HER2 status in breast cancer), specific rearrangements (e.g., FLT3 in AML), specific gene point mutation status (e.g., IDH gene point mutations), and DNA methylation patterns (e.g., MGMT gene promoter methylation in brain cancer).

The methods described herein are applicable to a wide variety of cancers, including but not limited to adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain/CNS tumors, breast cancer, Castleman disease, cervical cancer, colon/rectum cancer, endometrial cancer, esophagus cancer, Ewing family of tumors, eye cancer, gallbladder cancer, gastrointestinal carcinoid tumors, gastrointestinal stromal tumor (GIST), gestational trophoblastic disease, Hodgkin disease, Kaposi sarcoma, kidney cancer, laryngeal and hypopharyngeal cancer, leukemia (acute lymphocytic, acute myeloid, chronic lymphocytic, chronic myeloid, chronic myelomonocytic), liver cancer, lung cancer (non-small cell, small cell, lung carcinoid tumor), lymphoma, lymphoma of the skin, malignant mesothelioma, multiple myeloma, myelodysplastic syndrome, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, non-Hodgkin lymphoma, oral cavity and oropharyngeal cancer, osteosarcoma, ovarian cancer, penile cancer, pituitary tumors, prostate cancer, retinoblastoma, rhabdomyosarcoma, salivary gland cancer, sarcoma (adult soft tissue cancer), skin cancer (basal and squamous cell, melanoma, Merkel cell), small intestine cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenstrom macroglobulinemia, and Wilms tumor.

Various sequencing techniques are known to those skilled in the art, for example polymerase chain reaction (PCR) followed by Sanger sequencing. Next-generation sequencing (NGS) techniques, also known as high-throughput sequencing, can also be used; these include Illumina (Solexa) sequencing, Roche 454 sequencing, Ion Torrent: Proton/PGM sequencing, and SOLiD sequencing. NGS allows DNA and RNA to be sequenced more quickly and cheaply than the previously used Sanger sequencing. In some embodiments, the sequencing is optimized for short-read sequencing.

As used herein, the term "subject" refers to any member of the animal kingdom, preferably a human, and most preferably a human who has, has had, or is suspected of having prostate cancer.

Cell-free methylated DNA is DNA that circulates freely in the bloodstream and is methylated at various known regions of the DNA. Samples, for example plasma samples, can be taken to analyze cell-free methylated DNA. Accordingly, in some embodiments, the sample is the subject's blood or plasma.

As used herein, "library preparation" includes end repair, A-tailing, adapter ligation, or any other preparation performed on the cell-free DNA to permit subsequent sequencing of the DNA.

As used herein, "filler DNA" can be non-coding DNA or can consist of amplicons.

The DNA sample may be denatured, for example, by applying sufficient heat.

In some embodiments, the comparison step is based on a fit using a statistical classifier. Statistical classifiers using DNA methylation data can be used to assign a sample to a particular disease state, such as a cancer type or subtype. For the purpose of cancer type or subtype classification, the classifier consists of one or more DNA methylation variables (i.e., features) within a statistical model, and the output of the statistical model has one or more thresholds that distinguish between different disease states. The particular features and thresholds used in the statistical classifier can be derived from prior knowledge of the cancer types or subtypes, from prior knowledge of the features most likely to be informative, from machine learning, or from a combination of two or more of these approaches.
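The structure just described (methylation features inside a statistical model whose output is compared against thresholds) can be sketched as below. This is only an illustration: the logistic form, the weights, the feature names, and the threshold values are all assumptions made for the example and are not taken from the specification.

```python
import math


def model_score(features, weights, intercept=0.0):
    """Logistic output in (0, 1) computed from per-DMR methylation signals.

    `features` and `weights` are dicts keyed by hypothetical DMR names;
    features without a weight are ignored.
    """
    z = intercept + sum(weights[name] * value
                        for name, value in features.items()
                        if name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def assign_state(score, thresholds):
    """Return the label of the highest lower bound that `score` reaches.

    `thresholds` is a list of (lower_bound, label) pairs.
    """
    label = None
    for lower_bound, name in sorted(thresholds):
        if score >= lower_bound:
            label = name
    return label
```

With `weights = {"dmr_a": 2.0, "dmr_b": -1.0}` and `thresholds = [(0.0, "healthy"), (0.5, "cancer")]`, a sample with a strong `dmr_a` signal is assigned "cancer". In practice the weights and thresholds would come from prior knowledge or from a fitted model rather than being hard-coded.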

In some embodiments, the classifier is machine learning-derived. Preferably, the classifier is an elastic net classifier, lasso, support vector machine, random forest, or neural network.

The genomic space that is analyzed can be genome-wide or, preferably, restricted to regulatory regions (i.e., FANTOM5 enhancers, CpG islands, CpG shores, and CpG shelves).
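Restricting the analysis to regulatory regions amounts to keeping only the genomic windows that overlap an annotation set. A minimal sketch follows, assuming half-open `(chrom, start, end)` intervals; the function names and coordinates are hypothetical, and a real pipeline would load FANTOM5 or CpG annotations from files and use an indexed interval structure rather than this linear scan.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def restrict_to_regions(windows, regions):
    """Keep the windows that overlap at least one annotated region.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    regions_by_chrom = {}
    for chrom, start, end in regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in windows:
        candidates = regions_by_chrom.get(chrom, [])
        if any(overlaps(start, end, r_start, r_end)
               for r_start, r_end in candidates):
            kept.append((chrom, start, end))
    return kept
```

For instance, with windows `chr1:0-300`, `chr1:300-600`, and `chr2:0-300` and a single annotated region `chr1:250-320`, only the two `chr1` windows are retained.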

Preferably, the percentage of spike-in methylated DNA recovered is included as a covariate to control for variation in pulldown efficiency.
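One way to picture the covariate is as an extra column appended to each sample's feature row. The sketch below assumes that recovery is measured as the amount of spike-in recovered after capture divided by the amount added, in the same units; the function and key names are hypothetical.

```python
def spike_in_recovery_percent(recovered, added):
    """Percentage of the added spike-in methylated DNA that was recovered."""
    if added <= 0:
        raise ValueError("amount of spike-in added must be positive")
    return 100.0 * recovered / added


def add_recovery_covariate(dmr_features, recovered, added):
    """Return a copy of the per-sample feature dict with the covariate appended."""
    row = dict(dmr_features)
    row["spike_in_recovery_pct"] = spike_in_recovery_percent(recovered, added)
    return row
```

A sample with features `{"dmr_1": 3.2}` in which 45 of 60 units of spike-in were recovered would gain `spike_in_recovery_pct = 75.0`, which a downstream model can then use to adjust for capture efficiency.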

For classifiers capable of distinguishing multiple cancer types (or subtypes) from one another, the classifier preferably consists of differentially methylated regions from pairwise comparisons of each type (or subtype) of interest.
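The pairwise construction can be sketched as taking, for every pair of classes, the regions that differ most between the two and pooling them into one feature set. In the illustration below, ranking by the absolute difference of class means is a stand-in chosen for brevity and is an assumption; it is not the differential methylation test used in the specification, and the class and region names are hypothetical.

```python
from itertools import combinations


def pairwise_dmr_features(mean_signal, top_n=2):
    """Union of the top-ranked regions from every pairwise class comparison.

    `mean_signal` maps class label -> {region: mean methylation signal}.
    """
    selected = set()
    for class_a, class_b in combinations(sorted(mean_signal), 2):
        shared = set(mean_signal[class_a]) & set(mean_signal[class_b])
        ranked = sorted(
            shared,
            key=lambda region: (
                -abs(mean_signal[class_a][region] - mean_signal[class_b][region]),
                region,
            ),
        )
        selected.update(ranked[:top_n])
    return sorted(selected)
```

Pooling across all pairs ensures that a region separating only two of the classes (for example lung cancer from healthy donors) is still available to the multi-class model, even if it is uninformative for every other pair.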

In some embodiments, the control cell-free methylated DNA sequences from healthy and cancer individuals are contained in a database of differentially methylated regions (DMRs) between healthy and cancer individuals.

In some embodiments, the control cell-free methylated DNA sequences from healthy and cancer individuals are limited to those that are differentially methylated between healthy and cancer individuals in cell-free DNA derived from a bodily fluid, for example serum, cerebrospinal fluid, urine, sputum, pleural fluid, ascites, tears, sweat, Pap smear fluid, endoscopy brushings fluid, etc., preferably plasma.

In some embodiments, the sample has less than 100 ng, 75 ng, or 50 ng of cell-free DNA.

In some embodiments, the first amount of filler DNA comprises about 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% methylated filler DNA, with the remainder being unmethylated filler DNA, and preferably comprises 5% to 50%, 10% to 40%, or 15% to 30% methylated filler DNA.

In some embodiments, the first amount of filler DNA is 20 ng to 100 ng, preferably 30 ng to 100 ng, and more preferably 50 ng to 100 ng.

In some embodiments, the cell-free DNA from the sample and the first amount of filler DNA together comprise at least 50 ng of total DNA, preferably at least 100 ng of total DNA.
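The amounts above translate into simple arithmetic: top the sample up with filler DNA until a target total is reached, then split the filler into methylated and unmethylated portions. The sketch below uses a 100 ng target and a 20% methylated fraction as illustrative defaults chosen from within the stated ranges; the function name and return keys are hypothetical.

```python
def filler_dna_plan(cfdna_ng, target_total_ng=100.0, methylated_fraction=0.2):
    """Filler DNA needed to reach the target total, split by methylation state."""
    if cfdna_ng < 0:
        raise ValueError("cfDNA amount cannot be negative")
    if not 0.0 <= methylated_fraction <= 1.0:
        raise ValueError("methylated_fraction must be between 0 and 1")
    # No filler is needed if the sample already meets the target.
    filler_ng = max(target_total_ng - cfdna_ng, 0.0)
    methylated_ng = filler_ng * methylated_fraction
    return {
        "filler_ng": filler_ng,
        "methylated_ng": methylated_ng,
        "unmethylated_ng": filler_ng - methylated_ng,
    }
```

For a 10 ng cfDNA sample, this plan calls for 90 ng of filler DNA, of which about 18 ng is methylated and about 72 ng is unmethylated.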

In some embodiments, the filler DNA is 50 bp to 800 bp long, preferably 100 bp to 600 bp long, and more preferably 200 bp to 600 bp long.

In some embodiments, the filler DNA is double-stranded. For example, the filler DNA can be junk DNA. The filler DNA may also be endogenous or exogenous DNA. For example, the filler DNA is non-human DNA and, in a preferred embodiment, λ DNA. As used herein, "λ DNA" refers to Enterobacteria phage λ DNA. In some embodiments, the filler DNA has no alignment to human DNA.

In some embodiments, the binding agent is a protein comprising a methyl-CpG-binding domain. One such exemplary protein is the MBD2 protein. As used herein, "methyl-CpG-binding domain (MBD)" refers to certain domains of proteins and enzymes that are approximately 70 residues long and bind to DNA containing one or more symmetrically methylated CpGs. The MBDs of MeCP2, MBD1, MBD2, MBD4, and BAZ2 mediate binding to DNA and, in the cases of MeCP2, MBD1, and MBD2, bind preferentially to methylated CpG. The human proteins MECP2, MBD1, MBD2, MBD3, and MBD4 comprise a family of nuclear proteins related by the presence in each of a methyl-CpG-binding domain (MBD). Each of these proteins, with the exception of MBD3, is capable of binding specifically to methylated DNA.

In other embodiments, the binding agent is an antibody, and capturing the cell-free methylated DNA comprises immunoprecipitating the cell-free methylated DNA using the antibody. As used herein, “immunoprecipitation” refers to a technique of precipitating an antigen (such as polypeptides and nucleotides) out of solution using an antibody that specifically binds to that particular antigen. This process can be used to isolate and concentrate a particular protein or DNA from a sample and requires that the antibody be coupled to a solid substrate at some point in the procedure. Solid substrates include, for example, beads, such as magnetic beads. Other types of beads and solid substrates are known in the art.

One exemplary antibody is a 5-MeC antibody. For the immunoprecipitation procedure, in some embodiments at least 0.05 μg of the antibody is added to the sample, and in more preferred embodiments at least 0.16 μg of the antibody is added to the sample. To confirm the immunoprecipitation reaction, in some embodiments the method described herein further comprises the step of adding a second amount of control DNA to the sample.

In some embodiments, the method further comprises the step of adding a second amount of control DNA to the sample to confirm the immunoprecipitation reaction.

As used herein, a “control” may include both positive and negative controls, or at least a positive control.

In some embodiments, the method further comprises the step of adding a second amount of control DNA to the sample to confirm the capture of cell-free methylated DNA.

In some embodiments, identifying the presence of DNA from cancer cells further comprises identifying the cancer cell tissue of origin.

In some cases, tumor tissue sampling may be challenging or carry significant risk, in which case it may be desirable to diagnose and/or subtype the cancer without tumor tissue sampling. For example, lung tumor tissue sampling may require invasive procedures such as mediastinoscopy, thoracotomy, or percutaneous needle biopsy; these procedures may entail hospitalization, chest tubes, mechanical ventilation, antibiotics, or other medical interventions. Some individuals may not undergo the invasive procedures needed for tumor tissue sampling, either because of medical comorbidities or by preference. In some cases, the actual procedure for obtaining tumor tissue may depend on the suspected cancer subtype. In other cases, the cancer subtype may evolve over time within the same individual. Serial assessment by invasive tumor tissue sampling is often impractical and poorly tolerated by patients. Therefore, non-invasive cancer subtyping by a blood test may have many advantageous applications in clinical oncology practice.

Accordingly, in some embodiments, identifying the cancer cell tissue of origin further comprises identifying a cancer subtype. Preferably, the cancer subtype differentiates the cancer based on stage (e.g., early-stage lung cancer treated surgically versus late-stage lung cancer treated with chemotherapy), histology (e.g., small cell carcinoma versus adenocarcinoma versus squamous cell carcinoma in lung cancer), gene expression pattern or transcription factor activity (e.g., ER status in breast cancer), copy number aberrations (e.g., HER2 status in breast cancer), specific rearrangements (e.g., FLT3 in AML), specific gene point mutation status (e.g., IDH gene point mutations), and DNA methylation patterns (e.g., MGMT gene promoter methylation in brain cancer).

In some embodiments, the comparison in step (f) is carried out genome-wide.

In other embodiments, the comparison in step (f) is restricted from genome-wide to specific regulatory regions, such as, but not limited to, FANTOM5 enhancers, CpG islands, CpG shores, CpG shelves, or any combination of the foregoing.
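As a non-limiting sketch of how a comparison might be restricted to regulatory regions, the following Python keeps only those genomic windows that overlap at least one region in an annotation list. The interval representation (0-based, half-open), the function names, and the example coordinates are hypothetical; in practice the regions would be drawn from annotations such as CpG islands or FANTOM5 enhancers.

```python
from typing import List, Tuple

# (chromosome, start, end) with 0-based, half-open coordinates (an assumption).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def restrict_to_regions(windows: List[Interval],
                        regions: List[Interval]) -> List[Interval]:
    """Keep the windows that overlap any regulatory region, preserving input order."""
    return [w for w in windows if any(overlaps(w, r) for r in regions)]


# Made-up coordinates for illustration only.
windows = [("chr1", 0, 300), ("chr1", 300, 600), ("chr2", 0, 300)]
regions = [("chr1", 250, 320)]
kept = restrict_to_regions(windows, regions)
```

The quadratic scan is adequate for a small example; a genome-wide analysis would normally use sorted intervals or an interval tree instead.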

In some embodiments, certain steps are carried out by a computer processor.

In one aspect, there is provided a method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving sequencing data of cell-free methylated DNA from a subject sample; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals; identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison step.

The present systems and methods may be practiced in various embodiments. A suitably configured computer device, and associated communications networks, devices, software, and firmware, may provide a platform for enabling one or more embodiments as described above. By way of example, FIG. 5 shows a generic computer device 100 that may include a central processing unit (“CPU”) 102 connected to a storage unit 104 and to a random access memory 106. The CPU 102 may process an operating system 101, application program 103, and data 123. The operating system 101, application program 103, and data 123 may be stored in the storage unit 104 and loaded into the memory 106 as required. The computer device 100 may further include a graphics processing unit (GPU) 122, which is operatively connected to the CPU 102 and to the memory 106 to offload intensive image processing calculations from the CPU 102 and run these calculations in parallel with the CPU 102. An operator 107 may interact with the computer device 100 using a video display 108 connected by a video interface 105, and various input/output devices such as a keyboard 115, mouse 112, and disk drive or solid state drive 114 connected by an I/O interface 109. In a known manner, the mouse 112 may be configured to control movement of a cursor in the video display 108 and to operate, with a mouse button, various graphical user interface (GUI) controls appearing in the video display 108. The disk drive or solid state drive 114 may be configured to accept computer-readable media 116. The computer device 100 may form part of a network via a network interface 111, allowing the computer device 100 to communicate with other suitably configured data processing systems (not shown). One or more different types of sensors 135 may be used to receive input from various sources.

The present systems and methods may be practiced on virtually any manner of computer device, including a desktop computer, laptop computer, tablet computer, or wireless handheld device. The present systems and methods may also be implemented as a computer-readable/usable medium that includes computer program code to enable one or more computer devices to implement each of the various process steps in a method in accordance with the present invention. Where more than one computer device performs the entire operation, the computer devices are networked to distribute the various steps of the operation. It is understood that the terms computer-readable medium or computer-usable medium comprise one or more of any type of physical embodiment of the program code. In particular, the computer-readable/usable medium can comprise program code embodied on one or more portable storage articles of manufacture (e.g., an optical disc, a magnetic disk, a tape, etc.) or on one or more data storage portions of a computing device, such as memory associated with a computer and/or a storage system.

In one aspect, there is provided a computer-implemented method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample; comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals; identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison step.

In one aspect, there is provided a device for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive sequencing data of cell-free methylated DNA from a subject sample; compare the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals; identify the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified, further identify the cancer cell tissue of origin and the cancer subtype based on the comparison step.
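Purely as a simplified sketch of the comparison step that a processor might execute, the following Python scores a sample's per-region methylation signal against the mean profile of each reference group and reports the best-matching group. This nearest-centroid correlation is a stand-in chosen for brevity; it is not the classifier of the present disclosure, it performs no test of statistical significance, and all names and numbers in it are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)


def centroid(profiles: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a group's methylation profiles."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def classify(sample: Sequence[float],
             references: Dict[str, List[Sequence[float]]]
             ) -> Tuple[str, Dict[str, float]]:
    """Return the reference label whose centroid correlates best with the sample."""
    scores = {label: pearson(sample, centroid(profs))
              for label, profs in references.items()}
    return max(scores, key=scores.get), scores


# Toy signal over four regions; values are invented for illustration.
references = {
    "healthy": [[1, 1, 5, 5], [2, 1, 6, 5]],
    "cancer": [[5, 6, 1, 1], [6, 5, 1, 2]],
}
label, scores = classify([5, 5, 1, 2], references)
```

A practical implementation would replace the correlation score with a fitted statistical classifier and a significance criterion, and would extend the reference labels to tissues of origin and subtypes.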

As used herein, a “processor” may be any type of processor, such as, for example, any type of general-purpose microprocessor or microcontroller (e.g., an Intel™ x86, PowerPC™, or ARM™ processor), a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), or any combination thereof.

As used herein, “memory” may include a suitable combination of any type of computer memory that is located either internally or externally, such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM), or the like. Portions of memory 102 may be organized using a conventional file system, controlled and administered by an operating system governing overall operation of the device.

As used herein, a “computer-readable storage medium” (also referred to as a machine-readable medium, a processor-readable medium, or a computer-usable medium having computer-readable program code embodied therein) is a medium capable of storing data in a format readable by a computer or machine. The machine-readable medium can be any suitable tangible, non-transitory medium, including a magnetic, optical, or electrical storage medium, such as a diskette, compact disc read-only memory (CD-ROM), memory device (volatile or non-volatile), or similar storage mechanism. The computer-readable storage medium can contain various sets of instructions, code sequences, configuration information, or other data which, when executed, cause a processor to perform steps in a method according to an embodiment of the disclosure. Those of ordinary skill in the art will appreciate that other instructions and operations necessary to implement the described implementations can also be stored on the computer-readable storage medium. The instructions stored on the computer-readable storage medium can be executed by a processor or other suitable processing device, and can interface with circuitry to perform the described tasks.

As used herein, a “data structure” is a particular way of organizing data in a computer so that it can be used efficiently. Data structures can implement one or more particular abstract data types (ADTs), which specify the operations that can be performed on a data structure and the computational complexity of those operations. In comparison, a data structure is a concrete implementation of the specification provided by an ADT.

The advantages of the present invention are further illustrated by the following examples. The examples and their particular details set forth herein are presented for illustration only and should not be construed as limiting the claims of the present invention.

The present invention provides embodiments including, but not limited to, the following:

1. A method of detecting the presence of DNA from cancer cells in a subject, comprising:

(a) providing a sample of cell-free DNA from the subject;

(b) subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA;

(c) adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, then optionally denaturing the sample;

(d) capturing cell-free methylated DNA using a binding agent selective for methylated polynucleotides;

(e) sequencing the captured cell-free methylated DNA;

(f) comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals;

(g) identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals.

2. The method of embodiment 1, wherein the sample is from blood or plasma of the subject.

3. The method of embodiment 1 or 2, wherein the comparison step (f) is based on a fit using a statistical classifier.

4. The method of embodiment 3, wherein the classifier is machine learning-derived.

5. The method of embodiment 4, wherein the classifier is an elastic net classifier, lasso, support vector machine, random forest, or neural network.

6. The method of any one of embodiments 1 to 5, wherein the control cell-free methylated DNA sequences from healthy and cancer individuals are comprised in a database of differentially methylated regions (DMRs) between healthy and cancer individuals.

7. The method of any one of embodiments 1 to 6, wherein the control cell-free methylated DNA sequences from healthy and cancer individuals are limited to those control cell-free methylated DNA sequences that are differentially methylated between healthy and cancer individuals in DNA derived from cell-free DNA.

8. The method of embodiment 7, wherein the control cell-free methylated DNA sequences are differentially methylated between healthy and cancer individuals in DNA derived from plasma.

9. The method of any one of embodiments 1 to 8, wherein the sample has less than 100 ng, 75 ng, or 50 ng of cell-free DNA.

10. The method of any one of embodiments 1 to 9, wherein the first amount of filler DNA comprises about 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% methylated filler DNA with the remainder being unmethylated filler DNA, preferably 5% to 50%, 10% to 40%, or 15% to 30% methylated filler DNA.

11. The method of any one of embodiments 1 to 9, wherein the first amount of filler DNA is 20 ng to 100 ng, preferably 30 ng to 100 ng, more preferably 50 ng to 100 ng.

12. The method of any one of embodiments 1 to 11, wherein the cell-free DNA from the sample and the first amount of filler DNA together comprise at least 50 ng of total DNA, preferably at least 100 ng of total DNA.

13. The method of any one of embodiments 1 to 12, wherein the filler DNA is 50 bp to 800 bp long, preferably 100 bp to 600 bp long, and more preferably 200 bp to 600 bp long.

14. The method of any one of embodiments 1 to 13, wherein the filler DNA is double-stranded.

15. The method of any one of embodiments 1 to 2, wherein the filler DNA is junk DNA.

16. The method of any one of embodiments 1 to 3, wherein the filler DNA is endogenous or exogenous DNA.

17. The method of embodiment 16, wherein the filler DNA is non-human DNA, preferably λ DNA.

18. The method of any one of embodiments 1 to 17, wherein the filler DNA has no alignment to human DNA.

19. The method of any one of embodiments 1 to 18, wherein the binding agent is a protein comprising a methyl-CpG binding domain.

20. The method of any one of embodiments 1 to 19, wherein the protein is the MBD2 protein.

21. The method of any one of embodiments 1 to 20, wherein step (d) comprises immunoprecipitating the cell-free methylated DNA using an antibody.

22. The method of embodiment 21, comprising adding at least 0.05 μg of the antibody to the sample for immunoprecipitation, and preferably at least 0.16 μg.

23. The method of embodiment 21, wherein the antibody is a 5-MeC antibody.

24. The method of embodiment 21, further comprising the step of adding a second amount of control DNA to the sample after step (c) to confirm the immunoprecipitation reaction.

25. The method of any one of embodiments 1 to 23, further comprising the step of adding a second amount of control DNA to the sample after step (c) to confirm the capture of the cell-free methylated DNA.

26. The method of any one of embodiments 1 to 25, wherein identifying the presence of DNA from cancer cells further comprises identifying the cancer cell tissue of origin.

27. The method of embodiment 26, wherein identifying the cancer cell tissue of origin further comprises identifying a cancer subtype.

28. The method of embodiment 27, wherein the cancer subtype differentiates the cancer based on stage, histology, gene expression pattern, copy number aberration, rearrangement, or point mutation status.

29. The method of any one of embodiments 1 to 28, wherein the comparison in step (f) is carried out genome-wide.

30. The method of any one of embodiments 1 to 28, wherein the comparison in step (f) is restricted from genome-wide to specific regulatory regions.

31. The method of embodiment 30, wherein the regulatory regions are FANTOM5 enhancers, CpG islands, CpG shores, CpG shelves, or any combination of the foregoing.

32. The method of any one of embodiments 1 to 31, wherein steps (f) and (g) are carried out by a computer processor.

33. A method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising:

a. receiving sequencing data of cell-free methylated DNA from a subject sample;

b. comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals;

c. identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals; and

d. if DNA from cancer cells is identified in step c, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

34. A computer-implemented method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising:

a. receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample;

b. comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancer individuals;

c. identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals; and, if DNA from cancer cells is identified in step c, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

35. A computer program product for use in conjunction with a general-purpose computer having a processor and a memory connected to the processor, the computer program product comprising a computer-readable storage medium having a computer program mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the method of embodiment 34.

36. A computer-readable medium having stored thereon a data structure for storing the computer program product of embodiment 35.

37. A device for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the device comprising:

at least one processor; and

electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to:

c. identify the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and the cell-free methylated DNA sequences from cancer individuals, and, if DNA from cancer cells is identified in step c, further identify the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

38. A method of detecting the presence of DNA from cancer cells and determining, from two or more possible organs, the location of the cancer from which the cancer cells arise, the method comprising:

(b) capturing cell-free methylated DNA from the sample using a binding agent selective for methylated polynucleotides;

(c) sequencing the captured cell-free methylated DNA;

(d) comparing sequence patterns of the captured cell-free methylated DNA to DNA sequence patterns of two or more populations of control individuals, each of the two or more populations having localized cancer in a different organ;

(e) determining which organ the cancer cells arose from on the basis of a statistically significant similarity between the methylation pattern of the cell-free DNA and one of the two or more populations.

Examples

Methods and Materials

Donor recruitment and sample collection

CRC, breast cancer, and GBM samples were obtained from the University Health Network BioBank; AML samples were obtained from the University Health Network Leukemia BioBank; lastly, healthy controls were recruited through the Family Medicine Centre at Mount Sinai Hospital (MSH) in Toronto, Canada. All samples were collected with patient consent and with institutional approval from the Research Ethics Board, the University Health Network, and Mount Sinai Hospital in Toronto, Canada.

Specimen processing – cfDNA

EDTA and ACD plasma samples were obtained from the BioBanks and from the Family Medicine Centre at Mount Sinai Hospital (MSH) in Toronto, Canada. All samples were stored at -80°C or in vapour-phase liquid nitrogen until use. Cell-free DNA was extracted from 0.5–3.5 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen). The extracted DNA was quantified by Qubit prior to use.

Specimen processing – PDX cfDNA

Human colorectal tumor tissue, obtained from the University Health Network BioBank with patient consent and as approved by the Research Ethics Board at University Health Network, was digested to single cells using collagenase A. The single cells were injected subcutaneously into 4–6 week old male NOD/SCID mice. Mice were euthanized by CO2 inhalation, after which blood was collected by cardiac puncture and stored in EDTA tubes. Plasma was isolated from the collected blood samples and stored at -80°C. Cell-free DNA was extracted from 0.3–0.7 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen). All animal work was carried out in compliance with the ethical regulations approved by the Animal Care Committee at University Health Network.

cfMeDIP-seq

A schematic representation of the cfMeDIP-seq protocol is shown in WO2017/190215. Prior to cfMeDIP, the DNA samples were subjected to library preparation using the Kapa Hyper Prep Kit (Kapa Biosystems). The manufacturer's protocol was followed with some modifications. Briefly, the DNA of interest was added to a 0.2 mL PCR tube and subjected to end repair and A-tailing. Adapter ligation was carried out using NEBNext adapters (from the NEBNext Multiplex Oligos for Illumina kit, New England Biolabs) at a final concentration of 0.181 μM, with incubation at 20°C for 20 minutes, followed by purification with AMPure XP beads. The eluted library was digested using USER enzyme (New England Biolabs, Canada) and then purified with the Qiagen MinElute PCR Purification Kit prior to MeDIP.

The prepared libraries were combined with the pooled methylated/unmethylated λ PCR product to a final DNA amount of 100 ng, and MeDIP was performed using the protocol of Taiwo et al. 2012 [7] with some modifications. Briefly, for MeDIP, the Diagenode MagMeDIP kit (catalog no. C02010021) was used following the manufacturer's protocol with some modifications. After 0.3 ng of control methylated and 0.3 ng of control unmethylated Arabidopsis thaliana DNA, the filler DNA (bringing the total amount of DNA [cfDNA + filler + controls] to 100 ng), and the buffers had been added to the PCR tubes containing the adapter-ligated DNA, the samples were heated to 95°C for 10 minutes and then immediately placed in an ice-water bath for 10 minutes. Each sample was split into two 0.2 mL PCR tubes: one for the 10% input control and the other for the sample to be immunoprecipitated. The 5-mC monoclonal antibody 33D3 (catalog no. C15200081) included in the MagMeDIP kit was diluted 1:15 before the diluted antibody mix was generated and added to the sample. Washed magnetic beads (prepared according to the manufacturer's instructions) were also added, followed by incubation at 4°C for 17 hours. The samples were purified using the Diagenode iPure Kit and eluted in 50 μl of Buffer C. The success of the reaction (QC1) was verified by qPCR detection of the spiked-in Arabidopsis thaliana DNA, ensuring a recovery of the spiked-in unmethylated DNA of <1% and a reaction specificity of >99% (calculated as 1 - [recovery of spiked-in unmethylated control DNA over recovery of spiked-in methylated control DNA]) before proceeding to the next step.

The optimal number of cycles for amplifying each library was determined by qPCR, after which the samples were amplified using KAPA HiFi HotStart Mastermix, with NEBNext multiplex oligos added to a final concentration of 0.3 μM. The PCR settings for library amplification were as follows: activation at 95°C for 3 minutes, followed by the predetermined number of cycles of 98°C for 20 seconds, 65°C for 15 seconds, and 72°C for 30 seconds, and a final extension at 72°C for 1 minute. The amplified libraries were purified using MinElute PCR purification columns and then size-selected on a 3% NuSieve GTG agarose gel to remove any adapter dimers. Before submission for sequencing, the fold enrichment of a methylated human DNA region (testis-specific H2B, TSH2B) and an unmethylated human DNA region (GAPDH promoter) was determined for the MeDIP-seq and cfMeDIP-seq libraries generated from HCT116 cell line DNA (cell line obtained from ATCC, mycoplasma-free) sheared to mimic cell-free DNA. The final libraries were submitted for BioAnalyzer analysis before sequencing on an Illumina HiSeq 2000 at the UHN Princess Margaret Genomic Centre.
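The QC1 acceptance rule described above can be sketched in Python. This is a minimal illustration, not the laboratory's own calculation: it reads "over" in the specificity formula as a ratio, and it assumes percent recovery is derived from qPCR Ct values relative to the 10% input control with roughly 100% amplification efficiency. The function names are hypothetical.

```python
import math


def percent_recovery(ct_input_10pct: float, ct_ip: float) -> float:
    """Percent of a spike-in recovered in the immunoprecipitated fraction.

    Assumption: ~100% PCR efficiency, so one cycle is a 2-fold difference.
    Subtracting log2(10) (about 3.32 cycles) rescales the 10% input control
    to a 100% input equivalent.
    """
    return 100.0 * 2.0 ** ((ct_input_10pct - math.log2(10)) - ct_ip)


def specificity(rec_unmeth_pct: float, rec_meth_pct: float) -> float:
    """1 - (recovery of unmethylated spike-in / recovery of methylated spike-in)."""
    if rec_meth_pct <= 0:
        raise ValueError("methylated spike-in recovery must be positive")
    return 1.0 - rec_unmeth_pct / rec_meth_pct


def passes_qc1(rec_unmeth_pct: float, rec_meth_pct: float) -> bool:
    """Thresholds from the text: unmethylated recovery <1%, specificity >99%."""
    return rec_unmeth_pct < 1.0 and specificity(rec_unmeth_pct, rec_meth_pct) > 0.99
```

Under these assumptions a sample proceeds to library amplification only when `passes_qc1` returns `True` for its two spike-in recoveries.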

Ultra-deep targeted sequencing for point mutation detection

We used the Qiagen Circulating Nucleic Acid Kit to isolate cell-free DNA from approximately 20 mL of plasma (4-5 x 10 mL EDTA blood tubes) from patients with matched tumor tissue molecular profiling data generated at the Princess Margaret Cancer Centre before enrollment in early-phase clinical trials. DNA was extracted from cell lines (the CRC and MM cell line dilutions) using the PureGene Gentra kit and fragmented to approximately 180 bp using a Covaris sonicator, and larger fragments were excluded with AMPure beads to mimic the fragment size of cell-free DNA. DNA sequencing libraries were constructed from 83 ng of fragmented DNA using the KAPA Hyper Prep Kit (Kapa Biosystems, Wilmington, MA) with NEXTflex-96 DNA Barcode adapters (Bio Scientific, Austin, TX). To isolate DNA fragments containing known mutations, we designed biotinylated DNA capture probes (xGen Lockdown Custom Probes Mini Pool, Integrated DNA Technologies, Coralville, IA) targeting mutation hotspots in the 48 genes tested by the clinical laboratory using the Illumina TruSeq Amplicon Cancer Panel. The barcoded libraries were pooled, and the custom hybrid-capture library was then applied following the manufacturer's instructions (IDT xGen Lockdown protocol version 2.1). These fragments were sequenced to >10,000X read coverage on an Illumina HiSeq 2000 instrument. The resulting reads were aligned using bwa-mem, and mutations were detected using samtools and MuTect version 1.1.4.

Modeling the relationship between the number of tumor-specific features and the probability of detection by sequencing depth

We created 145,000 simulated genomes in which the proportion of cancer-specific methylated DMRs was set to 0.001%, 0.01%, 0.1%, 1%, or 10%, and which comprised 1, 10, 100, 1,000, or 10,000 independent DMRs. From these original mixtures we sampled 14,500 diploid genomes (representing 100 ng of DNA), and further sampled 10, 100, 1,000, or 10,000 reads at each locus to represent sequencing coverage at those depths. This process was repeated 100 times for each combination of coverage, abundance, and number of features. We estimated the frequency of successful detection of at least 1 DMR for each combination of parameters and plotted probability curves (Figure 1A) to visually assess the effect of the number of features on the probability of successful detection as a function of sequencing depth.
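The simulation can be approximated with a short Python sketch. It is one possible reading of the procedure rather than the code used for Figure 1A, and it makes several assumptions that the text does not state: the percentage is treated as the fraction of tumor-derived genomes in the pool, every tumor genome carries all DMRs, genomes are drawn without replacement, reads at a locus are drawn independently, and a DMR counts as detected when at least one read at that locus comes from a tumor genome. The function name is hypothetical.

```python
import random


def detection_probability(tumor_fraction, n_dmrs, depth, n_reps=100,
                          pool_size=145_000, sampled=14_500, seed=0):
    """Estimate P(at least one tumor-derived read at >=1 DMR) by simulation."""
    rng = random.Random(seed)
    # Genomes 0..n_tumor-1 of the pool are tumor-derived (rounded to a whole genome).
    n_tumor = round(tumor_fraction * pool_size)
    hits = 0
    for _ in range(n_reps):
        # Draw the 14,500 genomes that make up the simulated 100 ng aliquot.
        drawn = rng.sample(range(pool_size), sampled)
        tumor_in_aliquot = sum(1 for g in drawn if g < n_tumor)
        f = tumor_in_aliquot / sampled
        # Chance that `depth` independent reads at one locus include a tumor read.
        p_locus = 1.0 - (1.0 - f) ** depth
        # Each DMR is an independent locus; stop at the first detection.
        if any(rng.random() < p_locus for _ in range(n_dmrs)):
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    for n_dmrs in (1, 10, 100, 1000, 10000):
        p = detection_probability(0.0001, n_dmrs, depth=100, n_reps=20)
        print(f"0.01% tumor fraction, {n_dmrs} DMRs, 100X: {p:.2f}")
```

Sweeping `tumor_fraction`, `n_dmrs`, and `depth` over the values listed above yields one curve per facet of the kind of plot described for Figure 1A.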

Derivation of tissue-discriminating features, development of a multi-tissue classifier, and validation on 450k data

The cfDNA MeDIP profiles were quantified using the MEDIPS R package [8], converted to RPKM, and then converted to log2 counts per million. Linear models were then fitted using limma-trend [9] on the matrix of features mapping to FANTOM5 enhancers, CpG islands, CpG shores, and CpG shelves, with the percentage of recovered spike-in methylated DNA included as a covariate to control for variation in pull-down efficiency. Pairwise contrasts were evaluated for each pair of tissue types, and the top 150 and bottom 150 DMRs were selected for training an elastic net classifier and validating cancer-type specificity. Performance metrics were derived from a majority vote of the out-of-fold calls of the model with the highest Kappa value in cross-validation, a heuristic previously used by Chakravarthy et al. [10].

Machine learning analysis to evaluate classification accuracy

Model training and evaluation on the discovery cohort

To evaluate the performance of cfMeDIP data in tumor classification without high computational cost, we reduced the initial set of possible candidate features to the windows containing CpG islands, CpG shores, CpG shelves, and FANTOM5 enhancers (referred to here as "regulatory features"), yielding a matrix of 196 samples and 505,027 features. We then used the caret R package to partition the discovery cohort data into 50 independent training and test sets in an 80%-20% manner (Figure 2A). The splits were made while maintaining the class proportions of the discovery cohort. Next, for each class versus the other classes, we used limma-trend on the training data partition to select the top 300 DMRs by moderated t-statistic (150 hypermethylated, 150 hypomethylated). These DMRs (up to 300 DMRs x 7 other classes = 2,100 features) were then used to train a binomial GLMnet, with 3 iterations of 10-fold cross-validation (CV) used to optimize the values of the mixing parameter (α, values = 0, 0.2, 0.5, 0.8, and 1) and the penalty (λ, values = 0-0.05 in increments of 0.01), with Cohen's Kappa as the performance metric. For each training set, this produced a collection of 6 one-class-versus-other-classes binomial classifiers.
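The partitioning and tuning scaffold can be illustrated in Python using only the standard library. This is a sketch under stated assumptions, not a reproduction of the caret and GLMnet workflow: `stratified_split` and `cohen_kappa` are hypothetical helper names, the class labels and counts in the example are invented purely for illustration, and the model fitting itself is omitted.

```python
import random
from collections import defaultdict
from itertools import product


def stratified_split(labels, train_frac=0.8, rng=None):
    """Split sample indices into train/test while keeping class proportions."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        idxs = idxs[:]              # copy so the caller's grouping is untouched
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train += idxs[:cut]
        test += idxs[cut:]
    return sorted(train), sorted(test)


def cohen_kappa(truth, pred):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(truth)
    classes = set(truth) | set(pred)
    p_obs = sum(t == p for t, p in zip(truth, pred)) / n
    p_exp = sum((truth.count(c) / n) * (pred.count(c) / n) for c in classes)
    if p_exp == 1.0:                # only one class present: kappa is undefined
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


# Tuning grid from the text: 5 alpha values x 6 lambda values.
ALPHAS = [0, 0.2, 0.5, 0.8, 1]
LAMBDAS = [round(0.01 * i, 2) for i in range(6)]    # 0.00, 0.01, ..., 0.05
GRID = list(product(ALPHAS, LAMBDAS))

# Illustrative labels only (not the discovery cohort composition).
rng = random.Random(1)
labels = ["AML"] * 10 + ["LUC"] * 20 + ["Normal"] * 20
splits = [stratified_split(labels, rng=rng) for _ in range(50)]
```

Each `(alpha, lambda)` pair in `GRID` would be scored by the mean Cohen's Kappa over the repeated 10-fold CV, and the best pair refit on the full training partition.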

We then estimated classification performance on the held-out test sets using the AUROC (area under the receiver operating characteristic curve). These estimates provide an unbiased measure of classification performance, because the held-out test set samples were not used for DMR pre-selection or for GLMnet training and tuning. Using 50 independent training and test sets also minimizes optimistic estimates arising from training-set bias.
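For reference, the AUROC of a one-class-versus-others classifier equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. A minimal Python sketch of that rank-based definition follows; the function name is hypothetical, and in practice an established statistics package would normally be used.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: classifier outputs, higher meaning more likely positive.
    labels: 1 for the class of interest, 0 for all other classes.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5     # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of a few hundred samples such as those described here.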

Model evaluation on the validation cohort

For each validation cohort cfMeDIP sample, we estimated the class probabilities from the AML, LUC, and normal one-vs-all binomial classifiers trained on the 50 different training sets within the discovery cohort. The probabilities from the 50 models were averaged to produce a single score, which was then used for AUROC estimation. We also applied the one-vs-all classifiers separately to early-stage (stages I and II) and late-stage (stages III and IV) LUC samples and estimated the AUROC for each, to assess whether disease stage affected performance.
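The averaging step can be written compactly in Python. In this sketch the input layout (one list of per-sample probabilities per model) and the function name are assumptions made for illustration.

```python
from statistics import fmean


def ensemble_score(per_model_probs):
    """Average class probabilities across models, one score per sample.

    per_model_probs: list with one entry per model; each entry is a list of
    probabilities for the same samples in the same order.
    """
    n_samples = len(per_model_probs[0])
    if any(len(model) != n_samples for model in per_model_probs):
        raise ValueError("every model must score the same samples")
    return [fmean(model[i] for model in per_model_probs)
            for i in range(n_samples)]
```

With 50 models per class, `per_model_probs` would hold 50 lists, and the averaged scores would feed directly into the AUROC estimate for that class.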

Results and discussion

We simulated, in silico, mixtures containing different proportions of ctDNA, from 0.001% to 10% (Figure 1A, column facets). We also simulated scenarios in which the ctDNA had 1, 10, 100, 1,000, or 10,000 DMRs (differentially methylated regions) compared with normal cfDNA (Figure 1A, row facets). Reads were then sampled at each locus at varying sequencing depths (10X, 100X, 1,000X, and 10,000X) (Figure 1A, x-axis). We found that the probability of detecting at least one cancer-specific event increased as the number of DMRs increased (Figure 1A), even at low ctDNA abundance and shallow coverage.

Moreover, pan-cancer data from The Cancer Genome Atlas (TCGA) show large numbers of DMRs between tumor and normal tissue across virtually all tumor types [11]. These findings therefore highlight that an assay that successfully recovers cancer-specific DNA methylation alterations from ctDNA could serve as a very sensitive tool to detect, classify, and monitor malignant disease at low sequencing-associated cost.

However, genome-wide mapping of DNA methylation in plasma cfDNA is challenging because circulating DNA is present in very small amounts and is fragmented [12]. Previous efforts to profile methylation in cfDNA have therefore been largely restricted to locus-specific PCR-based assays [2, 3], such as the FDA-approved SEPT9 methylation assay for colorectal cancer screening [13]. Although whole-genome bisulfite sequencing (WGBS) has recently been performed on fragmented cfDNA [14-16], the low genome-wide abundance of CpGs may reduce the amount of useful methylation-related information obtained from sequencing. The main drawbacks of WGBS for plasma DNA are thus high cost, low efficiency, and the DNA loss associated with bisulfite conversion. By contrast, a method that selectively enriches CpG-rich features prone to methylation is likely to maximize the amount of useful information per read, lower the cost, and reduce DNA loss.

A genome-wide method suitable for cfDNA methylation mapping

We developed a new method, termed cfMeDIP-seq (cell-free methylated DNA immunoprecipitation and high-throughput sequencing), to perform genome-wide DNA methylation mapping using cell-free DNA. The cfMeDIP-seq method described herein was developed by modifying an existing low-input MeDIP-seq protocol [7] that, in our experience, is very robust down to 100 ng of input DNA. However, the majority of plasma samples yield far less than 100 ng of DNA. To overcome this challenge, we added exogenous λ DNA (filler DNA) to the adapter-ligated cfDNA library to artificially inflate the amount of starting DNA to 100 ng. This minimizes non-specific binding by the antibody and also minimizes the DNA lost through binding to plasticware. The filler DNA consists of amplicons similar in size to an adapter-ligated cfDNA library and is composed of unmethylated and in vitro methylated DNA at different CpG densities. Adding this filler DNA also serves a practical purpose: because different patients yield different amounts of cfDNA, it allows the input DNA amount to be normalized to 100 ng. This ensures that the downstream protocol remains exactly the same for all samples, regardless of the amount of cfDNA available.
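The amount of filler DNA follows directly from the 100 ng target. The Python sketch below assumes, in line with the Methods, that the 0.3 ng methylated and 0.3 ng unmethylated Arabidopsis spike-in controls count toward the 100 ng total. The function name and the error handling are illustrative only.

```python
def filler_ng(cfdna_ng, target_ng=100.0, spike_in_ng=(0.3, 0.3)):
    """Nanograms of filler DNA needed so cfDNA + filler + controls = target."""
    if cfdna_ng < 0:
        raise ValueError("cfDNA amount cannot be negative")
    filler = target_ng - cfdna_ng - sum(spike_in_ng)
    if filler < 0:
        raise ValueError("cfDNA plus spike-ins already exceed the target")
    return filler


if __name__ == "__main__":
    for amount in (10.0, 5.0, 1.0):
        print(f"{amount} ng cfDNA -> {filler_ng(amount):.1f} ng filler")
```

Because the total is fixed, antibody, bead, and buffer volumes downstream do not need to be adjusted per sample.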

We first validated the cfMeDIP-seq protocol using DNA from the human colorectal cancer cell line HCT116, sheared to fragment sizes similar to those observed in cfDNA. HCT116 was chosen because of the availability of public DNA methylation data. In parallel, we performed the gold-standard MeDIP-seq protocol [7] using 100 ng of sheared cell line DNA and the cfMeDIP-seq protocol using 10 ng, 5 ng, and 1 ng of the same sheared cell line DNA. This was done in two biological replicates. Under all conditions, we obtained a reaction specificity above 99% (1 - [recovery of spiked-in unmethylated control DNA over recovery of spiked-in methylated control DNA]), as well as very strong enrichment of a known methylated region over an unmethylated region (TSH2B0 and GAPDH, respectively) (Figure 6B).

The libraries were sequenced to saturation (Figure 6A) at approximately 30 to 70 million reads per library (Supplementary Table 1). The raw reads were aligned to both the human genome and the λ genome, and virtually no reads aligned to the λ genome (Supplementary Table 1). The addition of exogenous λ DNA as filler DNA therefore did not interfere with the generation of sequencing data. Finally, we calculated the CpG enrichment score as a quality control measure for the immunoprecipitation step [8]. All libraries showed similar enrichment of CpGs, whereas the input control, as expected, showed no enrichment (Figure 6C), validating our immunoprecipitation even at extremely low input (1 ng).

Genome-wide correlation estimates comparing the different input DNA levels showed that both the MeDIP-seq (100 ng) and cfMeDIP-seq (10, 5, and 1 ng) methods are highly robust, with a Pearson correlation of at least 0.94 between any two biological replicates (Figure 1B). The analysis also showed that cfMeDIP-seq at 5 and 10 ng of input DNA robustly recapitulates the methylation profiles obtained by conventional MeDIP-seq at 100 ng (pairwise Pearson correlation of at least 0.9) (Figure 1B). The performance of cfMeDIP-seq at 1 ng of input DNA was reduced relative to MeDIP-seq at 100 ng, but it still showed a strong Pearson correlation of >0.7 (Figure 1B). We also observed that the cfMeDIP-seq protocol recapitulates the DNA methylation profile of HCT116 obtained with the gold-standard RRBS (reduced representation bisulfite sequencing) and WGBS (whole-genome bisulfite sequencing) methods (Figure 1C). Taken together, our data indicate that cfMeDIP-seq is a robust protocol for genome-wide methylation mapping of fragmented, low-input DNA material such as circulating cfDNA.

cfMeDIP-seq shows high sensitivity for the detection of tumor-derived ctDNA

To evaluate the sensitivity of the cfMeDIP-seq protocol, we performed a serial dilution of colorectal cancer (CRC) HCT116 cell line DNA into multiple myeloma (MM) MM1.S cell line DNA, both sheared to mimic cfDNA sizes. We diluted the CRC DNA from 100% through 10%, 1%, 0.1%, 0.01%, and 0.001% to 0% and performed cfMeDIP-seq on each dilution. We also performed ultra-deep (10,000X median coverage) targeted sequencing to detect three point mutations in the same samples. The number of DMRs observed at each CRC dilution point relative to pure MM DNA, identified at a 5% false discovery rate (FDR) threshold, was almost perfectly linear with the number of DMRs expected from the dilution factor, down to the 0.001% dilution (r² = 0.99, p < 0.0001) (Figure 1D). In addition, the DNA methylation signal within these DMRs showed an almost perfectly linear relationship between observed and expected signal (r² = 0.99, p < 0.0001) (Figure 1E; Supplementary Table 2B). By contrast, beyond the 1% dilution, ultra-deep targeted sequencing could not reliably distinguish the CRC-specific variants from spurious variants arising from PCR or sequencing errors (Figure 1F; Supplementary Table 2A). cfMeDIP-seq thus shows excellent sensitivity for the detection of cancer-derived DNA, exceeding the performance of variant detection by ultra-deep targeted sequencing using standard protocols.
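The observed-versus-expected comparison can be sketched in Python as follows. The text does not say how the expected values were derived, so the sketch assumes that the expected count at each dilution is the count at 100% CRC DNA scaled by the dilution fraction, and it computes r² as the squared Pearson correlation. Both function names are hypothetical, and no measured values are included.

```python
def expected_counts(count_at_full, fractions):
    """Expected DMR counts if the signal scales linearly with the CRC fraction."""
    return [count_at_full * f for f in fractions]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r squared is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)
```

Given the observed DMR counts per dilution, `r_squared(expected_counts(n_full, fractions), observed)` would give the kind of agreement statistic reported above.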

Cancer DNA is frequently hypermethylated at CpG-rich regions [17]. Because cfMeDIP-seq specifically targets methylated CpG-rich sequences, we hypothesized that ctDNA would be preferentially enriched during the immunoprecipitation. To test this, we generated patient-derived xenografts (PDXs) from two colorectal cancer patients and collected the mouse plasma. Tumor-derived human cfDNA was present at a frequency of less than 1% of the total cfDNA pool in the input samples and increased in abundance by 2-fold after immunoprecipitation (Figure 1G; Supplementary Table 3). These results suggest that the cfMeDIP procedure can further increase ctDNA detection sensitivity by biasing sequencing toward ctDNA.

Circulating plasma cfDNA methylation profiles can distinguish between multiple cancer types and healthy donors

DNA methylation patterns are tissue-specific and have been used to stratify cancer patients into clinically relevant disease subgroups in many cancer types, including glioblastoma [18], ependymoma [6], colorectal cancer [19], and breast cancer [20, 21]. We asked whether cfDNA methylation profiles could be used to identify the tissue of origin across multiple tumor types. To this end, we analyzed 196 samples from 5 different tumor types, including early- and late-stage tumors, as well as normal controls. We used linear modeling to identify the top 300 DMRs mapping to CpG shores, CpG shelves, CpG islands, and FANTOM5 enhancers in each pairwise comparison, yielding a total of 2,100 unique DMRs (Figure 2A). Density clustering based on t-distributed stochastic neighbor embedding (tSNE) [22] of the 196 plasma samples, using the methylation status of these features, showed distinct clustering of the samples by tissue of origin and tumor type (Figure 2B, Figure 2C). Using an elastic net multi-cancer classifier fitted to these features (Figure 2A), we observed highly accurate discrimination between the different tumor types (Figure 2D).

Discrimination of disease subtypes

We evaluated the ability of cfDNA MeDIP profiles to discriminate disease subtypes in five different scenarios: gene expression patterns (ER status in breast cancer), copy number aberrations (HER2 status in breast cancer), rearrangements (FLT3 ITD status in AML), point mutations (IDH mutation in GBM), and, finally, histology in lung cancer. In each case, linear models were used to select and rank features as described above, and hierarchical clustering was used to assess the grouping of samples. Density clustering based on t-distributed stochastic neighbor embedding (tSNE) [22], using the methylation status of the selected features, showed distinct clustering of the samples in each of these five instances of cancer subtype classification.

Using machine learning to detect cancer and classify cancer types

To rigorously evaluate the ability of cfMeDIP profiles to detect cancer and further classify cancer types, we next performed a set of machine learning analyses on the discovery cohort. To speed up the computational analysis, we initially reduced our cfMeDIP discovery cohort to the features mapping to CpG islands, CpG shores, CpG shelves, and FANTOM5 enhancers (n = 505,027 windows). We then applied a strategy to our discovery cohort samples designed to derive unbiased estimates of performance while accounting for training-set bias.

Here, we split the discovery cohort into balanced training and test sets (80% training, 20% test). Using only the samples in the training set, we selected the top 300 DMRs for each class (sample type) versus the other classes on the basis of the limma-trend test statistic, and used these features to train a series of one-class-versus-other-classes GLMnets on the training set data. The training procedure consisted of 3 rounds of 10-fold cross-validation (CV) over a grid of α and λ values, optimizing Cohen's Kappa. The use of multiple rounds of 10-fold CV was motivated by the desire to exploit additional randomization for more generalizable model tuning.

Performance was then evaluated using the AUROC (area under the receiver operating characteristic curve) derived from the test set samples, which were held out from the DMR selection and the subsequent GLMnet training and tuning steps. This process was repeated by splitting the discovery cohort into different training and test sets 50 times, to mitigate the effect of training-set bias. For each class-versus-other-classes comparison, 50 models were ultimately collected (480 models in total). We therefore refer to this collection of models as E50.

We next evaluated performance across batches by generating a validation cohort of 152 additional plasma samples: AML (n = 35), lung cancer (n = 55), and healthy control (n = 62) samples. For each class, we averaged the class probabilities output by the models in E50 and estimated the AUROC for one class versus all other classes (Figure 3A). The classifiers showed high AUROC values for AML versus others (0.993), LUC versus others (0.943), and normal versus others (1.000). This further confirms the ability of cfMeDIP-seq, combined with machine learning, to accurately detect and classify tumor types. Finally, we observed that the classifier was as accurate on early-stage samples (0.950) as on late-stage samples (0.934) (Figure 3B), indicating that the method is suitable for early cancer detection as well as for the detection of both early- and late-stage cancers.

Other advantages of cfDNA methylome profiling using cfMeDIP-seq

The ability of cfDNA methylation patterns to accurately represent the tissue of origin also overcomes a limitation of mutation-based assays, whose specificity for the tissue of origin can be low because potential driver mutations recur in many cancers across different tissues [23]. Mutation-based assays can also be rendered insensitive by the clonal structure of tumors, where subclonal drivers may be harder to detect because of their lower abundance in ctDNA [24]. Mutation-based ctDNA approaches are also vulnerable to potential confounding by driver mutations in benign tissue, which have been observed [25] and for which evidence of positive selection has been documented [26].

Taken together, our findings, based on the largest collection of cancer cfDNA methylomes obtained to date, establish cfMeDIP-seq as an efficient and cost-effective tool with the potential to influence cancer management and early detection. The accuracy and versatility of cfMeDIP-seq may be useful for informing treatment decisions in settings where drug resistance is associated with epigenetic alterations, such as sensitivity to androgen receptor inhibition in prostate cancer [27]. The potential opportunity for early diagnosis and screening is particularly evident in lung cancer, where screening has already shown clinical utility but the existing screening test (low-dose CT scanning) has notable limitations, such as exposure to ionizing radiation and a high false-positive rate.

In conclusion, our results highlight the utility of cfDNA methylation profiles as a basis for non-invasive, cost-effective, sensitive, and highly accurate early tumor detection, multi-cancer classification, and cancer subtype classification.

Although preferred embodiments of the invention have been described herein, those skilled in the art will understand that variations may be made thereto without departing from the spirit of the invention or the scope of the appended claims. All documents disclosed herein, including those in the following reference list, are incorporated herein by reference.

Table 1: Read counts and mapping efficiency, against the human (Hg19) genome and the λ genome, of sequenced MeDIP-seq (100 ng, Rep 1 and Rep 2) and cfMeDIP-seq (10 ng, 5 ng, and 1 ng, Rep 1 and Rep 2) libraries prepared from various starting inputs of HCT116 cell line DNA sheared to mimic cfDNA. Two biological replicates were used for each starting input. For starting inputs below 100 ng, exogenous λ DNA was added to the sample before MeDIP to artificially inflate the starting amount to 100 ng.

Table 2A: Average coverage of ultra-deep targeted variant sequencing for serial dilutions of CRC cell line HCT116 DNA into MM cell line MM1.S DNA.

Table 2B: DMRs and DNA methylation signal observed for serial dilutions of CRC cell line HCT116 DNA into MM cell line MM1.S DNA.

Table 3: Read counts and mapping efficiencies of cfMeDIP-seq libraries from PDX and input control samples after alignment to the human (Hg19) genome.

References

1. Diaz, L.A., Jr. and A. Bardelli, Liquid biopsies: genotyping circulating tumor DNA. J Clin Oncol, 2014. 32(6): p. 579-86.

2. Lehmann-Werman, R., et al., Identification of tissue-specific cell death using methylation patterns of circulating DNA. Proc Natl Acad Sci U S A, 2016. 113(13): p. E1826-34.

3. Visvanathan, K., et al., Monitoring of Serum DNA Methylation as an Early Independent Marker of Response and Survival in Metastatic Breast Cancer: TBCRC 005 Prospective Biomarker Study. J Clin Oncol, 2016: p. JCO2015662080.

4. Newman, A.M., et al., An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat Med, 2014. 20(5): p. 548-54.

5. Aravanis, A.M., M. Lee, and R.D. Klausner, Next-Generation Sequencing of Circulating Tumor DNA for Early Cancer Detection. Cell, 2017. 168(4): p. 571-574.

6. Mack, S.C., et al., Epigenomic alterations define lethal CIMP-positive ependymomas of infancy. Nature, 2014. 506(7489): p. 445-50.

7. Taiwo, O., et al., Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc, 2012. 7(4): p. 617-36.

8. Lienhard, M., et al., MEDIPS: genome-wide differential coverage analysis of sequencing data derived from DNA enrichment experiments. Bioinformatics, 2014. 30(2): p. 284-6.

9. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.

10. Chakravarthy, A., et al., Human Papillomavirus Drives Tumor Development Throughout the Head and Neck: Improved Prognosis Is Associated With an Immune Response Largely Restricted to the Oropharynx. J Clin Oncol, 2016. 34(34): p. 4132-4141.

11. Hoadley, K.A., et al., Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 2014. 158(4): p. 929-44.

12. Fleischhacker, M. and B. Schmidt, Circulating nucleic acids (CNAs) and cancer--a survey. Biochim Biophys Acta, 2007. 1775(1): p. 181-232.

13. Potter, N.T., et al., Validation of a real-time PCR-based qualitative assay for the detection of methylated SEPT9 DNA in human plasma. Clin Chem, 2014. 60(9): p. 1183-91.

14. Legendre, C., et al., Whole-genome bisulfite sequencing of cell-free DNA identifies signature associated with metastatic breast cancer. Clin Epigenetics, 2015. 7: p. 100.

15. Sun, K., et al., Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc Natl Acad Sci U S A, 2015. 112(40): p. E5503-12.

16. Chan, K.C., et al., Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc Natl Acad Sci U S A, 2013. 110(47): p. 18761-8.

17. Sharma, S., T.K. Kelly, and P.A. Jones, Epigenetics in cancer. Carcinogenesis, 2010. 31(1): p. 27-36.

18. Sturm, D., et al., Hotspot mutations in H3F3A and IDH1 define distinct epigenetic and biological subgroups of glioblastoma. Cancer Cell, 2012. 22(4): p. 425-37.

19. Hinoue, T., et al., Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res, 2012. 22(2): p. 271-82.

20. Stirzaker, C., et al., Methylome sequencing in triple-negative breast cancer reveals distinct methylation clusters with prognostic value. Nat Commun, 2015. 6: p. 5899.

21. Fang, F., et al., Breast cancer methylomes establish an epigenomic foundation for metastasis. Sci Transl Med, 2011. 3(75): p. 75ra25.

22. Laurens van der Maaten, G.H., Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.

23. Kandoth, C., et al., Mutational landscape and significance across 12 major cancer types. Nature, 2013. 502(7471): p. 333-9.

24. McGranahan, N., et al., Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Sci Transl Med, 2015. 7(283): p. 283ra54.

25. Zauber, P., S. Marotta, and M. Sabbath-Solitare, KRAS gene mutations are more common in colorectal villous adenomas and in situ carcinomas than in carcinomas. Int J Mol Epidemiol Genet, 2013. 4(1): p. 1-10.

26. Martincorena, I., et al., Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science, 2015. 348(6237): p. 880-6.

27. Beltran, H., et al., Divergent clonal evolution of castration-resistant neuroendocrine prostate cancer. 2016. 22(3): p. 298-305.

Claims

1. A method for detecting the presence of DNA from cancer cells in a subject, comprising:

(a) providing a sample of cell-free DNA from the subject;

(b) subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA;

(c) adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, and then optionally denaturing the sample;

(d) capturing cell-free methylated DNA using a binding agent that is selective for methylated polynucleotides;

(e) sequencing the captured cell-free methylated DNA;

(f) comparing the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancerous individuals; and

(g) identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more of the captured cell-free methylated DNA sequences and the cell-free methylated DNA sequences from cancerous individuals.

2. The method of claim 1, wherein the sample is derived from the subject's blood or plasma.

3. The method according to claim 1 or 2, wherein the comparison step (f) is based on a fit using a statistical classifier.

4. The method of claim 3, wherein the classifier is derived from machine learning.

5. A method for detecting the presence of DNA from cancer cells and identifying cancer subtypes, the method comprising:

a. receiving sequencing data of cell-free methylated DNA from a subject sample;

b. comparing the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancerous individuals;

c. identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more of the captured cell-free methylated DNA sequences and the cell-free methylated DNA sequences from cancerous individuals; and

d. if DNA from cancer cells is identified in step c, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

6. A computer-implemented method for detecting the presence of DNA from cancer cells and identifying cancer subtypes, the method comprising:

a. receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample;

b. comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancerous individuals;

c. identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between one or more of the captured cell-free methylated DNA sequences and the cell-free methylated DNA sequences from cancerous individuals; and

d. if DNA from cancer cells is identified in step c, further identifying the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

7. A computer program product for use in conjunction with a general-purpose computer having a processor and a memory connected to the processor, the computer program product comprising a computer-readable storage medium having a computer program mechanism encoded thereon, wherein the computer program mechanism can be loaded into the memory of the computer and cause the computer to perform the method of claim 6.

8. A computer-readable medium having a data structure thereon for storing the computer program product of claim 7.

9. An apparatus for detecting the presence of DNA from cancer cells and identifying cancer subtypes, said apparatus comprising:

at least one processor; and

an electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to:

a. receive sequencing data of cell-free methylated DNA from a subject sample;

b. compare the sequences of the captured cell-free methylated DNA with control cell-free methylated DNA sequences from healthy and cancerous individuals;

c. identify the presence of DNA from cancer cells if there is a statistically significant similarity between one or more of the captured cell-free methylated DNA sequences and the cell-free methylated DNA sequences from cancerous individuals; and

d. if DNA from cancer cells is identified in step c, further identify the cancer cell tissue of origin and the cancer subtype based on the comparison in step b.

10. A method for detecting the presence of DNA from cancer cells and determining the location of cancer cells from two or more possible organs, the method comprising:

(a) providing a sample of cell-free DNA from the subject;

(b) capturing cell-free methylated DNA from the sample using a binding agent that is selective for methylated polynucleotides;

(c) sequencing the captured cell-free methylated DNA;

(d) comparing the sequence patterns of the captured cell-free methylated DNA with the DNA sequence patterns of two or more populations of control individuals, each of the two or more populations having localized cancer in a different organ; and

(e) determining the organ in which the cancer cells are located based on a statistically significant similarity between the methylation pattern of the cell-free DNA and that of one of the two or more populations.
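By way of illustration only, the comparison steps recited in the claims (comparing a subject's captured methylation profile with control profiles from healthy individuals and from populations with cancer in different organs) can be sketched as a nearest-centroid comparison using Pearson correlation. This is a stand-in technique chosen for brevity and is not asserted to be the classifier of the described embodiments. The function names, the group labels, and all numeric values are invented for the example, and the sketch omits the statistical significance test that the claims require.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length signal vectors."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def centroid(profiles):
    """Per-region average signal across the profiles of one control group."""
    return [mean(column) for column in zip(*profiles)]


def classify(sample, controls):
    """Return the control group whose centroid best correlates with sample.

    controls maps a group label to a list of per-region signal vectors.
    No significance test is performed; this is a similarity ranking only.
    """
    scores = {
        label: pearson(sample, centroid(profiles))
        for label, profiles in controls.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Invented toy data: methylation signal over four hypothetical regions.
controls = {
    "healthy": [[1.0, 1.2, 0.9, 1.1], [0.9, 1.1, 1.0, 1.0]],
    "lung": [[3.0, 1.0, 0.2, 1.1], [2.8, 1.1, 0.3, 0.9]],
    "colorectal": [[1.0, 3.1, 1.0, 0.2], [1.1, 2.9, 0.9, 0.3]],
}
sample = [2.9, 1.0, 0.25, 1.0]

label, scores = classify(sample, controls)

if __name__ == "__main__":
    print(label)
```

In this toy example the sample correlates most strongly with the hypothetical lung group. A practical implementation would operate on many thousands of differentially methylated regions and would attach a significance estimate to the call, for example through a trained statistical classifier as recited in claims 3 and 4.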