CN113278706B - Method for distinguishing somatic mutation from germline mutation

Info

Publication number: CN113278706B
Application number: CN202110835236.7A
Authority: CN
Inventors: 刘成林; 王俊; 张周; 揣少坤; 汉雨生
Original assignee: Guangzhou Burning Rock Dx Co ltd
Current assignee: Guangzhou Burning Rock Dx Co ltd
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2021-11-12
Anticipated expiration: 2041-07-23
Also published as: CN113278706A

Abstract

The present application relates to a method for distinguishing somatic mutations from germline mutations, comprising: obtaining at least one mutation site from a sample of a subject; obtaining wild-type support fragments and mutant support fragments, wherein a wild-type support fragment is a cfDNA fragment containing a wild-type base sequence, a mutant support fragment is a cfDNA fragment containing a mutant base sequence, the wild-type base sequence is identical to the nucleotide sequence of a human reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from it; obtaining the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length, and calculating the difference between the proportion of wild-type support fragments of that length among all wild-type support fragments and the proportion of mutant support fragments of that length among all mutant support fragments; and using the difference as the index for discrimination. A method and apparatus for identifying ctDNA in cfDNA are also provided. The method can be used for tumor pedigree management and TMB detection.

Description

Method for distinguishing somatic mutation from germline mutation

Technical Field

The application relates to the field of bioinformatics, in particular to a method for distinguishing somatic mutations from germline mutations.

Background

cfDNA is widely present in the plasma of tumor patients and includes small amounts of tumor-specific ctDNA. ctDNA differs from other, normal cfDNA in the way it is cleaved during cellular senescence and apoptosis. In other words, the fragment length distribution of ctDNA among the free DNA in plasma differs from that of conventional cfDNA. This difference in distribution pattern can therefore serve as a marker for recognizing ctDNA.

Somatic mutations are non-heritable variations, distinct from germline mutations, that accumulate gradually over a person's lifetime. Because they are closely associated with the molecular signaling pathways of tumorigenesis, somatic mutations are important markers of tumor formation. Germline mutations are heritable mutations that occur in germ cells and are of great significance for the study of genetic diseases and genomic evolution. The "Chinese Expert Consensus on the Detection and Clinical Application of Tumor Mutation Burden (2020 edition)" states that, among the standardization requirements for the tumor mutation burden (TMB) algorithm, the core element is the detection and calculation of somatic mutations that can affect protein coding. Because the currently public population databases are derived mainly from European and American populations and are not well suited to TMB detection in the Chinese population, the consensus recommends that, when somatic mutations are determined for TMB, either a control sample (peripheral blood or paracancerous tissue) be used to remove the patient's germline variants, or a background database built from a large-sample germline mutation database of the Chinese population be used to filter germline variants. Correctly distinguishing the type and origin of mutations in cells therefore plays an important role in the classification, treatment and prognosis of tumors.

However, current methods for identifying somatic mutations rely mainly on testing a matched sample. Sequencing the matched sample in parallel can accurately determine the origin of a mutation, but when no matched material was collected at the first sampling, collecting a matched sample afterwards is very difficult. In addition, high-throughput sequencing of the matched sample at the same depth as the tumor sample consumes substantial funds and computational resources. This approach therefore places high demands on the completeness of sample collection and on computing and storage resources, and markedly increases the cost of mutation detection. The alternative approaches, mutation frequency filtering and comparison against mutation annotation databases, still fall short in accuracy.

Disclosure of Invention

The application provides a method for distinguishing somatic mutations from germline mutations, a method for identifying ctDNA in cfDNA, and corresponding devices and applications of the methods. The methods and/or devices described herein have at least one of the following features: (1) only a single sample, i.e., a sample derived from the subject, is needed; (2) broad applicability, in that the method can identify somatic mutations and/or ctDNA across different cancer types; (3) high sensitivity; (4) high accuracy, for example, in addition to a mutation database, population frequency and mutation abundance, multiple further factors can be incorporated into the method to improve the reliability of the discrimination result; (5) ease of implementation, with no limit on the number of mutation sites; (6) rapid operation, for example, the plasma of a subject can be used as the sample; (7) a new dimension of differentiation is introduced.

In one aspect, the present application provides a method for differentiating somatic mutations from germline mutations comprising the steps of:

(1) obtaining at least one mutation site from a sample of a subject;

(2) obtaining, for each mutation site, wild-type support fragments and mutant support fragments, wherein a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;

(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(4) using the difference as an index to distinguish whether the mutation site is a somatic mutation or a germline mutation.
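The ratio calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names, the list-of-lengths input format and the sign convention (WC minus MC) are assumptions, since the text only speaks of the difference between the two ratios, and the fragment lengths are toy values.

```python
from collections import Counter


def length_fractions(lengths):
    """Map each fragment length to its share of all fragments in the group."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one support fragment is required")
    return {length: n / total for length, n in counts.items()}


def fraction_differences(wt_lengths, mut_lengths):
    """Return {length: WC - MC} for every length seen in either group.

    Assumption: the difference is taken as WC - MC; a length that is
    absent from one group contributes a ratio of 0 for that group.
    """
    wc = length_fractions(wt_lengths)
    mc = length_fractions(mut_lengths)
    return {
        length: wc.get(length, 0.0) - mc.get(length, 0.0)
        for length in sorted(set(wc) | set(mc))
    }


# Toy fragment lengths (in nucleotides) for one mutation site.
wild_type = [166, 166, 167, 170]
mutant = [145, 150, 166, 166]
diffs = fraction_differences(wild_type, mutant)
```

Because both sets of ratios sum to 1, the differences over all lengths sum to 0; what carries information is how the differences are distributed across lengths.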

In one aspect, the present application provides a method for identifying ctDNA in cfDNA, comprising the steps of:

(1) obtaining at least one mutation site from a sample of a subject;

(2) obtaining, for each mutation site, wild-type support fragments and mutant support fragments, wherein a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;

(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(4) using the difference as an index to identify whether the mutation site is located in ctDNA.

In one aspect, the present application provides a training method for a machine learning model, which includes the following steps:

(1) obtaining at least one mutation site from a sample of a subject;

(2) obtaining, for each mutation site, wild-type support fragments and mutant support fragments, wherein a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;

(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(4) inputting the difference into the machine learning model as a training index to perform machine learning training.
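To make the training step concrete, the sketch below fits the simplest possible model, a single cut-off on one difference-derived index per mutation site. The text at this point does not name a model type, so the threshold learner, the function names and the toy values and labels are all assumptions chosen for illustration; a real model would take more features and be validated on held-out data.

```python
def fit_threshold(values, labels):
    """Pick the cut-off and direction that best separate the two classes.

    values: one difference-derived index per mutation site.
    labels: True for somatic, False for germline (assumed to be known,
            e.g. from matched samples).
    Returns (threshold, somatic_if_above, training_accuracy).
    """
    if len(values) != len(labels) or not values:
        raise ValueError("need equally sized, non-empty inputs")
    ordered = sorted(set(values))
    # Candidate cut-offs: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])] or ordered
    best = None
    for threshold in candidates:
        for somatic_if_above in (True, False):
            correct = sum(
                ((v > threshold) == somatic_if_above) == label
                for v, label in zip(values, labels)
            )
            accuracy = correct / len(values)
            if best is None or accuracy > best[2]:
                best = (threshold, somatic_if_above, accuracy)
    return best


def predict(value, threshold, somatic_if_above):
    """Return True if the site is called somatic under the fitted rule."""
    return (value > threshold) == somatic_if_above


# Toy training data: index value per site and its known origin.
train_values = [0.02, 0.03, 0.05, 0.30, 0.35, 0.41]
train_labels = [False, False, False, True, True, True]
threshold, somatic_if_above, accuracy = fit_threshold(train_values, train_labels)
```

The direction of the rule is learned rather than fixed, so the sketch does not depend on which sign convention is used for the difference.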

In one aspect, the present application provides a database establishing method, which includes the following steps:

(1) obtaining at least one mutation site from a sample of a subject;

(2) obtaining, for each mutation site, wild-type support fragments and mutant support fragments, wherein a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;

(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(4) storing the difference in a database for distinguishing somatic mutations from germline mutations and/or for identifying ctDNA in cfDNA.
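One way such a database could be organized is shown below using SQLite from the Python standard library. The table name, column names and the sample and site identifiers are hypothetical; the text does not prescribe a schema or a storage engine.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a store with one row per site and fragment length."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS length_difference (
               sample_id TEXT NOT NULL,
               chrom TEXT NOT NULL,
               pos INTEGER NOT NULL,
               fragment_length INTEGER NOT NULL,
               difference REAL NOT NULL,
               PRIMARY KEY (sample_id, chrom, pos, fragment_length)
           )"""
    )
    return conn


def store_differences(conn, sample_id, chrom, pos, diffs):
    """Save {fragment length: difference} for one mutation site."""
    rows = [(sample_id, chrom, pos, length, d) for length, d in diffs.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO length_difference VALUES (?, ?, ?, ?, ?)",
            rows,
        )


def load_differences(conn, sample_id, chrom, pos):
    """Read back the differences for one site, ordered by fragment length."""
    cur = conn.execute(
        "SELECT fragment_length, difference FROM length_difference "
        "WHERE sample_id = ? AND chrom = ? AND pos = ? "
        "ORDER BY fragment_length",
        (sample_id, chrom, pos),
    )
    return dict(cur.fetchall())


# Hypothetical sample and site identifiers with toy difference values.
conn = create_store()
store_differences(conn, "S1", "chr17", 12345, {150: -0.25, 166: 0.0, 170: 0.25})
```

Keying rows by sample, site and fragment length keeps the per-length differences retrievable for later discrimination or model training.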

In certain embodiments, the gene sequencing comprises next generation gene sequencing (NGS).

In certain embodiments, the methods use only a sample derived from the subject.

In certain embodiments, the sample comprises a blood sample.

In certain embodiments, the method further comprises the step of obtaining a sample derived from a subject.

In certain embodiments, the mutation site comprises a Single Nucleotide Variation (SNV).

In certain embodiments, the mutation site comprises more than two nucleotide variations.

In certain embodiments, the length of the wild-type support fragment and/or the mutant support fragment ranges from about 1 nucleotide to about 550 nucleotides.

In certain embodiments, the length of the wild-type support fragment and/or the mutant support fragment ranges from about 1 nucleotide to about 400 nucleotides.

In certain embodiments, the length of the wild-type support fragment and/or the mutant support fragment ranges from about 1 nucleotide to about 200 nucleotides.

In certain embodiments, the method comprises the steps of: (4') obtaining a distribution of the differences of step (3), selecting the maximum value of the distribution as Dev (Max), and using Dev (Max) as the index for discrimination and/or as the training sample.

In certain embodiments, the method comprises the steps of: (4') obtaining a distribution of said differences of step (3), called first distribution.

In certain embodiments, the method comprises the steps of: (5) sequentially accumulating each difference in the first distribution over the length range of an effective fragment interval, the effective fragment interval covering the length of the nucleic acid sequence wrapped around the nucleosome, to obtain an accumulated value.

In certain embodiments, the nucleic acid sequence is wrapped around the nucleosome for more than 2 turns or for less than 1 turn.

In certain embodiments, the effective fragment interval is from about 1 to about 167 nucleotides in length, and/or, about 200 or more nucleotides in length.

In certain embodiments, the effective fragment interval is from about 1 to about 167 nucleotides in length, and/or from about 250 to about 400 nucleotides in length.

In certain embodiments, the method comprises the steps of: (6) obtaining a second distribution of the accumulated values of step (5), and calculating the maximum of the accumulated values in the second distribution. In some embodiments, the maximum accumulated value is taken as Dev (Max), and Dev (Max) is used as the index for discrimination and/or as the training sample.
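The accumulation and maximum described in steps (5) and (6) can be sketched as a running sum. This is an illustrative reading under stated assumptions: the function names are hypothetical, the default interval of 1 to 167 nucleotides is one of the ranges mentioned in the text, the running sum is carried in order of increasing fragment length across all selected intervals, and the input values are toy numbers.

```python
from itertools import accumulate


def cumulative_differences(diffs, intervals=((1, 167),)):
    """Running sum of per-length differences inside the effective interval(s).

    diffs: {fragment length: difference between WC and MC}.
    intervals: inclusive (low, high) length ranges to keep.
    Returns {fragment length: sum of differences up to that length}.
    """
    lengths = sorted(
        length for length in diffs
        if any(low <= length <= high for low, high in intervals)
    )
    sums = accumulate(diffs[length] for length in lengths)
    return dict(zip(lengths, sums))


def dev_max(diffs, intervals=((1, 167),)):
    """Largest accumulated value, used here as the Dev (Max) index."""
    cumulative = cumulative_differences(diffs, intervals)
    if not cumulative:
        raise ValueError("no difference falls inside the effective fragment interval")
    return max(cumulative.values())


# Toy differences keyed by fragment length (nucleotides).
toy = {140: 0.1, 150: 0.2, 166: -0.25, 300: 0.5}
short_only = dev_max(toy)
with_long = dev_max(toy, intervals=((1, 167), (250, 400)))
```

With the default interval the running sum peaks at length 150 (about 0.3); adding the 250 to 400 range brings in the length-300 fragment and raises the peak to about 0.55.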

In some embodiments, the difference is smoothed, wherein the smoothing comprises the steps of:

(a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to about 10;

(b) determining a plurality of smoothing sample length ranges, each covering a number of length values equal to the smoothing window value, wherein the minimum value of each smoothing sample length range is its starting length, and the starting length is selected from the range of lengths of the wild-type and/or mutant support fragments;

(c) for at least one smoothing sample length within any smoothing sample length range, obtaining the number of wild-type support fragments of that length and the number of mutant support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(d) calculating an average difference for the smoothing sample length range from the differences of the at least one smoothing sample length;

(e) taking the resulting average difference as the representative value of the smoothing sample length range.
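The smoothing steps amount to a moving average over fragment lengths, which can be sketched as follows. Several details are assumptions not fixed by the text: one length range is started at every length between the shortest and longest observed fragment (so ranges overlap), lengths with no recorded difference count as 0, and the function name and toy values are illustrative.

```python
def smooth_differences(diffs, window=3):
    """Average per-length differences over sliding length ranges.

    diffs: {fragment length: difference between WC and MC}; lengths
           missing from the mapping are treated as a difference of 0.
    window: smoothing window value (number of lengths per range).
    Returns {starting length: mean difference over
             [start, start + window - 1]}.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    if not diffs:
        return {}
    shortest, longest = min(diffs), max(diffs)
    smoothed = {}
    for start in range(shortest, longest + 1):
        values = [diffs.get(length, 0.0) for length in range(start, start + window)]
        smoothed[start] = sum(values) / window
    return smoothed


# Toy differences keyed by fragment length (nucleotides).
raw = {100: 0.3, 101: 0.0, 102: -0.3, 103: 0.6}
smoothed = smooth_differences(raw, window=3)
```

A window of 1 leaves the differences unchanged, while larger windows damp the length-to-length noise that arises when few fragments support a site.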

In certain embodiments, the smoothing window value is an integer from about 2 to 6.

In certain embodiments, the smoothing window value is 3.

In some embodiments, the smoothing process comprises the steps of: (f) obtaining a first distribution of the average difference values of step (e).

In some embodiments, the smoothing process comprises the steps of: (g) sequentially accumulating each average difference in the first distribution over the length range of the effective fragment interval, wherein the length of the effective fragment interval is the length of the nucleic acid sequence wrapped around the nucleosome, to obtain an accumulated value.

In some embodiments, the smoothing process comprises the steps of: (h) obtaining a second distribution of the accumulated values of step (g), and calculating the maximum of the accumulated values in the second distribution.

In some embodiments, the maximum accumulated value is taken as Dev (Max), and Dev (Max) is used as the index for discrimination and/or as the training sample.

In certain embodiments, the indicator further comprises one or more of the following parameters: the chromosome position of the mutation site, the base substitution pattern of the mutation site, the count value of nucleic acid fragments with various lengths in a wild type of the mutation site and/or the count value of nucleic acid fragments with various lengths in a mutant type of the mutation site, the allelic variation of the mutation site, the age of a subject and the mutation type of the mutation site.

In certain embodiments, the indicator further comprises one or more of the following parameters: the chromosome position of the SNV locus, the base substitution pattern of the SNV locus, the count value of nucleic acid fragments with various lengths in a wild type of the SNV locus and/or the count value of nucleic acid fragments with various lengths in a mutant type of the SNV locus, the allelic variation of the SNV locus, the age of a subject and the mutation type of the SNV locus.

In certain embodiments, detecting the mutation site comprises the steps of:

(1) obtaining data from the sample; (2) performing variation identification on the data obtained in the step (1); (3) performing variation annotation on the variation identified in the step (2); and, (4) filtering the variation annotated in step (3) to obtain a mutation site; optionally, quality control is performed on the mutation site.

In another aspect, the present application provides an apparatus for distinguishing somatic mutations from germline mutations, comprising:

a calculating module for calculating the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length are determined; the ratio WC is the ratio of the number of wild-type support fragments of a given length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing; and the mutation site is obtained by a gene sequencing method; and

a judging module for obtaining an identification result for the somatic mutation from a trained machine learning model, wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample.

In another aspect, the present application provides an apparatus for identifying ctDNA in cfDNA, including:

a calculating module for calculating the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length are determined; the ratio WC is the ratio of the number of wild-type support fragments of a given length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing; and the mutation site is obtained by a gene sequencing method; and

a judging module for obtaining a determination result identifying ctDNA in the cfDNA from a trained machine learning model, wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample.

In another aspect, the present application provides a training apparatus for a machine learning model, including:

a calculating module for calculating the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the number of wild-type support fragments of at least one length and the number of mutant support fragments of the same length are determined; the ratio WC is the ratio of the number of wild-type support fragments of a given length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; a wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant support fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is identical to the nucleotide sequence of a reference genome at the position corresponding to the mutation site, the mutant base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by a gene sequencing method; and

a training module for inputting the difference into the machine learning model as a training sample to perform machine learning training.

In certain embodiments, the device uses only a sample derived from the subject.

In some embodiments, the apparatus further comprises: an output module, configured to display the identification result of the somatic mutation generated by the determination module and/or the determination result of ctDNA identified in the cfDNA.

In certain embodiments, the device further comprises a sample acquisition module for acquiring the sample of the subject.

In certain embodiments, the sample comprises a blood sample.

In certain embodiments, the sample acquisition module comprises reagents and/or instruments for acquiring the sample.

In certain embodiments, the device further comprises a data receiving module for obtaining the mutation site in the sample.

In certain embodiments, detecting the mutation site in the device comprises the steps of:

In certain embodiments, the data receiving module comprises reagents and/or instrumentation required for sequencing the gene.

In certain embodiments, the apparatus further comprises an input module to obtain the number of wild-type support fragments of the at least one length, and/or the corresponding number of mutant support fragments of the same length.

In certain embodiments, the input module is capable of distinguishing between the wild-type support fragment and the mutant support fragment.

In some embodiments, the input module counts the number of wild-type support fragments of each length and counts the number of mutant support fragments of each length.
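The counting performed by the input module can be illustrated with a short sketch. The input format, a pair of the base observed at the mutation site and the fragment length for each cfDNA fragment, is hypothetical, as is the function name; following the definitions in the text, any fragment whose base differs from the reference is counted as mutant, whereas a production pipeline would typically also apply base and mapping quality filters.

```python
from collections import Counter


def count_support_fragments(fragments, ref_base):
    """Split fragments covering a site by allele and count them per length.

    fragments: iterable of (base_at_site, fragment_length) pairs, one per
               cfDNA fragment covering the mutation site.
    ref_base: base of the reference genome at the site.
    Returns (wild_type_counts, mutant_counts), Counters keyed by length.
    """
    wild_type, mutant = Counter(), Counter()
    for base, length in fragments:
        # Same base as the reference: wild-type support; otherwise mutant.
        target = wild_type if base.upper() == ref_base.upper() else mutant
        target[length] += 1
    return wild_type, mutant


# Toy fragments for a site whose reference base is C.
observed = [("C", 166), ("C", 166), ("T", 150), ("c", 170), ("T", 150)]
wt_counts, mut_counts = count_support_fragments(observed, ref_base="C")
```

The two counters hold exactly the per-length fragment numbers from which the ratios WC and MC are formed.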

In some embodiments, the computing module obtains a distribution of the differences, selects the maximum value of the distribution as Dev (Max), and uses Dev (Max) as the index for discrimination and/or as the training sample.

In some embodiments, the computing module smooths the difference, wherein the smoothing comprises: (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to about 10; (b) determining a plurality of smoothing sample length ranges, each covering a number of length values equal to the smoothing window value, wherein the minimum value of each smoothing sample length range is its starting length, and the starting length is selected from the range of lengths of the wild-type and/or mutant support fragments; (c) for at least one smoothing sample length within any smoothing sample length range, obtaining the number of wild-type support fragments of that length and the number of mutant support fragments of the same length,

In certain embodiments, the smoothing window value is 3.

In some embodiments, the smoothing process comprises the steps of: (g) sequentially accumulating each average difference in the first distribution over the length range of an effective fragment interval, the effective fragment interval covering the length of the nucleic acid sequence wrapped around the nucleosome, to obtain an accumulated value.

In certain embodiments, the computing module outputs Dev (Max).

In certain embodiments, the index and/or training sample further comprises one or more of the following parameters: the chromosome position of the mutation site, the base substitution pattern of the mutation site, the count value of nucleic acid fragments with various lengths in a wild type of the mutation site and/or the count value of nucleic acid fragments with various lengths in a mutant type of the mutation site, the allelic variation of the mutation site, the age of a subject and the mutation type of the mutation site.

In certain embodiments, the index and/or training sample further comprises one or more of the following parameters: the chromosome position of the SNV locus, the base substitution pattern of the SNV locus, the count value of nucleic acid fragments with various lengths in a wild type of the SNV locus and/or the count value of nucleic acid fragments with various lengths in a mutant type of the SNV locus, the allelic variation of the SNV locus, the age of a subject and the mutation type of the SNV locus.

In another aspect, the present application provides an electronic device comprising a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the method of distinguishing somatic mutations from germline mutations described herein, the method of identifying ctDNA in cfDNA described herein, or the method of training a machine learning model described herein.

In another aspect, the present application provides a non-transitory computer readable storage medium storing a computer program that, when executed by a processor, performs the method of distinguishing somatic mutations from germline mutations described herein, the method of identifying ctDNA in cfDNA described herein, or the method of training a machine learning model described herein.

In another aspect, the present application provides a database system comprising a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the method of distinguishing somatic mutations from germline mutations described herein, the method of identifying ctDNA in cfDNA described herein, or the database establishing method described herein.

In another aspect, the present application provides the use of the method of differentiating somatic mutations from germline mutations described herein for tumor pedigree management.

In another aspect, the present application provides the use of the method of differentiating somatic mutations from germline mutations described herein in the detection of Tumor Mutation Burden (TMB).

Other aspects and advantages of the present application will be readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the disclosure of the present application enables them to modify the specific embodiments disclosed without departing from the spirit and scope of the invention to which the present application relates. Accordingly, the drawings and the description in the specification of the present application are illustrative only and not limiting.

Drawings

The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention to which this application relates will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings. The brief description of the drawings is as follows:

FIG. 1 shows the training set used to train a machine learning model according to the methods described herein, and the validation sets used to test the distinction between somatic mutations and germline mutations according to the methods described herein.

FIG. 2 shows machine training results of a machine learning model obtained using the methods described herein.

FIG. 3 shows how the machine learning model obtained by the methods described herein distinguishes somatic mutations from germline mutations in validation set 1.

FIG. 4 shows how the machine learning model obtained by the methods described herein distinguishes somatic mutations from germline mutations in validation set 2.

FIG. 5 shows that somatic and germline mutations can be distinguished for different tumor species using the methods described herein.

FIG. 6 shows the AUC results of the methods described herein for distinguishing somatic mutations from germline mutations in a test set of mixed cancer types.

FIG. 7 shows the AUC results of the model for distinguishing somatic mutations from germline mutations in 11 cancer types using the methods described herein.

FIG. 8 shows the distribution of the lengths of the wild type support fragment and the mutant support fragment for a mutation site of human chromosome 4.

FIG. 9 shows the distribution of the lengths of the wild-type support fragment and the mutant support fragment for a mutation site of human chromosome 5.

FIG. 10 shows the distribution of the lengths of the wild-type support fragment and the mutant support fragment for a mutation site of human chromosome 17.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification.

Definition of terms

In the present application, the term "somatic mutation" generally refers to a class of acquired mutations that occur in non-embryonic cells. In the present application, the somatic mutation may include a genetic change that occurs in a somatic tissue (e.g., a non-germline cell). In the present application, the somatic mutations can include point mutations (e.g., exchange of a single nucleotide for another nucleotide (e.g., silent, missense, and nonsense mutations)), insertions and deletions (e.g., addition and/or removal of one or more nucleotides (e.g., indels)), amplifications, gene duplications, Copy Number Alterations (CNAs), rearrangements, and splice variants. The somatic mutations may be closely related to the growth, programming, senescence and apoptosis processes of the cells. For example, the somatic mutation may be associated with altered signaling pathways in tumorigenesis, angiogenesis and/or metastasis of a tumor.

In the present application, the term "germline mutation" generally refers to a heritable mutation that occurs in germ cells (e.g., eggs or sperm). The germline mutation can be passed on to the offspring; for example, it can be carried into the DNA of each cell (e.g., germline and somatic cells) of the offspring. The germline mutation may be less correlated with the occurrence of a tumor. For example, the germline mutation can serve as a "baseline" in the TMB analysis.

In this application, the term "gene sequencing" generally refers to a technique for determining the order of the nucleotide bases adenine, guanine, cytosine, and thymine in a DNA molecule. The gene sequencing may include first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing, or Single Molecule Sequencing (SMS). Second-generation or next-generation gene sequencing may refer to techniques that use advanced (e.g., optical) methods of detecting base position while generating many sequences (see, for example, the overview by Metzker, 2009). "Next-generation gene sequencing" or "next-generation sequencing" (NGS) is a high-throughput sequencing technique that can sequence hundreds of thousands to millions of DNA molecules in parallel at a time, typically with short read lengths. Classified by development history, influence, sequencing principle, technical differences and the like, the main methods are: Massively Parallel Signature Sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, the DNA nanoarray method of Complete Genomics, combinatorial probe-anchor ligation sequencing, and the like. Next-generation sequencing enables a detailed and comprehensive analysis of the transcriptome and genome of a species and is therefore also referred to as deep sequencing.

In this application, the term "mutation site" generally refers to the site of a nucleotide that differs from the nucleotide sequence of a control sequence. For example, the control sequence may be a reference sequence used in gene sequencing (e.g., may be a human reference genome). In the present application, the mutation site may include a difference in the nucleotide sequence (e.g., a nucleotide substitution, repetition, deletion, and/or addition) at at least 1 (e.g., 1, 2, 3, 4, or more) site. For example, the mutation site may include a nucleotide mutation at at least 1 nucleotide site. The nucleotide mutation may be a natural mutation or an artificial mutation. The mutation site may comprise a Single Nucleotide Variation (SNV).

In the present application, the term "wild-type base sequence" generally refers to the same sequence as compared to the nucleotide sequence of a reference genome (which may be, for example, a human reference genome) at the corresponding position of the mutation site. In some cases, the wild-type base sequence may be a nucleotide sequence of a human reference genome at a position corresponding to the mutation site. In some cases, the wild-type base sequence may not include the mutation site for a particular mutation site described herein.

In the present application, the term "mutant base sequence" generally refers to a sequence that is different compared to the nucleotide sequence of a reference genome (which may be, for example, a human reference genome) at the corresponding position of the mutation site. In some cases, the mutant base sequence may include the mutation site for a particular mutation site described herein.

In the present application, the term "wild-type support fragment" generally refers to a cfDNA fragment comprising the wild-type base sequence described herein. In the present application, wild-type support fragments for a particular mutation site described herein may have different sequence lengths. In some cases, the wild-type support fragment may not contain the mutation site for a particular mutation site described herein. In some cases, a fragment classified with respect to one particular mutation site may or may not contain that mutation site, and may or may not contain another mutation site described herein. The term "length of the wild-type support fragment" refers to the length of the wild-type support fragment described herein, expressed as a number of nucleotides.

In the present application, the term "mutant support fragment" generally refers to a cfDNA fragment comprising a mutant base sequence described herein. In some cases, the mutant support fragment may contain the mutation site for a particular mutation site described herein. In some cases, a fragment classified with respect to one particular mutation site may or may not contain that mutation site, and may or may not contain another mutation site described herein. The term "length of the mutant support fragment" refers to the length of the mutant support fragment described herein, expressed as a number of nucleotides.

In the present application, the term "human reference genome" generally refers to a human genome that can serve a reference function in gene sequencing. Information on the human reference genome is available from UCSC (http://genome.ucsc.edu/index.html). The human reference genome may have different versions, for example, it may be hg19, GRCh37, or Ensembl 75.

In the present application, the term "at the corresponding position" generally refers to the relationship between the position of a specific base in one sequence and the position of that base in another sequence. For example, the corresponding position may be the nucleotide position of the mutation site in the wild-type base sequence or the mutant base sequence described herein, or the position of the mutation site in the reference genome described herein. For example, if the mutation site in the mutant base sequence is the 100th nucleotide, then the corresponding position in the reference genome may be the 100th position of the corresponding sequence in the reference genome.

In the present application, the term "cfDNA" generally refers to the abbreviation of Cell free DNA, which may refer to plasma free DNA. For example, the cfDNA may be an extracellular DNA fragment located in the peripheral circulation.

In the present application, the term "ctDNA" generally refers to circulating tumor DNA. ctDNA is fragmented DNA of tumor origin that is not associated with cells in blood. The ctDNA may be produced when the genome of apoptotic or necrotic tumor cells enters the blood. The ctDNA may carry gene characteristics specific to the primary tumor or metastatic tumor. The ctDNA may be considered a particular type of cfDNA.

In this application, the term "machine learning model" generally refers to a collection of system or program instructions and/or data configured to implement an algorithm, process, or mathematical model. In the present application, the algorithm, process, or mathematical model may predict and provide a desired output based on a given input. In the present application, the parameters of the machine learning model may not be explicitly programmed, and in the traditional sense, the machine learning model may not be explicitly designed to follow specific rules in order to provide the desired output for a given input. For example, the use of the machine learning model may mean that the machine learning model and/or a data structure/set of rules as a machine learning model are trained by a machine learning algorithm.

In this application, the term "database" generally refers to an organized entity of related data, regardless of the manner in which the data or organized entity is represented. For example, the organized entity of related data may take the form of a table, map, grid, packet, datagram, file, document, list, or any other form. In the present application, the database may include any data collected and maintained in a computer accessible manner.

In this application, the term "Single Nucleotide Variation (SNV)" generally refers to a variation in a single nucleotide that occurs at a specific location in a genome that is different (e.g., a substitution, a duplication, a deletion, or an addition of one nucleotide) from a nucleotide at a corresponding location in a reference genome (e.g., a human reference genome described herein).

In this application, the term "smoothing process" generally refers to a method of data processing that reduces the variance among more than one of the differences described herein. For example, the smoothing process may include obtaining an average of a number of differences described herein. For example, the smoothing process may include selecting, at a certain interval length (for example, the smoothing window value described in the present application), the numbers of the wild-type support fragments and/or the mutant support fragments corresponding to different lengths (for example, the smoothed sample lengths described in the present application), and calculating the difference between the ratio of the number of the wild-type support fragments to the total number of the wild-type support fragments and the ratio of the number of the mutant support fragments to the total number of the mutant support fragments. For example, the smoothing process may include dividing the accumulated value of the differences within a certain length range by the interval length to obtain a ratio. For example, the ratio may be considered the average difference for that length range.

In the present application, the term "smoothing window value" generally refers to the interval, in nucleotides, at which the wild-type and/or mutant support fragments of different lengths are selected during the smoothing process described herein. For example, in the smoothing process, the lengths of the wild-type support fragments and/or the mutant support fragments selected may be 1, 4, 7, 10, 13 … nucleotides in sequence, in which case the smoothing window value is 3. The smoothing window value may be an integer from about 1-30, and may be, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. For example, it may be 1, 2, 3, 4, 5 or 6.

In the present application, the term "smoothed sample length" generally refers to the length value of the wild-type support fragments selected for counting and/or the length value of the mutant support fragments selected for counting in the smoothing process described herein. For example, the smoothed sample length may be each fragment length value that falls within a smoothed sample length range, where each such range lies within the range of lengths of the wild-type support fragment and/or the mutant support fragment described herein. For example, each smoothed sample length range may run from a starting length (e.g., 1 nucleotide) to a maximum value (e.g., starting length + (smoothing window value - 1)), and the smoothed sample lengths are the fragment length values within that range. For example, if the smoothing window value is 3 and the starting length is 1 nucleotide, the smoothed sample length ranges can be 1-3, 4-6, 7-9 …; with the same smoothing window value and starting length, the smoothed sample length ranges can also be 1-3, 2-4, 3-5 …. In the present application, the starting length may also be other than 1 (e.g., 2 nucleotides). For example, if the smoothing window value is 3 and the starting length is 2 nucleotides, the smoothed sample length ranges may be 2-4, 5-7, 8-10 …, or may also be 2-4, 3-5, 4-6 ….
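The range examples above can be reproduced with a minimal Python sketch. The function name `smoothing_ranges`, its parameters, and the choice to drop a trailing range that would exceed the maximum fragment length are illustrative assumptions and are not taken from the present application.

```python
def smoothing_ranges(start, window, max_len, overlap=False):
    """Return inclusive (min, max) smoothed sample length ranges.

    start   : starting length in nucleotides (e.g., 1 or 2)
    window  : smoothing window value (e.g., 3)
    max_len : largest fragment length considered (assumed upper bound)
    overlap : if True, consecutive ranges are shifted by 1 nucleotide;
              if False, they are shifted by the window value
    """
    step = 1 if overlap else window
    ranges = []
    lo = start
    # Assumption: a range is kept only if it fits entirely below max_len.
    while lo + window - 1 <= max_len:
        ranges.append((lo, lo + window - 1))
        lo += step
    return ranges


print(smoothing_ranges(1, 3, 9))                # non-overlapping, start 1
print(smoothing_ranges(1, 3, 5, overlap=True))  # overlapping, start 1
print(smoothing_ranges(2, 3, 10))               # non-overlapping, start 2
```

With a smoothing window value of 3, the first call yields the ranges 1-3, 4-6, 7-9 and the second yields 1-3, 2-4, 3-5, matching the examples given for a starting length of 1.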

In this application, the term "first distribution" generally refers to the distribution of the average difference over the various smoothed sample lengths described herein. In some cases, the first distribution may be a collection of average difference values as described herein.

In this application, the term "length of nucleic acid sequence wound around nucleosomes" generally refers to the length required for one nucleic acid sequence to wind around nucleosomes. For example, the nucleic acid sequence may be wound around the nucleosome a certain number of turns (e.g., less than one turn, or more than 2 turns).

In the present application, the term "length of the valid fragment interval" generally refers to the range of lengths of the wild-type and/or mutant support fragments required for calculating the additive value described herein.

In the present application, the term "second distribution" generally refers to the distribution of the additive values described herein. In some cases, the second distribution may be a set of the additive values described herein.

In the present application, the term "calculation module" generally refers to a functional module for calculating the difference between the number of wild-type support fragments described herein and the number of mutant-type support fragments described herein of the same length. The calculation module can input the number of wild-type support fragments described herein, and the number of mutant support fragments of correspondingly the same length. The calculation module may output the difference values described herein. For example, the Dev (Max) described herein may be output. In the calculation module, the smoothing process described in the present application may be performed.

In the present application, the term "judgment module" generally refers to a module for obtaining relevant judgment results according to a machine learning model that has undergone machine learning training (for example, the judgment results may include the identification results of somatic mutations described in the present application, and/or the results of identifying ctDNA in cfDNA described in the present application). In this application, the judgment module may take as input the difference value (e.g., the Dev (Max)) described in this application. The judgment module may output the related judgment result. In the judgment module, the judgment may be performed by means of the machine learning model.

In this application, the term "training module" generally refers to a functional module for inputting the difference values described herein (e.g., the Dev (Max)) as training samples to the machine learning model for machine learning training. The "machine learning" may refer to an artificial intelligence system configured to learn from data without explicit programming. The "machine learning model" may be a collection of parameters and functions, the parameters of which may be trained on a set of training samples. The parameters and functions may be a set of linear algebraic operations, non-linear algebraic operations, and tensor algebraic operations. The parameters and functions may include statistical functions, tests, and probabilistic models. In this application, the training module may take as input the difference value described in this application (e.g., the Dev (Max)). The training module may output a machine learning model that has undergone machine learning training.

In the present application, the term "output module" generally refers to a functional module for displaying the identification result of the somatic mutation produced by the judgment module described in the present application and/or the result of identifying ctDNA in the cfDNA. For example, the output module may include a display, which may display (e.g., in the form of a graph and/or text) the identification result of the somatic mutation generated by the judgment module described herein and/or the result of identifying ctDNA in the cfDNA.

In the present application, the term "sample obtaining module" generally refers to a functional module for obtaining said sample of a subject. For example, the sample acquisition module may include reagents and/or instruments necessary to obtain the sample (e.g., a blood sample). For example, lancets, blood collection tubes and/or blood sample transport containers may be included. The sample acquisition module can output a sample as described herein.

In this application, the term "data receiving module" generally refers to the functional module used to obtain the mutation site in the sample. In the present application, the data receiving module may input a sample (e.g., a blood sample) as described herein. The data receiving module may output the mutation site. The data receiving module can detect the mutation site of the sample. For example, the data receiving module can perform gene sequencing (e.g., next generation gene sequencing) as described herein on the sample. For example, the data receiving module may include reagents and/or instrumentation necessary to perform the gene sequencing. The data receiving module can detect the single nucleotide variation.

In the present application, the term "input module" generally refers to a functional module for obtaining the number of said wild type support fragments of said at least one length, and/or the number of said corresponding mutant support fragments of the same length. In the present application, the input module can input the mutation sites described herein. The input module can output (e.g., can display) the number of wild-type support fragments of the at least one length and/or the corresponding number of mutant support fragments of the same length. The input module may include reagents and/or instruments capable of counting a particular length of the wild-type support fragment. The input module may include reagents and/or instruments capable of counting the mutant support fragments of a particular length. The input module can identify the lengths of the wild-type support segments and count the lengths respectively; the input module may identify the lengths of the mutant support fragments and count them separately. The input module can determine whether the length of the wild-type support fragment and the length of the mutant fragment are the same.

In the present application, the term "tumor pedigree management" generally refers to providing tumor-related help to familial hereditary tumor patients, their relatives, and/or high risk populations. For example, the tumor pedigree management may include counseling and/or implementation of genetic counseling for the population, detection and interpretation of tumor-associated genes, risk assessment of developing tumors, preventive intervention.

In this application, the term "Tumor Mutation Burden (TMB)" generally refers to the number of somatic non-synonymous mutations per megabase (Mb) in a particular genomic region, usually expressed as mutations per megabase (XX mutations/Mb), as defined in the Chinese expert consensus on tumor mutation burden detection and clinical application (2020 edition). The TMB may serve as a biomarker associated with an immunotherapeutic response. The TMB may indirectly reflect the ability and extent of a tumor to produce new antigens and has been shown to predict the response to immunotherapy; e.g., the NSCLC guidelines (2019, version 1) indicate that TMB is useful for identifying lung cancer patients eligible to receive "Nivolumab + Ipilimumab" combination immunotherapy and "Nivolumab" monotherapy. TMB levels may be related to a variety of factors, such as microsatellite instability (e.g., MSI-H) and the presence of certain driver genes.
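As a worked illustration of the "mutations per megabase" unit in the definition above, the following minimal Python sketch computes a TMB value. The function name, the parameter names, and the example numbers are hypothetical; identifying which mutations are somatic and non-synonymous is assumed to have been done upstream.

```python
def tumor_mutation_burden(n_nonsynonymous_somatic, region_size_bp):
    """Return TMB in mutations/Mb.

    n_nonsynonymous_somatic : count of somatic non-synonymous mutations
                              found in the region (assumed already filtered)
    region_size_bp          : size of the analysed genomic region in base pairs
    """
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    region_size_mb = region_size_bp / 1_000_000  # base pairs -> megabases
    return n_nonsynonymous_somatic / region_size_mb


# Hypothetical example: 15 mutations found in a 1.5 Mb region.
print(tumor_mutation_burden(15, 1_500_000))
```

Because germline variants must be excluded from the count, a misclassified germline mutation directly inflates the value returned here, which is why the somatic/germline distinction matters for TMB detection.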

In the present application, the term "comprising" is generally intended to include the explicitly specified features, but not to exclude other elements.

In the present application, the term "about" generally means varying from 0.5% to 10% above or below the stated value, for example, varying from 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the stated value.

Detailed Description

Method

In the present application, the gene sequencing may include next generation gene sequencing (NGS). In the present application, the NGS may be selected from the group consisting of: Solexa sequencing technology, 454 sequencing technology, SOLiD sequencing technology, Complete Genomics sequencing method, and semiconductor (Ion Torrent) sequencing technology. The gene sequencing can be high throughput, e.g., hundreds of thousands to millions of DNA molecules can be sequenced at a time. The gene sequencing may be short-read; for example, NGS read lengths may be no more than 500 bp.

In the present application, the gene sequencing may comprise the steps of: (1) constructing a library; for example, modifying the ends of the DNA molecules and adding an adaptor (e.g., a Y-shaped adaptor may be formed), followed by PCR amplification; (2) sequencing; for example, performing DNA replication using oligonucleotides as primers and library fragments as templates, then performing "bridge" amplification and sequencing by synthesis. Index sequencing primers are then added and the Index sequences in the adaptors are read to determine to which library the DNA at each site belongs.

In the present application, the method may use only a sample derived from the subject. In the present application, the method may not require the use of a paired sample. The methods described herein can therefore greatly reduce the requirements for a sample of a subject.

In the present application, the sample may comprise a blood sample.

In the present application, the method may further comprise the steps of: a sample derived from a subject is obtained. For example, a step of taking a blood sample from the subject using a lancet system may be included. The method of obtaining a sample may include a vacuum blood collection tube method.

In the present application, the mutation site may include a Single Nucleotide Variation (SNV). In the present application, the mutation site may comprise two or more nucleotide variations. For example, the mutation site described herein may include 1 SNV, or may include two or more SNVs (e.g., may include 2, 3, 4, 5, 6, 7, 8, 9, 10, or more nucleotide variations). In the present application, the nucleotide sequence at the position of the mutation site of the wild-type support fragment and the mutant support fragment differs for a particular said mutation site. The mutation site may include substitution of nucleotides, and may also include deletion and/or insertion of nucleotides in some cases. In the present application, the mutation site may include substitution of nucleotides.

In the present application, the classification of a fragment as a wild-type support fragment and/or a mutant support fragment may be made with respect to a particular one of the mutation sites. For example, with respect to a given mutation site, a fragment can be considered a wild-type support fragment if its nucleotide sequence at the mutation site is identical to the nucleotide sequence of the reference genome at the corresponding position of the mutation site; a fragment can be considered a mutant support fragment if its nucleotide sequence at the mutation site is different from the nucleotide sequence of the reference genome at the corresponding position of the mutation site.
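The per-site classification described above can be sketched in Python as follows. This is a simplified illustration only: the function name, the 1-based coordinate convention, and the assumption of a gap-free alignment with a single-nucleotide substitution are all hypothetical and are not specified in the present application.

```python
def classify_fragment(frag_start, frag_seq, site_pos, ref_base):
    """Classify one cfDNA fragment with respect to one mutation site.

    frag_start : 1-based reference coordinate of the fragment's first base
    frag_seq   : fragment sequence, assumed aligned without gaps
    site_pos   : 1-based reference coordinate of the mutation site
    ref_base   : base of the reference genome at site_pos

    Returns "wild", "mutant", or None if the fragment does not cover the site.
    """
    offset = site_pos - frag_start
    if offset < 0 or offset >= len(frag_seq):
        return None  # the fragment does not span this mutation site
    observed = frag_seq[offset].upper()
    return "wild" if observed == ref_base.upper() else "mutant"


# Hypothetical site at position 105 whose reference base is G.
print(classify_fragment(100, "ACGTAGCC", 105, "G"))  # base at site is G
print(classify_fragment(100, "ACGTATCC", 105, "G"))  # base at site is T
print(classify_fragment(110, "ACGTAGCC", 105, "G"))  # site not covered
```

The same fragment can be passed to the function again with a different `site_pos`, which reflects that the classification is made separately for each mutation site.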

In the present application, the length of the wild-type support fragment and/or the mutant support fragment may range from about 1 nucleotide to about 550 nucleotides (e.g., may be about 1-500, about 1-450, about 1-400, about 1-350, about 1-300, about 1-250, about 1-200, or about 1-100). For example, it may be from about 1 nucleotide to about 400 nucleotides. For example, it may be from about 1 nucleotide to about 200 nucleotides.

In the present application, the method may comprise the steps of: (4') obtaining a distribution of said differences of step (3), selecting the maximum value of said distribution as Dev (Max), and using said Dev (Max) as the discrimination index and/or as said training sample.

In this application, the distribution may be a set of the difference values. The Dev (Max) may be the maximum of the differences in the set.
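Selecting Dev (Max) from a set of differences can be illustrated with a short Python sketch. Representing the distribution as a mapping from fragment length to difference is an assumption made here for illustration, as are the function name and the example values.

```python
def dev_max(differences):
    """Return (length, value) for the largest difference in the set.

    differences : mapping of fragment length -> difference value
                  (hypothetical representation of the distribution)
    """
    if not differences:
        raise ValueError("the set of differences is empty")
    best_length = max(differences, key=differences.get)
    return best_length, differences[best_length]


# Hypothetical differences for three fragment lengths.
example = {150: 0.02, 166: -0.05, 140: 0.07}
length, value = dev_max(example)
print(length, value)
```

Returning the length alongside the value is a design choice of this sketch; only the maximum value itself is named Dev (Max) in the text.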

In this application, the difference may be smoothed. In the present application, through the smoothing process, the difference can reflect more intuitively and accurately the difference between the number of wild-type support fragments and the number of mutant support fragments of the same length. Further, the smoothed difference can more accurately, specifically, and/or sensitively distinguish the somatic mutation from the germline mutation, and/or identify ctDNA in the cfDNA.

In the present application, the smoothing process may include the steps of:

(a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1-10;

(b) determining smoothed sample length ranges, wherein the minimum value of each smoothed sample length range is the starting length and the maximum value of each smoothed sample length range is the starting length + (smoothing window value - 1), i.e., the length of each smoothed sample length range is equal to the determined smoothing window value; wherein the starting length lies within the range of lengths of the wild-type support fragment and/or the mutant support fragment;

(c) obtaining the number of wild-type support fragments of at least one smoothed sample length in any smoothed sample length range, and obtaining the number of corresponding mutant support fragments of the same length; calculating the ratio WC of the number of said wild-type support fragments of that length to the total number of said wild-type support fragments; calculating the ratio MC of the number of the mutant support fragments of the same length to the total number of the mutant support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;

(d) accumulating each of the differences obtained in step (c) and dividing by the smoothing window value to obtain an average difference for the smoothed sample length range;

(e) taking the resulting average difference as a representative value for the smoothed sample length range.

In the present application, the smoothing window value may be adjusted for different subject conditions, different genetic sequencing methods and/or different discrimination purposes, as long as the smoothing window value is selected such that the smoothing process is performed. In the present application, the smoothing window value may be an integer of about 2-6 (e.g., the smoothing window value may be 2, 3, 4, 5, or 6). For example, the smoothing window value may be 3.

In the present application, the smoothing process may include the following specific steps:

(a) determining a smoothing window value; wherein the smoothing window value is an integer of about 1-10 (e.g., the smoothing window value is selected to be 3);

(b) determining smoothed sample length ranges, wherein the minimum value of each smoothed sample length range is a starting length, and the maximum value of each smoothed sample length range is the starting length + (smoothing window value - 1); wherein the starting length lies within the range of lengths of the wild-type support fragment and/or the mutant support fragment (e.g., from about 1 nucleotide to about 400 nucleotides); in the present application, the starting length may be any length within the range of lengths of the wild-type and/or mutant support fragment. In the present application, the "length" can be measured in terms of the number of nucleotides.

In the present application, the minimum values of the smoothed sample length ranges may be the first term, the second term, the third term, and so on up to the nth term of an arithmetic progression within the range of lengths of the wild-type support fragment and/or the mutant support fragment, with the starting length as the first term and the smoothing window value as the common difference. For example, when the smoothing window value is 3 and the starting length is 1, the minimum values may be 1, 4, 7, 10 … 400 in order, within the range of about 1 nucleotide to about 400 nucleotides.

For example, when the starting length is 1 and the smoothing window value is 3, if the respective smoothed sample length ranges do not overlap with each other, the smoothed sample length ranges may be 1-3, 4-6, 7-9 …. For example, when the starting length is 1 and the smoothing window value is 3, if the respective smoothed sample length ranges may overlap with each other, the smoothed sample length ranges may be 1-3, 2-4, 3-5 …, or 1-3, 3-5, 5-7 …. For another example, when the starting length is 2 and the smoothing window value is 3, if the smoothed sample length ranges do not overlap with each other, the smoothed sample length ranges may be 2-4, 5-7, 8-10 …. For example, when the starting length is 2 and the smoothing window value is 3, if the respective smoothed sample length ranges may overlap with each other, the smoothed sample length ranges may be 2-4, 3-5, 4-6 ….

(c) Obtaining the number of the wild-type supported fragments of at least one (e.g., at least 1, at least 2, at least 3 or more) smoothed sampling length in any one smoothed sampling length range, obtaining the number of the corresponding mutant-type supported fragments of the same length, and calculating the ratio WC of the number of the wild-type supported fragments of the length to the total number of the wild-type supported fragments; calculating the ratio MC of the number of the mutant support fragments of the same length to the total number of the mutant support fragments; calculating the difference between the ratio WC and the ratio MC at the same length.

For example, the number of the wild-type support fragments of 1 nucleotide in length is obtained and divided by the total number of the wild-type support fragments, W_total, to obtain a ratio WC1; the number of the mutant support fragments of 1 nucleotide in length is obtained and divided by the total number of the mutant support fragments, M_total, to obtain a ratio MC1; and the difference WC1 - MC1 between the ratio WC1 and the ratio MC1 is calculated. Similarly, the number of the wild-type support fragments of 4 nucleotides in length is obtained and divided by W_total to obtain a ratio WC4; the number of the mutant support fragments of 4 nucleotides in length is obtained and divided by M_total to obtain a ratio MC4; and the difference WC4 - MC4 is calculated. In this way, the differences of the ratios corresponding to different smoothed sample lengths (such as 1, 4, 7, 10 … 400) are obtained respectively. For example, within each smoothed sample length range, the difference between the ratio of the number of wild-type support fragments to the total number of wild-type support fragments and the ratio of the number of mutant support fragments to the total number of mutant support fragments can be obtained for each smoothed sample length. For example, (WC1 - MC1), (WC2 - MC2), and (WC3 - MC3) can be obtained for the smoothed sample length range 1-3.
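The ratio calculation in step (c) can be sketched in Python as follows. The input format (one list of fragment lengths for the wild-type support fragments and one for the mutant support fragments at a single mutation site) and the function name are assumptions made for illustration; the fragment lengths in the example are invented.

```python
from collections import Counter


def ratio_differences(wild_lengths, mutant_lengths):
    """Return {length: WC - MC} for every length seen in either group.

    wild_lengths   : list of lengths of wild-type support fragments
    mutant_lengths : list of lengths of mutant support fragments
    """
    w_total, m_total = len(wild_lengths), len(mutant_lengths)
    if w_total == 0 or m_total == 0:
        raise ValueError("both groups must contain at least one fragment")
    w_counts, m_counts = Counter(wild_lengths), Counter(mutant_lengths)
    diffs = {}
    for length in sorted(set(w_counts) | set(m_counts)):
        wc = w_counts[length] / w_total   # ratio WC at this length
        mc = m_counts[length] / m_total   # ratio MC at this length
        diffs[length] = wc - mc
    return diffs


# Hypothetical fragment lengths (in nucleotides) at one mutation site.
wild = [166, 166, 166, 150]
mutant = [150, 150, 140, 166]
print(ratio_differences(wild, mutant))
```

A `Counter` returns 0 for a length that does not occur in one group, so a length observed only among the mutant support fragments still receives a (negative) difference.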

(d) Calculating an average difference value of the range of smoothed sample lengths based on said difference value of said at least one smoothed sample length; for example, the sum of (WC 1-MC 1), (WC 2-MC 2), and (WC 3-MC 3) is calculated and divided by the smoothing window value to obtain an average difference. Alternatively, only a portion of the difference values in a single range of smoothed sample lengths may be calculated, for example: (WC 1-MC 1) and (WC 3-MC 3), and then calculating their average values as an average difference value;

(e) the resulting average difference is taken as a representative value of the average difference for the range of smoothed sample lengths. For example, the average difference value B1 obtained by dividing the accumulated value of (WC 1-MC 1), (WC 2-MC 2) and (WC 3-MC 3) by the smoothing window value 3 can be taken as a representative value of the smoothing sample length range. For example, the average difference value B4 obtained by dividing the accumulated value of (WC 4-MC 4), (WC 5-MC 5) and (WC 6-MC 6) by the smoothing window value 3 can be taken as a representative value of the smoothing sample length range.

In the present application, the smoothing process may include the steps of: (f) obtaining a first distribution of the average difference values of step (e). For example, the respective accumulated values B1, B4, B7, and the like are formed into the first distribution D = [ B1, B4, B7 … … B400 ].

In this application, the smoothing process may further include the steps of: (g) sequentially accumulating each average difference in the first distribution over a length of an effective fragment interval covering the length of the nucleosome-wrapped nucleic acid sequence to obtain an additive value.

In the present application, the nucleic acid sequence can be wound around the nucleosome for more than 2 turns, or for less than 1 turn. For example, the effective fragment interval can be from about 1 to about 180 nucleotides in length (e.g., from about 1 to about 180, from about 1 to about 179, from about 1 to about 178, from about 1 to about 177, from about 1 to about 176, from about 1 to about 175, from about 1 to about 174, from about 1 to about 173, from about 1 to about 172, from about 1 to about 171, from about 1 to about 170, from about 1 to about 169, from about 1 to about 168, from about 1 to about 167, from about 1 to about 166, or from about 1 to about 165), and/or can be more than about 200 nucleotides in length (e.g., more than about 200, more than about 210, more than about 220, more than about 230, more than about 240, more than about 250, more than about 260, more than about 270, more than about 280, more than about 290, more than about 300, more than about 350, or more than about 400). For example, the effective fragment interval can be from about 1 to about 167 nucleotides in length, and/or from about 250 to about 400 nucleotides in length.

For example, B1 and B4 in the first distribution may be added up to obtain an added value D1; B1, B4, and B7 in the first distribution may be added up to obtain an added value D2.

In the present application, the smoothing process may include the steps of: (h) obtaining a second distribution of the added values of step (g), and calculating the maximum of the added values in the second distribution. For example, the added values D1, D2, and the like may be formed into the second distribution A = [ D1, D2 … … Di ], wherein i may be the length of the effective fragment interval.

In this application, the maximum value in the second distribution may be taken as dev (max). In this application, the dev (max) may be used as the index of discrimination and/or as the training sample.
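For example, steps (c) to (h) above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the present application: the function name `dev_max` and its parameters are hypothetical, the fragment lengths are assumed to be supplied as plain lists of integers, and a single effective fragment interval is assumed (1 to 167 nucleotides by default).

```python
from collections import Counter


def dev_max(wild_lengths, mutant_lengths, window=3, max_len=400, interval=(1, 167)):
    """Return Dev(max) for one mutation site (hypothetical helper, illustrative only)."""
    if not wild_lengths or not mutant_lengths:
        raise ValueError("both fragment groups must be non-empty")
    w_counts, m_counts = Counter(wild_lengths), Counter(mutant_lengths)
    w_total, m_total = len(wild_lengths), len(mutant_lengths)

    def diff(i):
        # Step (c): difference of the per-length ratios, WC_i - MC_i
        return w_counts[i] / w_total - m_counts[i] / m_total

    # Steps (d)-(f): one average difference B_j per window start j = 1, 1 + window, ...
    first = [
        (j, sum(diff(i) for i in range(j, min(j + window, max_len + 1))) / window)
        for j in range(1, max_len + 1, window)
    ]

    # Step (g): running sums of B_j inside the effective fragment interval
    low, high = interval
    running, prefix = 0.0, []
    for j, b in first:
        if low <= j <= high:
            running += b
            prefix.append(running)

    # Step (h): the text defines D1 = B1 + B4, so the single-term sum is dropped
    second = prefix[1:]
    if not second:
        raise ValueError("effective interval covers fewer than two windows")
    return max(second)
```

Dividing a shortened final window by the full window value is a simplification chosen for this sketch; the text does not specify how an incomplete window is handled.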

In the present application, in order to further improve the accuracy, sensitivity and/or specificity of the method described herein, other parameters may be used as the index of the differentiation and/or as the training sample based on the difference described herein (e.g., the dev (max)). For example, the indicator may further comprise one or more selected from the following group of parameters: the chromosome position of the mutation site, the base substitution pattern of the mutation site, the count value of nucleic acid fragments with various lengths in a wild type of the mutation site and/or the count value of nucleic acid fragments with various lengths in a mutant type of the mutation site, the allelic variation of the mutation site, the age of a subject and the mutation type of the mutation site.

In the present application, the indicator may further comprise one or more selected from the following group of parameters: the chromosome position of the SNV locus, the base substitution pattern of the SNV locus, the count value of nucleic acid fragments with various lengths in a wild type of the SNV locus and/or the count value of nucleic acid fragments with various lengths in a mutant type of the SNV locus, the allelic variation of the SNV locus, the age of a subject and the mutation type of the SNV locus.

In the present application, the method may further comprise the step of detecting the mutation site. The step of detecting the mutation site may be a routine step in the art. For example, with reference to the gene sequencing, the detection of the mutation site may comprise the steps of: (1) obtaining data from the sample; (2) performing variant identification on the data obtained in step (1) (for example, the variant identification can be based on factors such as base quality, mapping quality, number of mismatches, mutation frequency, and number of reads supporting the mutation); (3) performing variant annotation on the variants identified in step (2) (e.g., annotation can be performed using ANNOVAR 20160201, the 1000 Genomes database, the ExAC database, and/or the gnomAD genome database; e.g., database annotation, hotspot site annotation, mutation type annotation, and/or population frequency annotation); and, (4) filtering the variants annotated in step (3) (e.g., filtering by population mutation frequency, filtering of hotspot mutations, filtering of clonal hematopoietic mutations, and/or filtering by maximum depth can be performed) to obtain the mutation sites. For example, the step may further comprise quality control of the mutation site after step (4) (e.g., the quality control may comprise removal of duplicate fragments, and/or filtering of low quality fragments).

Apparatus

In another aspect, the present application provides an apparatus for distinguishing somatic mutations from germline mutations, comprising:

the calculating module is used for calculating the difference value of the ratio WC and the ratio MC with the same length; wherein, for each mutation site, the number of wild-type support fragments of at least one length and the corresponding number of mutant support fragments of the same length are determined; the ratio WC is the ratio of the number of the wild-type supported fragments of a length to the total number of the wild-type supported fragments; wherein the ratio MC is the ratio of the number of corresponding mutant support fragments of the same length to the total number of mutant support fragments; wherein the wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence, and the mutant-type support fragment is a cfDNA fragment comprising a mutant-type base sequence, wherein the wild-type base sequence is the same sequence as a nucleotide sequence of a reference genome at a position corresponding to the mutation site, wherein the mutant-type base sequence is a different sequence as a nucleotide sequence of a reference genome at a position corresponding to the mutation site, wherein the reference genome is a human reference genome in the gene sequencing; the mutation site is derived from a sample of a subject, wherein the mutation site is obtained by a method of gene sequencing,

and the judging module is used for obtaining a recognition result for recognizing the somatic mutation according to a machine learning model which is subjected to machine learning training, wherein the machine learning training comprises the step of inputting the difference value serving as a training sample into the machine learning model for machine learning training.

a judging module, configured to obtain a judgment result for identifying ctDNA in the cfDNA according to a machine learning model that has been subjected to machine learning training, where the machine learning training includes inputting the difference as a training sample to the machine learning model for machine learning training.

and the training module is used for inputting the difference value serving as a training sample to the machine learning model to perform machine learning training.

In the present application, the device may use only a sample derived from the subject.

In this application, the apparatus may further include: and the output module is used for displaying the identification result of the somatic mutation generated by the judgment module and/or the judgment result of ctDNA identified in the cfDNA.

In this application, the output module may display the result of identifying the somatic mutation and/or the result of identifying ctDNA in the cfDNA, which are generated by the determination module. For example, the output module may include an output device (e.g., a display) and/or an output program (e.g., a mobile terminal APP), so as to display the identification result of the somatic mutation and/or the judgment result of the ctDNA identified in the cfDNA, which are generated by the judgment module. In this application, the output module takes as input the result of identifying the somatic mutation obtained by the determination module and/or the result of identifying ctDNA in the cfDNA.

In the present application, the device may further comprise a sample obtaining module for obtaining the sample of the subject.

For example, the sample may comprise a blood sample. In the present application, the sample acquisition module may include reagents and/or instruments necessary to obtain the sample. For example, the sample obtaining module may include a lancet, a blood collection tube, and/or a blood sample transport case. For example, the sample acquisition module may include an anticoagulant. In this application, the sample acquisition module can output the samples described herein.

In this application, the apparatus may further comprise a data receiving module for obtaining the mutation site in the sample. For example, the data receiving module may input the sample. For example, the data receiving module can output the mutation sites described herein. In the present application, the data receiving module may include reagents and/or instruments necessary to obtain the mutation site. For example, the data receiving module may include reagents and/or instrumentation required for the gene sequencing. In the present application, the data receiving module can perform gene sequencing as described herein, for example, the gene sequencing can include next generation gene sequencing (NGS).

For example, the data receiving module may comprise a second generation gene sequencer (e.g., Roche 454 sequencer, Illumina sequencer). For example, the data receiving module may comprise an automated sample preparation system. For example, the data receiving module may include fluorescently labeled dNTPs, a terminal repair enzyme, a terminal repair reaction buffer, a DNA ligase, a DNA ligation buffer, and/or a library amplification reaction.

In the present application, the mutation site may include a Single Nucleotide Variation (SNV). In the present application, the mutation site may comprise two or more nucleotide variations.

In the present application, the detection of the mutation site in the device may comprise the steps of: (1) obtaining data from the sample; (2) performing variation identification on the data obtained in the step (1); (3) performing variation annotation on the variation identified in the step (2); and, (4) filtering the variation annotated in step (3) to obtain a mutation site; optionally, quality control is performed on the mutation site.

In this application, the apparatus may further comprise an input module for obtaining the number of the wild-type support fragments of the at least one length, and/or the corresponding number of the mutant support fragments of the same length.

For example, the input module can input the mutation sites described herein. The input module may output the number of the wild-type support fragments of the at least one length, and/or the corresponding number of the mutant support fragments of the same length. In the present application, the input module may include reagents and/or instruments capable of counting the wild-type support fragments of a particular length. The input module may include reagents and/or instruments capable of counting the mutant support fragments of a particular length. In the present application, the input module may comprise an instrument (for example, a display) and/or an output program (for example, a mobile terminal APP) capable of displaying the number of the wild-type support fragments of the at least one length, and/or the corresponding number of the mutant support fragments of the same length, so that the numbers of wild-type and/or mutant support fragments obtained with the input module may be displayed. In this application, the input module can distinguish between the wild-type support fragment and the mutant support fragment. In this application, the input module may count the number of the wild-type support fragments of different lengths, and count the number of the mutant support fragments of different lengths.

In the present application, the length of the wild-type support fragment and/or the mutant support fragment may range from about 1 nucleotide to about 550 nucleotides. For example, it may be from about 1 nucleotide to about 400 nucleotides. For example, it may be from about 1 nucleotide to about 200 nucleotides.

In the present application, the calculation module can input the number of wild-type support fragments described herein (e.g., obtainable by the input module described herein), and the number of mutant support fragments of correspondingly the same length. The calculating module may output the difference value described herein, for example, the calculating module may output the dev (max) described herein. The calculation module may include calculation logic and/or a calculation program to calculate the difference values described herein.

In this application, a distribution of the difference values may be obtained in the calculation module, with the maximum value in the distribution being selected as dev (max), with the dev (max) being the index of differentiation and/or as the training sample.

In this application, the difference value may be smoothed in the calculation module, wherein the smoothing process may include the steps of: (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1-30; (b) determining a number of smoothed sample length ranges having length values equal to a smoothing window value, wherein the minimum value of each smoothed sample length range is a starting length, wherein the starting length ranges from the length of the wild-type and/or mutant support fragment; (c) obtaining the number of the wild type supported fragments of at least one smoothing sampling length in any smoothing sampling length range, obtaining the number of the corresponding mutant type supported fragments with the same length, and calculating the ratio WC of the number of the wild type supported fragments with the length to the total number of the wild type supported fragments; calculating the ratio MC of the number of the mutant support fragments of the same length to the total number of the mutant support fragments; calculating the difference value of the ratio WC and the ratio MC under the same length; (d) calculating an average difference value of the range of smoothed sample lengths based on said difference value of said at least one smoothed sample length; (e) taking the average difference obtained in step (d) as a representative value of the range of smoothed sample lengths.

In the present application, the smoothing window value may be an integer of about 2 to 6. For example, the smoothing window value may be 3.

In the present application, the smoothing process may include the steps of: (f) obtaining a first distribution of the average difference values of step (e).

In the present application, the smoothing process may include the steps of: (g) sequentially accumulating each average difference in the first distribution over a length of an effective fragment interval covering the length of the nucleosome-wrapped nucleic acid sequence to obtain an additive value.

In the present application, the nucleic acid sequence may be wound around the nucleosome for more than 2 turns, or, alternatively, for less than 1 turn. In the present application, the effective fragment interval can be from about 1 to about 167 nucleotides in length, and/or more than about 200 nucleotides in length. In the present application, the effective fragment interval can be from about 1 to about 167 nucleotides in length, and/or from about 250 to about 400 nucleotides in length.

In the present application, the smoothing process may include the steps of: (h) obtaining a second distribution of the added values of step (g), calculating the maximum of the added values in the second distribution. In this application, the maximum value of the added value may be taken as dev (max), the dev (max) as the index of discrimination and/or as the training sample.

In this application, the judging module may obtain the relevant judgment result according to the machine learning model that has been trained by machine learning (for example, the judgment result may include the identification result of the somatic mutation described in this application and/or the judgment result of the ctDNA identified in the cfDNA described in this application). In this application, the determining module may input the difference value (e.g., the dev (max)) described in this application. The judging module may output the related judgment result. In this application, the module may include a machine learning model that has been machine learning trained. The machine learning model is obtained by using the verification set and the difference values (e.g., which may also include using the parameters) described in the present application, using the training method of the machine learning model described in the present application.

In the present application, the index and/or training samples may further include one or more of the following parameters: the chromosome position of the mutation site, the base substitution pattern of the mutation site, the count value of nucleic acid fragments with various lengths in a wild type of the mutation site and/or the count value of nucleic acid fragments with various lengths in a mutant type of the mutation site, the allelic variation of the mutation site, the age of a subject and the mutation type of the mutation site.

In the present application, the index and/or training samples may further include one or more of the following parameters: the chromosome position of the SNV locus, the base substitution pattern of the SNV locus, the count value of nucleic acid fragments with various lengths in a wild type of the SNV locus and/or the count value of nucleic acid fragments with various lengths in a mutant type of the SNV locus, the allelic variation of the SNV locus, the age of a subject and the mutation type of the SNV locus.

In this application, the apparatus may include the calculating module and the determining module. The apparatus may include the computing module and the training module.

In this application, the apparatus may include the sample acquiring module, the data receiving module, the input module, the calculating module, the judging module, and the output module. In the present application, the sample, and the information and/or calculation result derived from the sample may be sequentially transmitted from the sample obtaining module, the data receiving module, the input module, the calculation module, the judgment module, and the output module.

For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (e.g., a solid state drive (SSD), a solid state card (SSC), or a solid state module (SSM)), an enterprise-level flash drive, a magnetic tape, or any other non-transitory magnetic medium, and so forth. The non-volatile computer-readable storage medium may also include punch cards, paper tape, optical mark sheets (or any other physical medium with a hole pattern or other optically recognizable indicia), compact disc read only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile discs (DVD), Blu-ray discs (BD), and/or any other non-transitory optical medium.

For example, the database system may implement various mechanisms to ensure that the methods described herein performed on the database system produce correct results. In this application, the database system may use a disk as a persistent data store. In the present application, the database system may provide database storage and processing services for a plurality of database clients. The database client may store database data across multiple shared storage devices and/or may utilize one or more execution platforms having multiple execution nodes. The database system may be organized such that storage and computing resources may be effectively extended indefinitely.

Applications

In the present application, the method can be used to determine whether the subject has a germline mutation. Subjects carrying certain specific germline mutations may have a higher lifetime risk of having a tumor (e.g., colorectal, endometrial, gastric, and/or ovarian cancer) than the general population. Thus, the method may be used to screen subjects with higher risk. The subject can receive individual monitoring of the tumor, thereby achieving the purpose of early diagnosis and early treatment.

In the present application, the methods can be used in clinical practice (e.g., it can be speculated whether certain specific tumor treatment modalities are appropriate for the subject) by detecting the TMB. In some cases, the TMB levels detected by the methods can be used in clinical practice in combination with other biomarkers such as immune checkpoints, T cell inflammation markers, and the like.

Without intending to be bound by any theory, the following examples are merely intended to illustrate the fusion proteins, preparation methods, uses, etc. of the present application, and are not intended to limit the scope of the invention of the present application.

Examples

Example 1 obtaining the mutant sites described herein

1. Data preparation

a) Sequence alignment: the sequences are mapped onto the human reference genome GRCh37/hg19 using the mem module of bwa 0.7.10 to form alignments.

2. Variant identification

Variant calling was performed for SNVs using VarDict 1.5.1, with the following parameters:

a) removing bases with a base quality < 30;

b) removing reads with too low a mapping quality, e.g., reads with a mapping quality < 60;

c) removing reads with too many mismatches, for example: more than 12, 10, 8 or 6 mismatches;

d) the mutation frequency should not be too small, for example: mutation frequency >= 0.002, 0.001, 0.0005, 0.0002 or 0.0001;

e) number of reads supporting the mutation >= 3, 2, or 1;

3. Variant annotation

Including database annotation, hotspot mutation site annotation, mutation type annotation, and population frequency annotation.

a) Annotation of variant sites using ANNOVAR 20160201;

b) annotation of hotspot mutation sites: if a mutation is in the hotspot mutation list, it is a hotspot mutation, and in the subsequent mutation filtering, hotspot mutations are excluded from the prediction of the model;

c) mutation type annotation of the variants using SnpEff V4.3;

d) annotation of population frequency: given a mutation site, the maximum of the population frequencies in the various databases is taken as the population frequency of the mutation site.

The databases used include, but are not limited to: the 1000 Genomes database, the ExAC database, the ESP6500 database, and the like.
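For example, the population frequency rule of step d) can be sketched as below. The function name and the database keys are hypothetical, and treating a mutation that is absent from every database as having a frequency of 0 is an assumption of this sketch rather than something stated in the text.

```python
def population_frequency(freqs_by_database):
    """Take the maximum population frequency reported across databases.

    `freqs_by_database` maps a database name to a frequency, or to None
    when the mutation is not recorded in that database (assumed convention).
    """
    known = [f for f in freqs_by_database.values() if f is not None]
    # Assumption: a mutation missing from every database gets frequency 0.0
    return max(known) if known else 0.0
```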

4. SNV mutation filtering

All annotated mutation sites were filtered according to the following conditions:

a) filtering the mutation frequency of the population: mutations with a population mutation frequency less than a certain value remain after filtering, for example: less than or equal to 0.005, 0.002 or 0.001;

b) filtering hot spot mutation;

c) filtering the clonal hematopoietic mutation;

d) maximum depth filtration: mutations greater than a certain sequencing depth are filtered, for example: the sequencing depth is more than 20000 and the like;
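For example, the four filtering conditions above can be combined into a single predicate, as in the following sketch. The function and parameter names are hypothetical; the default thresholds (0.005 and 20000) are simply the first example values given in the text, and the hotspot and clonal hematopoiesis flags are assumed to have been determined beforehand.

```python
def keep_snv(pop_freq, is_hotspot, is_clonal_hematopoiesis, depth,
             max_pop_freq=0.005, max_depth=20000):
    """Return True if an annotated SNV passes all four filters (illustrative)."""
    return (
        pop_freq <= max_pop_freq          # a) population mutation frequency
        and not is_hotspot                # b) hotspot mutations are filtered out
        and not is_clonal_hematopoiesis   # c) clonal hematopoietic mutations
        and depth <= max_depth            # d) maximum sequencing depth
    )
```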

5. Quality control of SNV mutation site fragments

a) Removal of duplicate sequences: removing duplicate sequences generated in the PCR amplification process;

b) filtering low quality fragments: filtering fragments with a median base quality of less than Q20;

c) filtering fragments with sequencing errors: filtering fragments that cannot be aligned to the reference genome;

d) removal of mutations with low depth of coverage: SNVs supported by fewer than 50 fragments were removed.
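For example, steps b) to d) of the quality control can be sketched as follows. The record layout (a dict with `base_qualities` and `is_aligned`) and the function names are hypothetical, PCR duplicate removal (step a) is assumed to have been done upstream, and the 50-fragment threshold is assumed to be applied to the fragments that remain after filtering.

```python
from statistics import median


def keep_fragment(base_qualities, is_aligned, min_median_quality=20):
    """Steps b) and c): keep aligned fragments whose median base quality is >= Q20."""
    return is_aligned and median(base_qualities) >= min_median_quality


def keep_site(fragments, min_fragments=50):
    """Step d): keep an SNV only if enough fragments survive the filters."""
    kept = [
        f for f in fragments
        if keep_fragment(f["base_qualities"], f["is_aligned"])
    ]
    return len(kept) >= min_fragments
```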

Example 2 method of obtaining the difference described in this application

2.1

The differences described in the present application were calculated according to the mutation site SNV obtained in example 1, according to the following procedure:

a) obtaining wild type support fragment and mutant type support fragment: wherein the wild type support fragment is a cfDNA fragment comprising a wild type base sequence, and the mutant type support fragment is a cfDNA fragment comprising a mutant base sequence, wherein the wild type base sequence is the same sequence as compared to the nucleotide sequence of a reference genome at the corresponding position of the mutation site, wherein the mutant base sequence is a different sequence as compared to the nucleotide sequence of a reference genome at the corresponding position of the mutation site, and the reference genome is a human reference genome in the gene sequencing.

b) Constructing distribution patterns of the wild type support fragment and the mutant type support fragment within a specific length range respectively:

the distribution of the wild type and mutant support fragments was calculated over a length range of 1 to 400 nucleotides.

c) The difference in fragmentation pattern (Dev) between the two groups is quantified over a specific interval, as follows:

In formula (1), WC_i and MC_i respectively represent the number of the wild type support fragments with a length of i nucleotides and the number of the mutant type support fragments with a length of i nucleotides at a given mutation site.

Wherein 3 is the smoothing window value;

where j is a length value in the smoothing sample length range, for example, j may be an integer in an arithmetic sequence such as 1, 4, 7, or 10;

wherein 400 is the range of the length of the wild-type support fragment and/or the mutant support fragment.

In other words, with 3 as the interval length, the accumulated values of the ratios at the different lengths are calculated according to equation (1) in the range of nucleotide lengths from 1 to 400, respectively, and the set of these values constitutes the first distribution D (i.e., equation (2)).

The effective fragment interval is then set to a length of about 1 to about 167 nucleotides, and/or about 250 to about 400 nucleotides. In the present application, the length of the effective fragment interval may be the length of the nucleic acid sequence wound around the nucleosome. For example, the nucleic acid sequence can be wound around the nucleosome for more than 2 turns, or, alternatively, for less than 1 turn (e.g., the effective fragment interval can be from about 1 to about 167 nucleotides, and/or from about 250 to about 400 nucleotides in length).

The values of B in the first distribution D (namely the accumulated values of the ratios) are then accumulated again in sequence within the effective fragment interval to obtain the added values (see formula (3)).

（3）

For example, assuming that the length of the interval of the valid fragments is 100 (i.e., i is 100), the sequential addition values of the values of each B in the first distribution D are calculated in the range of 1 to 100 nucleotides in length.

The set of added values constitutes the second distribution A, and the largest added value in the second distribution is denoted dev (max) (see formula (4)).

（4）
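For example, based on the definitions given in this example and in the description of the smoothing process, formulas (1) to (4) may plausibly be written as below. This is a reconstruction from the surrounding prose under stated assumptions, not the typeset formulas of the present application: W_total and M_total denote the total numbers of wild type and mutant type support fragments, the index m is introduced only for illustration, and the indexing of the added values follows the statement that D1 is the sum of B1 and B4.

```latex
% (1) average difference for the window starting at length j (window value 3)
B_j = \frac{1}{3} \sum_{i=j}^{j+2}
      \left( \frac{WC_i}{W_{total}} - \frac{MC_i}{M_{total}} \right),
      \qquad j = 1, 4, 7, \dots

% (2) first distribution
D = [\, B_1, B_4, B_7, \dots, B_{400} \,]

% (3) added values within the effective fragment interval
D_k = \sum_{m=0}^{k} B_{1+3m}, \qquad k = 1, 2, \dots

% (4) maximum of the second distribution A = [D_1, D_2, \dots]
Dev(max) = \max(A)
```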

For example, FIG. 8 shows the distribution frequency of the lengths of the wild-type support fragment and the mutant support fragment of the present application obtained by the method described in example 2.1 for the mutation site C-T at 20525808 of human chromosome 4.

For example, FIG. 9 shows the distribution frequency of the lengths of the wild-type support fragment and the mutant support fragment of the present application obtained by the method described in example 2.1 for the mutation site G-T at 56189455 of human chromosome 5.

For example, FIG. 10 shows the distribution frequency of the lengths of the wild-type support fragment and the mutant support fragment of the present application obtained by the method described in example 2.1 for the mutation site C-A at 7577141 of human chromosome 17.

2.2

Wherein 3 is the smoothing window value;

where j is a length value in the smoothing sample length range, for example, j may be an integer in an arithmetic sequence such as 1, 2, 3, or 4;

（3）

（4）

Example 3 machine learning as described herein

(1) The indices referred to in table 1 are input to the machine learning model described herein for machine learning training.

These indices can be divided into 7 types according to the type to which the different features belong, and the indices are all related to the mutation site.

TABLE 1

a) Position information: including the chromosomal location where the SNV is located, e.g., 68771372 for chromosome 16.

b) Base substitution pattern: in a single SNV site, the base from the wild type is converted into a newly introduced mutant base pattern. For example, chr3,178935093C > A, and the substitution pattern of bases is "CA". This feature uses a method of "one-hot coding", taking into account theoretically 12 alternative modes, respectively: AT, AC, AG, TA, TC, TG, CA, CT, CG, GA, GT, GC.

c) Dev values obtained in example 2 (i.e. values that can reflect the fragmentation pattern of cfDNA): the features W_ratio and M_ratio, which characterize the direction of the shift, can also be used.

To show the difference between the two groups intuitively, Delta_ratio can also be used. The calculation methods of the above three parameters are shown in formula (5), formula (6) and formula (7), respectively.

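$$W_{ratio} = \frac{C_{l>167}}{C_{l<167}} \quad (5)$$

$$M_{ratio} = \frac{C_{l>167}}{C_{l<167}} \quad (6)$$

$$Delta_{ratio} = W_{ratio} - M_{ratio} \quad (7)$$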

The number 167 may be any integer of 160-174.

In formula (5), C_{l>167} and C_{l<167} respectively represent the number of the wild-type support fragments greater than 167 nucleotides in length and the number of the wild-type support fragments less than 167 nucleotides in length, and W_ratio represents the ratio of C_{l>167} to C_{l<167}.

In formula (6), C_{l>167} and C_{l<167} respectively represent the number of the mutant support fragments greater than 167 nucleotides in length and the number of the mutant support fragments less than 167 nucleotides in length, and M_ratio represents the ratio of C_{l>167} to C_{l<167}.

Formula (7) then represents the difference between W_ratio and M_ratio.

d) Fragment counts: the number of all non-mutated wild-type fragments at a given mutation site, and the number of all fragments supporting the single-base mutation at that site.

e) Allelic variation: this class of features includes two categories, namely sample frequency and population frequency. The sample frequency refers to the Variant Allele Frequency, i.e., the allele frequency of the mutation within the sample in which it occurs, and the Population Frequency refers to the frequency of the mutation in a population.

f) Age: i.e., the age of the subject from whom the sample carrying the mutation was taken.

g) Mutation type: i.e., the result of variant annotation; this class of features includes the following categories:

splice_donor_variant (splice donor mutation)

synonymous_variant (synonymous mutation)

stop_gained (stop codon gained)

intron_variant (intron mutation)

stop_lost (stop codon lost)

missense_variant (missense mutation)

splice_region_variant (splice region mutation)

splice_acceptor_variant (splice acceptor mutation)

promoter region variant

start_lost (start codon lost)
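The encoding of items b) and c) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the helper names `one_hot_substitution` and `long_short_ratio` are hypothetical, the fragment lengths are made up, the handling of an empty denominator is an assumption, and formula (7) is taken as W_ratio minus M_ratio.

```python
# Hypothetical helpers illustrating features b) and c); all data are made up.

# The 12 possible single-base substitution patterns, in the order listed in b).
PATTERNS = ["AT", "AC", "AG", "TA", "TC", "TG",
            "CA", "CT", "CG", "GA", "GT", "GC"]


def one_hot_substitution(ref, alt):
    """Return a 12-element 0/1 vector encoding the substitution ref > alt."""
    pattern = (ref + alt).upper()
    if pattern not in PATTERNS:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return [1 if p == pattern else 0 for p in PATTERNS]


def long_short_ratio(lengths, threshold=167):
    """Count of fragments longer than threshold divided by count shorter."""
    longer = sum(1 for length in lengths if length > threshold)
    shorter = sum(1 for length in lengths if length < threshold)
    if shorter == 0:
        # Assumption: the text does not say how an empty denominator is handled.
        return float("nan")
    return longer / shorter


# chr3:178935093 C>A from item b) corresponds to pattern "CA".
substitution_vector = one_hot_substitution("C", "A")

wild_lengths = [150, 160, 167, 170, 180, 190]   # made-up fragment lengths
mutant_lengths = [140, 150, 160, 165, 170]      # made-up fragment lengths

w_ratio = long_short_ratio(wild_lengths)        # formula (5): 3 / 2
m_ratio = long_short_ratio(mutant_lengths)      # formula (6): 1 / 4
delta_ratio = w_ratio - m_ratio                 # formula (7), assumed W minus M
```

Fragments of exactly the threshold length fall into neither count, which follows the strict "greater than" and "less than" wording of the text.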

After the encoding is completed, a z-transformation (z-score standardization) is performed on each feature type, i.e., all values are rescaled to have a mean of 0 and a variance of 1.
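A minimal sketch of such a standardization, assuming it is applied column by column to a table of numeric features and that the population standard deviation is used; the function name `z_score_columns` and the example values are hypothetical:

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Standardize each column of a list-of-rows table to mean 0, variance 1."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation
        if sigma == 0:
            # Assumption: a constant feature is mapped to all zeros.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - mu) / sigma for value in column])
    # Transpose back to the original row layout.
    return [list(row) for row in zip(*scaled_columns)]


# Made-up rows: (age, variant allele frequency)
features = [[50, 0.10], [60, 0.30], [70, 0.50]]
scaled = z_score_columns(features)
```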

(2) Model training

In the model training process, the ensemble module of the Python machine learning library sklearn v0.23.2 was used, with the parameters set as follows. The criterion for measuring the purity of a split between classes was set to 'entropy'; the maximum decision tree depth was set to None, so that it is determined by the minimum number of samples required for splitting; the minimum number of samples required to split a node was 10; and the final result was determined by the vote of 40 decision trees.
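The settings above might map onto scikit-learn as in the following sketch. It assumes the model is `RandomForestClassifier` from `sklearn.ensemble`, and that the described settings correspond to the `criterion`, `max_depth`, `min_samples_split` and `n_estimators` arguments; this mapping, the `random_state` value and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed mapping of the described settings onto scikit-learn arguments.
model = RandomForestClassifier(
    n_estimators=40,        # final result decided by 40 decision trees
    criterion="entropy",    # split-purity criterion
    max_depth=None,         # depth limited only by the split constraints
    min_samples_split=10,   # minimum samples required to split a node
    random_state=0,         # added for reproducibility, not from the text
)

# Synthetic stand-in for the standardized feature matrix and labels
# (1 = somatic, 0 = germline is an arbitrary choice for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

model.fit(X, y)
probabilities = model.predict_proba(X)[:, 1]
```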

Example 4 application of the methods described herein to specific tumors

The ground-truth data included 1309 lung cancer blood samples in total, which were divided into a training set containing 928 samples and two validation sets containing 191 and 190 samples, respectively (i.e., the training set, validation set 1, and validation set 2 in fig. 1).

First, the trained machine learning model was obtained by modeling the 12173 germline mutations and 5816 somatic mutations remaining in the training set after population frequency filtering, according to the procedure of Examples 1-3.

Then, using the trained machine learning model, model validation was performed on each of the 2 validation sets (see fig. 1).

During training, 20% of the data for all 17989 mutations was set aside as a test set. On the remaining 80% (the training set), internal 5-fold cross-validation was used to select the optimal hyper-parameters for each model, and the result of each optimal model on the 20% test set was then obtained. The training results of the models are shown in fig. 2. In fig. 2, RF (+ Dev) and RF (-Dev) refer to the results of model verification of the 2 validation sets by machine learning models trained with and without the Dev parameters, respectively.
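The hold-out and cross-validation scheme described above can be sketched with scikit-learn as follows. This is an illustration only: the synthetic data, the hyper-parameter grid, the stratified split and the use of AUC as the selection score are assumptions, not details given in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features are those of Table 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 20% as the test set (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Internal 5-fold cross-validation on the 80% training portion to pick
# hyper-parameters; the grid below is hypothetical.
search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid={"n_estimators": [20, 40], "min_samples_split": [2, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate the selected model once on the untouched 20% test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

Keeping the 20% test set out of the cross-validation loop means the reported test AUC is not influenced by hyper-parameter selection.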

The results show that the random forest performed best among all models, with an AUC of 0.9983.

In addition, on the 2 validation sets described above (figs. 3-4, which show the performance of the trained machine learning model on validation set 1 and validation set 2, respectively), the trained machine learning model described in the present application also showed excellent performance, with AUCs of 0.9962 and 0.9977, respectively, which demonstrates the generalization capability of the method described in the present application.

Example 5 application of the methods described in the present application to different tumors

In order to confirm that the trained machine learning model described in the present application can be broadly applied to germline/somatic discrimination across cancer types (pan-cancer), 1008 samples from 11 cancer types were used (see fig. 5 for details of the samples). After population frequency filtering and other filtering methods, the mutations finally included in the assessment comprised 6647 somatic mutations and 13567 germline mutations (fig. 5).

Overall, the trained machine learning model described in this application has good predictive power on the mixed multi-cancer test set of 1008 samples, with an AUC of 0.9947 (see fig. 6), where cfSvG denotes the name of the algorithm developed by the applicant.

In addition, the ability of the model to classify mutations within each cancer type was also tested. The AUC of the model was stably above 0.99 in almost all of the 11 cancer types. Only in the bladder cancer data did the performance decrease slightly, and even there the AUC reached 0.9886 (the AUC results are shown in fig. 7).

The methods and/or models described herein therefore perform well not only in lung cancer, but also show superior classification performance across cancer types.

The embodiments of the present application have been described in detail, but the present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the technical idea of the present application, and the simple modifications belong to the protection scope of the present application. It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in the present application. In addition, any combination of the various embodiments of the present application is also possible, and the same should be considered as disclosed in the present application as long as it does not depart from the idea of the present application.

Claims

1. A method for differentiating somatic and germline mutations without the direct objective of obtaining a disease diagnostic, comprising the steps of:

step one: obtaining at least one mutation site from a sample of a subject;

step two: for each mutation site, acquiring a wild type support fragment and a mutant type support fragment; wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence, and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the same sequence as the nucleotide sequence of the reference genome at the corresponding position of the mutation site; the mutant base sequence is a different sequence compared to the nucleotide sequence of the reference genome at the corresponding position of the mutation site;

step three: obtaining the number of said wild-type support fragments of 1-400 nucleotides in length and the corresponding number of said mutant support fragments of the same length for each mutation site,

calculating the difference value of the ratio WC and the ratio MC under the same length;

step four: obtaining a first distribution of the difference values in step three;

step five: sequentially accumulating each difference value in the first distribution within the length range of an effective segment interval to obtain an addition value;

step six: obtaining a second distribution of said addition values of step five, calculating the maximum of said addition values in said second distribution as Dev_Max, using said Dev_Max as said indicator for distinguishing said mutation site as a somatic mutation or a germline mutation;

inputting the Dev_Max in the sixth step as a training sample to a machine learning model for machine learning training.
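Steps three to six can be illustrated by the following sketch. It is written under explicit assumptions and is not the claimed implementation: the function name `dev_max` and the `interval` parameter are hypothetical, the effective fragment interval defaults to the full 1-400 nucleotide range, the per-length difference is taken as WC minus MC (the order in which the two ratios are named), and the fragment lengths are made up.

```python
from collections import Counter
from itertools import accumulate


def dev_max(wild_lengths, mutant_lengths, interval=(1, 400)):
    """Maximum running sum of (WC - MC) over the given length interval.

    WC / MC are the fractions of wild-type / mutant support fragments that
    have a given length. The sign convention WC - MC is an assumption.
    """
    if not wild_lengths or not mutant_lengths:
        raise ValueError("both fragment groups must be non-empty")
    wild_counts = Counter(wild_lengths)
    mutant_counts = Counter(mutant_lengths)
    wild_total = len(wild_lengths)
    mutant_total = len(mutant_lengths)

    start, end = interval
    # First distribution: difference of the two ratios at each length.
    diffs = [
        wild_counts[length] / wild_total - mutant_counts[length] / mutant_total
        for length in range(start, end + 1)
    ]
    # Second distribution: differences accumulated length by length.
    cumulative = list(accumulate(diffs))
    return max(cumulative)


# Made-up fragment lengths for a single mutation site.
example_value = dev_max([160, 165, 170, 170], [170, 175, 180, 180])
```

With these made-up lengths the running sum rises to 0.75 at length 170 and returns to 0 at length 180, so the sketch returns 0.75.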

2. A method for identifying ctDNA in cfDNA without direct objective of obtaining a disease diagnosis, comprising the steps of:

step six: obtaining a second distribution of the addition values in step five, calculating the maximum value of the addition values in the second distribution as Dev_Max, and using the Dev_Max as the index for identifying whether the mutation site is ctDNA;

3. The method of any one of claims 1-2, wherein the mutation site is obtained by a method of gene sequencing.

4. The method according to any one of claims 1-2, wherein the length of the effective fragment interval covers the length of a nucleic acid sequence that winds around the nucleosome, or that is capable of winding around the nucleosome for more than 2 turns, or that is capable of winding within 1 turn.

5. The method according to any of claims 1-2, wherein the difference is smoothed, wherein the smoothing comprises the steps of:

step a: determining a smoothing window value, wherein the smoothing window value is an integer from 1 to 10;

step b: determining a plurality of smoothing sample length ranges with length values equal to the smoothing window value, wherein the minimum value of each smoothing sample length range is the starting length;

step c: obtaining the number of the wild type supported fragments of at least one smoothing sampling length in any smoothing sampling length range, obtaining the number of the corresponding mutant type supported fragments with the same length, and calculating the ratio WC of the number of the wild type supported fragments with the length to the total number of the wild type supported fragments; calculating the ratio MC of the number of the mutant support fragments of the same length to the total number of the mutant support fragments;

step d: calculating an average difference value of the range of smoothed sample lengths based on said difference value of said at least one smoothed sample length;

step e: the resulting average difference is taken as a representative value for the range of smoothed sample lengths.
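The smoothing of steps a to e can be sketched as a moving average over consecutive lengths. The sketch assumes that the input is the list of per-length differences ordered by increasing length, that each range starts at one length and spans `window` consecutive lengths, and that only complete ranges are averaged; the function name `smooth_differences` and the example values are hypothetical.

```python
def smooth_differences(diffs, window=3):
    """Average each run of `window` consecutive per-length differences.

    diffs[i] is the difference at the i-th length; output element i is the
    mean of diffs[i : i + window], i.e. the range starting at that length.
    """
    if not isinstance(window, int) or not 1 <= window <= 10:
        raise ValueError("window must be an integer from 1 to 10")
    return [
        sum(diffs[i:i + window]) / window
        for i in range(len(diffs) - window + 1)
    ]


# Made-up differences at five consecutive lengths, smoothed with window 3.
smoothed = smooth_differences([0.0, 0.375, 0.75, 0.375, 0.0], window=3)
```

A window of 1 leaves the differences unchanged, and larger windows trade length resolution for less noise in sparsely covered lengths.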

6. The method of any one of claims 1-2, wherein the indicator for distinguishing the mutation site as a somatic mutation or a germline mutation, or the indicator of whether the mutation site is ctDNA, further comprises one or more parameters selected from the group consisting of: the chromosome position of the mutation site, the base substitution pattern of the mutation site, the count value of nucleic acid fragments with various lengths in a wild type of the mutation site and/or the count value of nucleic acid fragments with various lengths in a mutant type of the mutation site, the allelic variation of the mutation site, the age of a subject and the mutation type of the mutation site.

7. The method of any one of claims 1-2, wherein detecting the mutation site comprises the steps of:

obtaining data from the sample;

performing variation identification on the obtained data;

performing variant annotation on the identified variants; and

filtering the annotated variants to obtain the mutation site.

8. A method for training a machine learning model without direct objective of obtaining disease diagnosis results, for implementing the method of any one of claims 1-7, wherein the Dev_Max in step six is input as a training sample to the machine learning model for machine learning training.