CN119360984A

CN119360984A - Target tumor DNA fragment screening method based on deep learning

Info

Publication number: CN119360984A
Application number: CN202411911205.5A
Authority: CN
Inventors: 崔品; 冯铭冀; 贺位皇
Original assignee: Shenzhen Ruifa Biotechnology Co., Ltd.
Current assignee: Shenzhen Ruifa Biotechnology Co., Ltd.
Priority date: 2024-12-24
Filing date: 2024-12-24
Publication date: 2025-01-24

Abstract

The present invention relates to a deep learning-based method for screening target tumor DNA fragments and belongs to the technical field of tumor DNA fragment screening. A tumor DNA fragment screening model is constructed with a deep neural network; the training set is used to build the model and the test set is used to evaluate its performance. DNA sequencing data of a positive tumor sample to be identified are then obtained and analyzed with the trained screening model to locate the target tumor DNA fragment, and the location is displayed in a preset manner. The invention reduces, as far as possible, the interference that multiple different types of tumor DNA fragments within the same tumor training sample exert on model training, thereby improving the prediction accuracy of the model and the screening accuracy for tumor DNA fragments.

Description

Target tumor DNA fragment screening method based on deep learning

Technical Field

The invention relates to the technical field of DNA detection screening, in particular to a target tumor DNA fragment screening method based on deep learning.

Background

Second-generation sequencing, also called massively parallel sequencing, is built on the core idea of sequencing by synthesis. It can sequence millions or even billions of DNA molecules simultaneously, achieving large-scale, high-throughput sequencing, and represents a revolutionary advance over Sanger sequencing. With its rapid development in recent years, second-generation sequencing has gradually been applied to clinical testing and scientific research on hematologic tumors. Target sequence capture can selectively isolate or enrich specific fragments of the genome, so that a higher sequencing depth is obtained at lower cost, which lays a good foundation for low-frequency detection, large-scale data accumulation and the like. At present, in clinical oncology practice, second-generation sequencing is mainly applied to driver gene sequencing and is an important step in detection. Deep learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It mainly studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance.
Screening massive numbers of tumor DNA fragments with deep learning can greatly increase the speed of DNA fragment detection. In practice, however, tumor DNA fragments may be of more than one type, and because mutation is non-directional, a single free DNA fragment or ctDNA (tumor-derived free DNA) fragment may carry several tumor DNA fragments of different types. The prior art does not take this into account, so the model is disturbed by the multiple different types of tumor DNA fragments during training and its prediction accuracy is greatly reduced.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides a target tumor DNA fragment screening method based on deep learning.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A first aspect of the invention provides a deep learning-based target tumor DNA fragment screening method, which specifically comprises the following steps:

Obtaining DNA sequencing data of a preset number of positive tumor samples and free DNA sequencing data of a control sample, and screening out specific characteristics of ctDNA fragments according to the DNA sequencing data of the positive tumor samples and the free DNA sequencing data of the control sample;

Constructing a training sample data set and a test sample data set according to the DNA sequencing data of the preset number of positive tumor samples and the specific characteristics of ctDNA fragments, and obtaining the training data set and the test data set after the processing by performing characteristic processing on the training sample data set and the test sample data set;

Constructing a tumor DNA fragment screening model based on a deep neural network, building the model on the processed training data set with a neural network algorithm, and testing model performance on the test data set to obtain the best-performing model;

And acquiring positive tumor sample DNA sequencing data to be identified, carrying out positioning analysis on the positive tumor sample DNA sequencing data to be identified according to the trained tumor DNA fragment screening model, acquiring the position of a target tumor DNA fragment, and displaying according to a preset mode.

Further, in the method, the DNA sequencing data of the positive tumor samples and the free DNA sequencing data of the control samples with preset numbers are obtained, specifically:

acquiring index data of initial sample data of positive tumors, setting a sample evaluation index range, and judging whether the index data of the initial sample data of the positive tumors is within the sample evaluation index range;

if the index data of the initial sample data of the positive tumor is within the range of the sample evaluation index, extracting positive tumor sample DNA sequencing data from the corresponding initial sample data of the positive tumor by a gene detection technology;

Counting and acquiring DNA sequencing data of a preset number of positive tumor samples, collecting free DNA sequencing data of a control sample, and outputting the DNA sequencing data of the preset number of positive tumor samples and the free DNA sequencing data of the control sample.

Further, in the method, specific characteristics of ctDNA fragments are screened according to the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of a control sample, and specifically include:

Introducing a graph traversal algorithm, and performing first traversal on the positive tumor sample DNA sequencing data through the graph traversal algorithm;

Performing a second traversal on the free DNA sequencing data of the control sample through the graph traversal algorithm;

The same part and the non-same part between the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of the control sample are obtained through traversal, and the DNA sequencing data of the non-same part between the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of the control sample are extracted;

And taking DNA sequencing data of a non-identical part between the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of a control sample as the specific characteristics of the ctDNA fragment, and outputting the specific characteristics of the ctDNA fragment.
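The source does not state which graph is traversed or how the identical and non-identical parts are compared. As a minimal sketch, assume each set of reads is decomposed into k-mers that form a de Bruijn-style graph, that each graph is enumerated with a breadth-first traversal, and that the non-identical part is the set of k-mers seen only in the tumor data. The names `K`, `build_kmer_graph`, `bfs_nodes` and `ctdna_specific_features`, and the toy reads, are hypothetical.

```python
from collections import defaultdict, deque

K = 4  # assumed k-mer length; the source does not specify one


def build_kmer_graph(reads, k=K):
    """Nodes are k-mers; edges link consecutive k-mers within a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # register the node even if it has no outgoing edge
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph


def bfs_nodes(graph):
    """Breadth-first traversal that returns every reachable k-mer."""
    visited = set()
    for start in list(graph):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
    return visited


def ctdna_specific_features(tumor_reads, control_reads, k=K):
    tumor_nodes = bfs_nodes(build_kmer_graph(tumor_reads, k))      # first traversal
    control_nodes = bfs_nodes(build_kmer_graph(control_reads, k))  # second traversal
    shared = tumor_nodes & control_nodes      # identical part
    specific = tumor_nodes - control_nodes    # non-identical part
    return shared, specific


tumor = ["ACGTTGCA", "ACGTAGCA"]   # toy reads, not real data
control = ["ACGTTGCA"]
shared, specific = ctdna_specific_features(tumor, control)
print(sorted(specific))
```

A set difference alone would give the same result here; the traversal is kept only to mirror the two traversal steps described above.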

Further, in the method, a training sample data set and a test sample data set are constructed according to the preset number of positive tumor sample DNA sequencing data and specific characteristics of ctDNA fragments, and the training sample data set and the test sample data set are subjected to characteristic processing to obtain a processed training data set and test data set, which specifically include:

Selecting a plurality of training sample data from the preset number of positive tumor sample DNA sequencing data, and judging whether each training sample data contains specific characteristic data of only one ctDNA fragment type;

if each training sample data contains specific characteristic data of only one ctDNA fragment type, using the corresponding training sample data as the training sample data set and the test sample data set, and outputting the training sample data set and the test sample data set;

If a training sample data contains specific characteristic data of more than one ctDNA fragment type, obtaining the types of the specific characteristic data of the ctDNA fragments, and taking each type as an independent ctDNA fragment type;

Traversing the DNA sequencing data of the positive tumor samples with the preset number by taking the independent ctDNA fragment types as a reference, and searching for sample data of only one independent ctDNA fragment type;

And selecting a preset number of sample data of only one independent ctDNA fragment type, constructing a training sample data set and a test sample data set, and outputting the training sample data set and the test sample data set.
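The selection of single-type samples can be sketched as follows, assuming each sample has already been annotated with the set of ctDNA fragment types it contains. The function name `build_single_type_datasets`, the type labels, the `per_type` limit and the 25% test split are illustrative assumptions, not values taken from the source.

```python
from collections import defaultdict


def build_single_type_datasets(samples, per_type, test_fraction=0.25):
    """samples: list of (sample_id, set_of_ctdna_types).

    Returns {ctdna_type: (train_ids, test_ids)} built only from samples
    that carry exactly one ctDNA fragment type."""
    by_type = defaultdict(list)
    for sample_id, types in samples:
        if len(types) == 1:                      # discard mixed-type samples
            (only_type,) = tuple(types)
            by_type[only_type].append(sample_id)

    datasets = {}
    for ctdna_type, ids in by_type.items():
        chosen = ids[:per_type]                  # the "preset number" of samples
        n_test = max(1, int(len(chosen) * test_fraction)) if len(chosen) > 1 else 0
        split = len(chosen) - n_test
        datasets[ctdna_type] = (chosen[:split], chosen[split:])
    return datasets


samples = [
    ("s1", {"type_A"}), ("s2", {"type_A", "type_B"}), ("s3", {"type_B"}),
    ("s4", {"type_A"}), ("s5", {"type_A"}), ("s6", {"type_A"}),
]
datasets = build_single_type_datasets(samples, per_type=4)
print(datasets)
```

Sample `s2` carries two types and is therefore left out of every data set, which is the filtering effect the steps above aim for.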

Further, in the method, a tumor DNA fragment screening model is constructed based on a deep neural network, the processed training data set and the test data set are input into the tumor DNA fragment screening model for training and testing, and the trained tumor DNA fragment screening model is obtained, specifically:

Constructing a tumor DNA fragment screening model based on a deep neural network, setting a learning rate, a mean square error as the stopping condition and a maximum number of training neurons, taking a sine function as the activation function and as a training parameter, and focusing attention on the specific characteristic data of ctDNA fragments through the action of an attention module;

The maximum number of the input neurons of the deep neural network is kept consistent with the dimension of the learning sample, the processed training data set is transmitted to a unit in the hidden layer, and all the neurons are weighted and summed at the last layer of the hidden layer;

Dividing the summation value output by the hidden layer through the output neuron to obtain a probability value of an output state, sequencing the probability value of the output state to obtain a sequencing result, and obtaining a dynamic data state with the maximum probability value of the output state through the sequencing result;

And taking the dynamic data state with the maximum probability value of the output state as an output result, continuously training and testing through a test data set, and outputting a trained tumor DNA fragment screening model when the training parameter reaches the mean square error of the stopping condition.
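A forward pass through such a network might look like the sketch below. It is only an illustration under several assumptions that the source does not confirm: the attention module is modelled as a softmax over one score per input feature, there is a single hidden layer with a sine activation, and the step that turns the summed output into probabilities is read as a softmax normalization. All names and the random weights are hypothetical.

```python
import math
import random


def softmax(values):
    shift = max(values)                       # subtract the max for stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, attn_scores, w_hidden, w_out):
    """x: input features; attn_scores: one score per feature;
    w_hidden: hidden-by-input weights; w_out: class-by-hidden weights."""
    attn = softmax(attn_scores)                              # attention weights
    gated = [a * xi for a, xi in zip(attn, x)]               # emphasise features
    hidden = [math.sin(sum(w * g for w, g in zip(row, gated)))
              for row in w_hidden]                           # sine activation
    logits = [sum(w * h for w, h in zip(row, hidden))
              for row in w_out]                              # weighted sums
    probs = softmax(logits)                                  # output probabilities
    ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return probs, ranking[0]                                 # most probable state


rng = random.Random(0)
n_in, n_hidden, n_classes = 6, 4, 3          # input size matches sample dimension
x = [rng.random() for _ in range(n_in)]
attn_scores = [0.0] * n_in                   # uniform attention before training
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_classes)]
probs, best = forward(x, attn_scores, w_hidden, w_out)
print(best, probs)
```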

Further, in the method, positive tumor sample DNA sequencing data to be identified is obtained, positioning analysis is performed on the positive tumor sample DNA sequencing data to be identified according to the trained tumor DNA fragment screening model, and the position of the target tumor DNA fragment is obtained and displayed according to a preset mode, which specifically comprises the following steps:

acquiring DNA sequencing data of a plurality of positive tumor samples to be identified, and inputting the DNA sequencing data of the positive tumor samples to be identified into the trained tumor DNA fragment screening model for analysis;

Obtaining, through the analysis, positive tumor sample DNA sequencing data that contain target tumor DNA fragments, locating the target tumor DNA fragments in those data to obtain their positions, and displaying the positions in a preset manner.
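The source does not describe how positions are derived from the model output. One plausible sketch, assuming the trained model can score a fixed-length window of sequence, is to slide a window along each read and merge overlapping hits into intervals. The function `locate_fragments`, the window size, the threshold and the stand-in scorer `toy_score` are all hypothetical.

```python
def locate_fragments(sequence, score_window, window=8, step=1, threshold=0.5):
    """Slide a window along the sequence, score each window, and merge
    overlapping or adjacent hits into half-open (start, end) intervals."""
    hits = []
    for start in range(0, len(sequence) - window + 1, step):
        if score_window(sequence[start:start + window]) >= threshold:
            end = start + window
            if hits and start <= hits[-1][1]:
                hits[-1] = (hits[-1][0], max(hits[-1][1], end))
            else:
                hits.append((start, end))
    return hits


def toy_score(fragment):
    # Hypothetical stand-in for the trained model: flags one fixed motif.
    return 1.0 if "GTAG" in fragment else 0.0


sequence = "ACGTACGTAGCAACGT"
positions = locate_fragments(sequence, toy_score, window=6)
for start, end in positions:                  # a simple "preset" display format
    print(f"target fragment at [{start}, {end}): {sequence[start:end]}")
```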

The invention remedies the deficiencies described in the background and has the following beneficial effects:

According to the invention, specific characteristics of ctDNA fragments are screened out from the DNA sequencing data of the positive tumor samples and the free DNA sequencing data of the control samples. A training sample data set and a test sample data set are constructed from the preset number of positive tumor sample DNA sequencing data and the specific characteristics of the ctDNA fragments, and are subjected to characteristic processing to obtain the processed training data set and test data set. A tumor DNA fragment screening model is constructed based on a deep neural network, and the processed training data set and test data set are input into it for training and testing to obtain the trained model. Finally, positive tumor sample DNA sequencing data to be identified are obtained and analyzed with the trained model to locate the target tumor DNA fragment, and its position is displayed in a preset manner. By processing training sample data that contain multiple different types of tumor DNA fragments, the invention reduces as far as possible the interference of those fragments within the same tumor training sample during model training, thereby improving the prediction accuracy of the model and the screening accuracy for tumor DNA fragments.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 shows an overall flow chart of a method for deep learning-based screening of target tumor DNA fragments;

FIG. 2 shows a partial flow chart of a method for deep learning-based screening of target tumor DNA fragments.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

As shown in fig. 1, the first aspect of the present invention provides a method for screening target tumor DNA fragments based on deep learning, specifically comprising:

S102, acquiring DNA sequencing data of a preset number of positive tumor samples and free DNA sequencing data of a control sample, and screening out specific characteristics of ctDNA fragments according to the DNA sequencing data of the positive tumor samples and the free DNA sequencing data of the control sample;

S104, constructing a training sample data set and a test sample data set according to the DNA sequencing data of the positive tumor samples with preset quantity and the specific characteristics of ctDNA fragments, and obtaining the training data set and the test data set after the processing by carrying out characteristic processing on the training sample data set and the test sample data set;

S106, constructing a tumor DNA fragment screening model based on a deep neural network, building the model on the processed training data set with a neural network algorithm, and testing model performance on the test data set to obtain the best-performing model;

S108, acquiring positive tumor sample DNA sequencing data to be identified, carrying out positioning analysis on the positive tumor sample DNA sequencing data to be identified according to the trained tumor DNA fragment screening model, acquiring the position of a target tumor DNA fragment, and displaying according to a preset mode.

By processing the training sample data with a plurality of tumor DNA fragments of different types, the method can reduce the interference of the plurality of tumor DNA fragments of different types in the same tumor training sample data to the maximum extent during model training, thereby improving the prediction precision of the model and the screening precision of the tumor DNA fragments.

acquiring index data of initial sample data of the positive tumor, setting a sample evaluation index range, and judging whether the index data of the initial sample data of the positive tumor is within the sample evaluation index range;

It should be noted that the index data of the positive tumor initial sample data include indexes of the test sample such as pH value, osmotic pressure, sterility and freedom from contamination. With this method, samples with abnormal indexes are removed and only normally tested data are retained.
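The index check can be sketched as a simple range filter. The numeric ranges, field names and sample records below are placeholders chosen for illustration; the source gives no concrete evaluation ranges.

```python
# Hypothetical acceptance ranges; the source does not specify any values.
INDEX_RANGES = {
    "ph": (7.0, 7.6),
    "osmotic_pressure": (275.0, 300.0),
}


def passes_quality_check(sample, ranges=INDEX_RANGES):
    """True when every numeric index is within range and the sample is
    sterile and uncontaminated."""
    in_range = all(low <= sample[name] <= high
                   for name, (low, high) in ranges.items())
    return in_range and sample["sterile"] and not sample["contaminated"]


samples = [
    {"id": "s1", "ph": 7.4, "osmotic_pressure": 290.0,
     "sterile": True, "contaminated": False},
    {"id": "s2", "ph": 6.5, "osmotic_pressure": 290.0,
     "sterile": True, "contaminated": False},   # pH out of range
    {"id": "s3", "ph": 7.3, "osmotic_pressure": 285.0,
     "sterile": True, "contaminated": True},    # contaminated
]
accepted = [s["id"] for s in samples if passes_quality_check(s)]
print(accepted)
```

Only samples that pass this check would go on to sequencing data extraction.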

Further, in the method, specific characteristics of ctDNA fragments are screened according to DNA sequencing data of positive tumor samples and free DNA sequencing data of control samples, and specifically include:

Through traversing, the same part and the non-same part between the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of the control sample are obtained, and the DNA sequencing data of the non-same part between the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of the control sample is extracted;

DNA sequencing data of a non-identical portion between DNA sequencing data of a positive tumor sample and free DNA sequencing data of a control sample is taken as the specific characteristics of the ctDNA fragment, and the specific characteristics of the ctDNA fragment are output.

It should be noted that by traversing the DNA sequencing data of the positive tumor sample and the free DNA sequencing data of the control sample with the graph traversal algorithm, the DNA sequencing data of the non-identical portion can be extracted, and thereby the specific characteristics of the ctDNA fragment.

Further, in the method, a training sample data set and a test sample data set are constructed according to the preset number of positive tumor sample DNA sequencing data and the specific characteristics of ctDNA fragments, and the training data set and the test data set after being processed are obtained by performing characteristic processing on the training sample data set and the test sample data set, which specifically comprises the following steps:

if only one specific characteristic data of ctDNA fragment exists in each training sample data, the corresponding training sample data are used as a training sample data set and a test sample data set, and the training sample data set and the test sample data set are output;

If a training sample data contains specific characteristic data of more than one ctDNA fragment type, the types of the specific characteristic data of the ctDNA fragments are obtained, and each type is taken as an independent ctDNA fragment type;

Traversing the DNA sequencing data of a preset number of positive tumor samples by taking the independent ctDNA fragment types as a reference, and searching for sample data of only one independent ctDNA fragment type;

It should be noted that one training sample data may contain one or more ctDNA fragment types. By traversing the preset number of positive tumor sample DNA sequencing data and searching for sample data containing only one independent ctDNA fragment type, a training sample data set is formed for each of the independent ctDNA fragment types. Training the model separately on these training sample data sets avoids interference between ctDNA fragment types during training and improves the accuracy of identifying and locating each ctDNA fragment type.

As shown in FIG. 2, further, in the method, a tumor DNA fragment screening model is constructed based on a deep neural network, and the processed training data set and test data set are input into the tumor DNA fragment screening model for training and testing to obtain the trained tumor DNA fragment screening model, which specifically comprises:

S202, constructing a tumor DNA fragment screening model based on a deep neural network, setting a learning rate, a mean square error as the stopping condition and a maximum number of training neurons, taking a sine function as the activation function and as a training parameter, and focusing attention on the specific characteristic data of ctDNA fragments through the action of an attention module;

S204, keeping the maximum number of the input neurons of the deep neural network consistent with the dimension of the learning sample, transmitting the processed training data set to a unit in the hidden layer, and carrying out weighted summation on all the neurons in the last layer of the hidden layer;

S206, dividing the summation value output by the hidden layer through the output neuron to obtain a probability value of an output state, sorting the probability values of the output state to obtain a sorting result, and obtaining a dynamic data state with the maximum probability value of the output state through the sorting result;

S208, taking the dynamic data state with the maximum probability value of the output state as the output result, continuously training and testing with the test data set, and outputting the trained tumor DNA fragment screening model when the training parameter reaches the mean square error of the stopping condition.

By the method, attention can be focused on specific characteristic data of ctDNA fragments through interaction of an attention module and each layer, so that prediction accuracy of a model is improved.
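The stopping rule in steps S202 to S208 can be illustrated with a small training loop. This is a sketch only, assuming a single hidden layer with sine activation, a scalar output, plain stochastic gradient descent, and no attention module; the data, learning rate and target error are invented for the example.

```python
import math
import random


def train(data, n_hidden=4, lr=0.05, target_mse=0.01, max_epochs=2000, seed=0):
    """Train until the epoch's mean squared error reaches target_mse or
    max_epochs is exhausted. Returns (w1, w2, mse, epochs_run)."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    mse = float("inf")
    for epoch in range(1, max_epochs + 1):
        total = 0.0
        for x, y in data:
            z = [sum(w * xi for w, xi in zip(row, x)) for row in w1]
            h = [math.sin(v) for v in z]                  # sine activation
            err = sum(w * hj for w, hj in zip(w2, h)) - y
            total += err * err                            # error before update
            for j in range(n_hidden):
                grad_h = err * w2[j]                      # uses pre-update w2[j]
                w2[j] -= lr * err * h[j]
                for i in range(n_in):                     # d sin(z)/dz = cos(z)
                    w1[j][i] -= lr * grad_h * math.cos(z[j]) * x[i]
        mse = total / len(data)
        if mse <= target_mse:                             # stopping condition
            break
    return w1, w2, mse, epoch


# Toy data: two features per sample and a 0/1 label (illustrative only).
data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0), ([0.1, 0.9], 1.0), ([0.9, 0.2], 0.0)]
w1, w2, mse, epochs = train(data)
print(epochs, mse)
```

The constant factor 2 from differentiating the squared error is absorbed into the learning rate.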

Further, in the method, positive tumor sample DNA sequencing data to be identified is obtained, positioning analysis is performed on the positive tumor sample DNA sequencing data to be identified according to a trained tumor DNA fragment screening model, and the position of a target tumor DNA fragment is obtained and displayed according to a preset mode, and specifically comprises the following steps:

acquiring DNA sequencing data of a plurality of positive tumor samples to be identified, and inputting the DNA sequencing data of the positive tumor samples to be identified into a trained tumor DNA fragment screening model for analysis;

Furthermore, the method comprises the following steps:

Acquiring the maximum data processing capacity of the data processing system in unit time under each working temperature environment through big data, and constructing a maximum data processing capacity prediction model of the processing system based on a deep neural network;

Introducing a graph neural network, inputting the maximum data processing capacity of the data processing system within unit time under each working temperature environment into the graph neural network, taking the working temperature environment as a first node and taking the maximum data processing capacity as a second node;

Constructing a directional description relation, connecting the first node with a second node based on the directional description relation, constructing a topological structure diagram, acquiring a related adjacency matrix based on the topological structure diagram, and inputting the adjacency matrix into a maximum data processing amount prediction model of the processing system for training;

obtaining a maximum data processing amount prediction model of a processing system after training, obtaining working temperature information of the data processing system within preset time, inputting the working temperature information of the data processing system within the preset time into the maximum data processing amount prediction model of the processing system for prediction, and obtaining the maximum data processing amount of the data processing system within unit time under the current working temperature environment;

And setting a preset quantity of training sample data according to the maximum data processing amount of the data processing system in the unit time under the current working temperature environment, and dynamically adjusting the quantity of the training sample data in the unit time according to the maximum data processing amount of the data processing system in the unit time under the current working temperature environment.

It should be noted that, in practice, the maximum data processing amount of the data processing system differs at different working temperatures, and an appropriate training quantity keeps the data processing system in a normal working state.
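The topological structure diagram and its adjacency matrix can be sketched as follows, assuming each observation is a (working temperature, maximum data processing amount) pair and the directional relation points from the temperature node to the processing-amount node. The prediction model itself is not reproduced here; `samples_per_unit_time` simply shows how a predicted capacity could bound the number of training samples. All names and numbers are hypothetical.

```python
def build_adjacency(pairs):
    """pairs: list of (temperature, max_processing_amount).

    Returns the node list and an adjacency matrix with a directed edge
    from each temperature node to its processing-amount node."""
    temps = sorted({t for t, _ in pairs})
    loads = sorted({m for _, m in pairs})
    nodes = [("temp", t) for t in temps] + [("load", m) for m in loads]
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for t, m in pairs:
        matrix[index[("temp", t)]][index[("load", m)]] = 1
    return nodes, matrix


def samples_per_unit_time(max_processing_amount, cost_per_sample):
    """Largest number of training samples that fits the given capacity."""
    return int(max_processing_amount // cost_per_sample)


observations = [(25, 1000), (35, 900), (45, 700)]   # illustrative values only
nodes, adjacency = build_adjacency(observations)
batch = samples_per_unit_time(700, 32)              # capacity assumed at 45 degrees
print(nodes)
print(adjacency)
print(batch)
```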

In addition, dynamically adjusting the number of training sample data per unit time according to the maximum data processing amount of the data processing system per unit time in the current working temperature environment specifically comprises:

Introducing a genetic algorithm, setting the number of generations for the genetic algorithm, acquiring the maximum data processing amount of the data processing system per unit time in the current working temperature environment, and initializing the training quantity of training sample data;

Acquiring the data processing amount occupied per unit time by each unit of training quantity during training, and calculating the real-time total data processing amount from that per-unit amount and the training quantity of training sample data;

Judging whether the real-time total data processing amount is larger than the maximum data processing amount of the data processing system per unit time in the current working temperature environment, and if not, outputting the training quantity of training sample data and training according to that training quantity;

If the real-time total data processing amount is larger than the maximum data processing amount of the data processing system per unit time in the current working temperature environment, evolving according to the set number of generations and re-planning the training quantity of training sample data until the real-time total data processing amount is not larger than that maximum data processing amount.

The method can dynamically adjust the training quantity of the training sample data according to the change of the working temperature of the data processing system, and improves the training rationality during training.
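A minimal genetic-algorithm sketch of this re-planning step is given below. The source names only the number of generations; the population size, the averaging crossover, the mutation range and the fitness function are assumptions made for illustration, as are the function name `plan_training_quantity` and the numbers in the usage lines.

```python
import random


def plan_training_quantity(initial, cost_per_sample, max_load,
                           generations=30, pop_size=12, seed=0):
    """Return a training quantity whose total load does not exceed max_load."""
    rng = random.Random(seed)

    def fitness(quantity):
        total = quantity * cost_per_sample     # real-time total processing amount
        return quantity if total <= max_load else -total   # penalise overload

    if fitness(initial) >= 0:                  # not larger than the maximum: keep
        return initial

    # Candidate 0 is always feasible, so a feasible answer always exists.
    population = [rng.randint(0, initial) for _ in range(pop_size)] + [0]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]   # selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2 + rng.randint(-2, 2)   # crossover plus mutation
            children.append(min(max(child, 0), initial))
        population = parents + children
    return max(population, key=fitness)


quantity = plan_training_quantity(initial=100, cost_per_sample=32, max_load=700)
unchanged = plan_training_quantity(initial=10, cost_per_sample=32, max_load=700)
print(quantity, unchanged)
```

For a single scalar quantity the optimum could be computed directly by integer division; the genetic search is shown only because the method above specifies it.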

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be made through certain interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, the functional units in each embodiment of the present invention may all be integrated in one processing unit, or each unit may serve separately as a unit, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware or in the form of hardware plus software functional units.

It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.

Alternatively, the above-described integrated units of the invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods of the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk.

The foregoing describes merely specific embodiments of the present invention, but the scope of protection of the invention is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

1. The target tumor DNA fragment screening method based on deep learning is characterized by comprising the following steps of:

2. The deep learning-based target tumor DNA fragment screening method according to claim 1, wherein a predetermined number of positive tumor sample DNA sequencing data and free DNA sequencing data of a control sample are obtained, specifically:

3. The deep learning-based target tumor DNA fragment screening method according to claim 1, wherein specific characteristics of ctDNA fragments are screened according to the positive tumor sample DNA sequencing data and free DNA sequencing data of a control sample, and specifically comprising:

4. The deep learning-based target tumor DNA fragment screening method of claim 1, wherein a training sample dataset and a test sample dataset are constructed according to the preset number of positive tumor sample DNA sequencing data and specific characteristics of ctDNA fragments, and the training sample dataset and the test sample dataset are subjected to characteristic processing to obtain a processed training dataset and test dataset, which specifically comprise:

5. The deep learning-based target tumor DNA fragment screening method according to claim 1, wherein a tumor DNA fragment screening model is constructed based on a deep neural network, the training set after processing is modeled by a neural network algorithm, and a test set is used for performing a model performance test to obtain a performance optimal model, specifically:

6. The deep learning-based target tumor DNA fragment screening method according to claim 1, wherein positive tumor sample DNA sequencing data to be identified is obtained, positioning analysis is performed on the positive tumor sample DNA sequencing data to be identified according to the trained tumor DNA fragment screening model, and positions of target tumor DNA fragments are obtained and displayed according to a preset mode, and the method specifically comprises the steps of: