CN115762646B

CN115762646B - A pan-cancer common driver pathway identification method based on GAN sample balance

Info

Publication number: CN115762646B
Application number: CN202211581374.8A
Authority: CN
Inventors: 欧阳扬; 吴璟莉; 李高仕; 朱凯; 龚艳霞
Original assignee: Guangxi Normal University
Current assignee: Guangxi Normal University
Priority date: 2022-12-09
Filing date: 2022-12-09
Publication date: 2025-02-14
Anticipated expiration: 2042-12-09
Also published as: CN115762646A

Abstract

The invention discloses a pan-cancer common driver pathway identification method based on GAN sample balance, comprising the following steps: 1) generating somatic mutation data for each cancer that conform to the real data distribution; 2) constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers; and 3) introducing a single-parent genetic algorithm to solve the model CDP-HA. The method provides a useful tool for identifying pan-cancer common driver pathways and has strong extensibility and practicability.

Description

Pan-cancer common driver pathway identification method based on GAN sample balance

Technical Field

The invention relates to the field of bioinformatics and is used for identifying cancer driver pathways; in particular, it relates to a pan-cancer common driver pathway identification method based on GAN sample balance.

Background

Cancer is a highly complex disease that threatens human health, and its etiology involves a variety of genetic and environmental factors. Understanding the mechanism of carcinogenesis at the molecular level is a great challenge, and progress on it facilitates the diagnosis and treatment of cancer and the design of anticancer drugs. With the rapid development of next-generation sequencing (NGS) technologies, researchers can better characterize cancer at the molecular level. Several large cancer genome projects, namely The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the Cancer Cell Line Encyclopedia (CCLE), have generated and analyzed vast amounts of data, providing unprecedented opportunities to further understand the molecular and oncogenic mechanisms of cancer. Previous studies have shown that only functional driver mutations promote cancer progression, whereas passenger mutations have little impact on it. Distinguishing driver mutations from passenger mutations has therefore become an important task in studying cancer pathogenesis.

Early studies largely focused on designing methods that identify individual driver genes with significantly elevated mutation rates. However, cancer mutations are highly heterogeneous: different patients with the same cancer may carry mutations in different driver genes. Identifying single driver genes is therefore not sufficient for understanding the mechanism of cancer progression. Further studies have shown that cancer is often caused by the disruption of certain pathways (cell signaling or regulatory pathways), which may be perturbed by different combinations of driver mutations. Identifying driver pathways is therefore key to understanding carcinogenic mechanisms at the pathway level. Currently, driver pathway identification can be divided into three directions: identifying a single driver pathway, identifying cooperative driver pathways, and identifying pan-cancer common and specific driver pathways. The present work addresses the identification of pan-cancer common driver pathways.

Identifying common driver pathways at the pan-cancer scale aims to uncover commonalities that may exist between different cancer types, which helps to deepen the understanding of cancer pathogenesis. The TCGA Pan-Cancer project has collected multi-platform mutation data from thousands of patients across 12 cancer types, providing an opportunity to investigate such problems. One class of methods relies on prior knowledge, typically gene-gene interaction (GGI) networks, protein-protein interaction (PPI) networks, and pathway-pathway interaction (PaPaI) networks. Although these methods can achieve good identification results, relying on prior knowledge may, on the one hand, miss better combinations of mutated genes and, on the other hand, limit the scope of the pathway search, because existing prior knowledge is incomplete and contains only part of the pathway information. For example, Leiserson et al. proposed the HotNet method based on a directed heat diffusion model, which tries to obtain pathways and protein complexes by incorporating protein-protein interaction network analysis. Kim et al. studied different types of mutual exclusivity across cancer types and proposed the MEMCover method for identifying subnetworks/pathways based on the HumanNet network. Hajkarim et al. proposed the DAMOKLE algorithm based on a large gene-gene interaction network, which attempts to identify subnetworks with significant differences in sample mutation frequency between two cancers. The other class consists of de novo identification methods. Zhang et al. proposed the ComMDP method, which exploits two characteristics of driver pathways, high mutual exclusivity and high coverage, and directly extends the maximum weight submatrix problem model for a single cancer to multiple cancer types, attempting to identify common driver pathways by accumulating absolute weight values. Wu et al. introduced the CDP-V model, which uses relative proportions instead of absolute numbers and uses the variance to minimize the dispersion of the proportions; they also proposed the CDP-H model, which uses the harmonic mean to minimize the dispersion of the proportions and thereby reduces the number of parameters. Both models attempt to identify common driver pathways through relative weight values.

Among these methods, ComMDP uses absolute weight values and obtains the gene set with the maximum weight by accumulating the absolute weight values of each cancer type. It does not consider the imbalance of sample sizes among cancer types, so when the sample sizes differ greatly the identification result is biased toward the cancer types with larger sample sizes, and some driver pathways with higher commonality may be missed. The PGA-V method alleviates the sample size imbalance by using relative weights, but it introduces a manually set parameter whose determination requires a large number of tedious experiments, which hinders its extension to practical applications. Moreover, the sample size imbalance itself is not actually resolved.

With the rapid development of deep learning, the problem of data imbalance has become increasingly pressing. In 2012, a data augmentation strategy was proposed that generates additional data items fitting the distribution of the real data by transforming existing real data. However, no better data generation method was available at that time, until Goodfellow et al. proposed in 2014 a powerful generative model based on game theory, the generative adversarial network (GAN). Although many deep-learning-based generative models had been created earlier, GANs are among the most successful and have been applied to data augmentation in various fields.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a pan-cancer common driver pathway identification method based on GAN sample balance, which is a useful tool for identifying pan-cancer common driver pathways and has strong extensibility and practicability.

The technical scheme for realizing the aim of the invention is as follows:

A pan-cancer common driver pathway identification method based on GAN sample balance comprises the following steps:

1) Generating somatic mutation data for each cancer that conform to the real data distribution:

1.1 Setting up the generative adversarial network framework:

Assume an example training set with m_r samples and n_r genes. The generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows:

1.1.1 The input layer uses GFC1 to map the noise vector z into a tensor zn of dimension (1, 128);

1.1.2 The hidden layer maps the tensor zn of step 1.1.1) through GFC2, maps the result through GFC3, and then likewise through GFC4, finally yielding a tensor zn′ of dimension (1, 1024);

1.1.3 The output layer maps the tensor zn′ to a tensor gn of dimension (1, m_r × n_r) through GFC5 and then reshapes gn into a tensor of dimension (m_r, n_r), which is the output of the generator;

Both the input layer and the hidden layer use the dropout function to deactivate some neurons and adopt the activation function ReLU defined by formula (1); the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x). The input of the discriminator is either real data x ~ P_real or generated data, where x represents a set of somatic mutation data samples. The discriminator is defined as follows:

1.1.4 The input layer uses DFC1 to map x into a tensor xn of dimension (m_r, 256);

1.1.5 The hidden layer maps the tensor xn of step 1.1.4) through DFC2, maps the result through DFC3, and then likewise through DFC4, finally yielding a tensor xn′ of dimension (m_r, 16);

1.1.6 The output layer maps the tensor xn′ to a tensor dn of dimension (m_r, 1) through DFC5, which is the output of the discriminator; both the input layer and the hidden layer use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);

ReLU(x) = max(0, x) (1),

Sigmoid(x) = 1 / (1 + e^(−x)) (2);
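The layer structure of the generator described above can be illustrated with a minimal pure-Python sketch. It is not the SNV-GANs implementation: the layer widths, the random weights, and the helper names (`dense`, `random_layer`, `toy_generator`) are assumptions chosen for a toy example, and dropout is omitted. It only shows how fully connected layers with ReLU, a Sigmoid output layer, and the final reshape to (m_r, n_r) fit together.

```python
import math
import random


def relu(x):
    # Formula (1): ReLU(x) = max(0, x)
    return max(0.0, x)


def sigmoid(x):
    # Standard logistic function, output strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))


def dense(vec, weights, bias, act):
    """One fully connected layer: act(W * vec + b)."""
    return [act(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def random_layer(n_in, n_out, rng):
    # Hypothetical initialisation: small Gaussian weights, zero bias
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def toy_generator(z, layers, m_r, n_r):
    """Noise vector -> ReLU layers -> Sigmoid output -> (m_r, n_r) matrix."""
    h = z
    for weights, bias in layers[:-1]:
        h = dense(h, weights, bias, relu)
    weights, bias = layers[-1]
    flat = dense(h, weights, bias, sigmoid)  # length m_r * n_r
    return [flat[i * n_r:(i + 1) * n_r] for i in range(m_r)]


rng = random.Random(0)
m_r, n_r = 3, 5                    # toy sizes, not the 95 x 211 of the embodiment
widths = [4, 8, 8, m_r * n_r]      # toy widths, not 100 -> 128 -> ... -> 1024
layers = [random_layer(a, b, rng) for a, b in zip(widths, widths[1:])]
z = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
g = toy_generator(z, layers, m_r, n_r)
```

Because the output layer is a Sigmoid, every entry of `g` lies between 0 and 1, which is what makes the later thresholding into a binary mutation matrix possible.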

1.2 Training process of SNV-GANs:

1.2.1 Given a somatic mutation matrix A_r (m_r × n_r) and a sampling ratio parameter ∝ (∝ < 1), randomly extract from A_r, according to ∝, a submatrix M_r containing m = m_r × ∝ samples; extract 64 such submatrices M_r in total to construct the training set X, and input X into the generative adversarial network for training;

1.2.2 Initialize the parameters θ_d of the discriminator D(·) and the parameters θ_g of the generator G(·);

1.2.3 Let the current round epoch = 1 and randomly generate a 1 × 100 Gaussian-distributed noise vector z;

1.2.4 Use z of step 1.2.3) as the input of the generator to obtain generated data of size m × n_r;

1.2.5 Calculate the generator loss value loss_G according to equation (3), and then update the parameter θ_g of the generator;

In equation (3), G(z⁽ⁱ⁾) denotes the data produced by the generator from the noise vector z⁽ⁱ⁾, and D(G(z⁽ⁱ⁾)) denotes the probability that the discriminator judges the generated data to be real data; the smaller loss_G is, the better;

1.2.6 Randomly extract a sample set x from the training set X;

1.2.7 Randomly generate a 1 × 100 Gaussian-distributed noise vector z;

1.2.8 Use z of step 1.2.7) as the input of the generator to obtain generated data of size m × n_r;

1.2.9 Calculate the discriminator loss value loss_D according to equation (4), and then update the parameter θ_d of the discriminator;

In equation (4), D(x⁽ⁱ⁾) denotes the probability that the discriminator judges the real data to be real data, and 1 − D(G(z⁽ⁱ⁾)) denotes the probability that the discriminator judges the generated data to be generated data; the larger loss_D is, the better;

1.2.10 Check whether the current round epoch has reached the set maximum number of rounds; if so, stop training; otherwise return to step 1.2.3). The trained generator G(·) is finally obtained;
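For orientation, the standard GAN objectives of Goodfellow et al. have the shape that the descriptions of loss_G (to be minimized, built on D(G(z⁽ⁱ⁾))) and loss_D (to be maximized, built on D(x⁽ⁱ⁾) and 1 − D(G(z⁽ⁱ⁾))) suggest. The following is a sketch under that assumption and is not necessarily the exact form of equations (3) and (4); the batch size b is a symbol introduced here for illustration only.

```latex
% Assumed standard GAN losses; b = number of samples in a batch (hypothetical symbol)
\mathrm{loss}_G = \frac{1}{b}\sum_{i=1}^{b} \log\!\bigl(1 - D\bigl(G(z^{(i)})\bigr)\bigr)

\mathrm{loss}_D = \frac{1}{b}\sum_{i=1}^{b} \Bigl[\log D\bigl(x^{(i)}\bigr) + \log\!\bigl(1 - D\bigl(G(z^{(i)})\bigr)\bigr)\Bigr]
```

Under this reading, the generator lowers loss_G by making D(G(z⁽ⁱ⁾)) approach 1, while the discriminator raises loss_D by assigning high probability to real samples and low probability to generated ones.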

1.3 Data processing:

1.3.1 Randomly generate a 1 × 100 Gaussian-distributed noise vector z;

1.3.2 Input the vector z of step 1.3.1) into the trained generator G(·) to obtain the generated data G_data = G(z);

1.3.3 Set every value of G_data that is greater than or equal to 0.85 to 1 and every value that is less than 0.85 to 0, obtaining a new binary matrix A_fakedata;

1.3.4 Take the largest sample size among the somatic mutation matrices of the R cancers as m_max, 0 < max < R, and insert the matrix A_r into an augmented matrix; at this point the augmented matrix contains the m_r samples of A_r;

1.3.5 If the number of samples that still need to be added is greater than the sample number m_r of matrix A_r, perform step 1.3.6); if it is smaller than the sample number m_r of matrix A_r, perform step 1.3.8);

1.3.6 Randomly extract m_r samples from A_fakedata of step 1.3.3) to form an extracted matrix, and calculate the mutation rate of each gene in the extracted matrix and in the matrix A_r, obtaining two corresponding mutation probability sets V and Q;

1.3.7 Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a divergence value; the smaller this value, the more similar the mutation rates of the extracted matrix are to those of matrix A_r. If the value is less than 0.09, insert the extracted matrix into the augmented matrix and update the sample count of the augmented matrix; otherwise repeat step 1.3.6);

1.3.8 Randomly extract one sample from A_fakedata of step 1.3.3), add it directly to the augmented matrix, and update the sample count of the augmented matrix;

1.3.9 If the sample count of the current augmented matrix equals the maximum sample size m_max, sample supplementation is finished; otherwise return to step 1.3.5). An augmented matrix whose sample count equals m_max is finally obtained and is then used as the matrix A_r in the subsequent steps;

2) Constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:

Suppose there are R (R ≥ 2) cancer types. For each cancer type, a binary somatic mutation matrix A_r records whether each gene is mutated in each sample; it has m_r rows and n_r columns, the rows representing samples (patients) and the columns representing genes, r = 1, 2, ..., R. a_i· denotes the i-th sample in matrix A_r and a_·j denotes the j-th gene in matrix A_r. When the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, the corresponding entry a_ij is 1; otherwise it is 0. Given a gene set S of size k, the following are defined for each cancer: the submatrix of size m_r × k of the matrix A_r that corresponds to S; for each gene g in this submatrix, the set of samples in which g is mutated; the total number of samples covered by the submatrix, which measures the coverage of the gene set S; and the total of overlapping covered samples, which measures the mutual exclusivity of the gene set S;

According to the notation and problem definition above, the nonlinear maximum-weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of R cancer types and a parameter K, determine an m × K submatrix that maximizes the weight function W_C(S), which is defined by formula (6);

In formula (6), the per-cancer term represents the absolute weight value of the gene set S in the r-th cancer type;
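The coverage and mutual-exclusivity quantities defined above can be sketched in Python as follows. The function names are hypothetical, and the per-cancer absolute weight is taken here to be coverage minus coverage overlap, as in the maximum weight submatrix formulation mentioned in the Background; that choice is an assumption. The aggregation across cancers in formula (6) is deliberately not sketched.

```python
def coverage_and_overlap(matrix, gene_set):
    """Return (covered samples, coverage overlap) of gene_set in a 0/1 matrix.

    matrix   : list of rows (samples), each a list of 0/1 entries per gene
    gene_set : list of column indices
    """
    covered = 0          # samples with at least one mutated gene in gene_set
    total_mutations = 0  # sum over genes of the samples each gene covers
    for row in matrix:
        hits = sum(row[g] for g in gene_set)
        total_mutations += hits
        if hits > 0:
            covered += 1
    overlap = total_mutations - covered  # extra coverage beyond the first hit
    return covered, overlap


def absolute_weight(matrix, gene_set):
    # Assumed definition: coverage minus coverage overlap
    covered, overlap = coverage_and_overlap(matrix, gene_set)
    return covered - overlap


# Toy mutation matrix: 4 samples x 4 genes
toy = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
cov, ovl = coverage_and_overlap(toy, [0, 1, 2])
weight = absolute_weight(toy, [0, 1, 2])
```

In the toy matrix, genes 0 to 2 cover all four samples, and the last sample is hit twice, so the overlap is 1 and the weight is 3. A perfectly mutually exclusive gene set would have overlap 0.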

3) Introducing a single-parent genetic algorithm to solve the model CDP-HA:

3.1 Setting a fitness function:

Given a chromosome E, let M_E denote the submatrix corresponding to the chromosome, of size m × K. The fitness function Fitness(E) is defined by formula (7); the larger the fitness value, the better the feasible solution;

Fitness(E)=W_C(M_E) (7);

3.2 Setting a selection operator:

A new generation of the population is produced by roulette selection combined with an elite strategy: the individual with the highest fitness is inherited directly by the offspring generation from the parent generation, and the roulette selection operator is then used to generate the remaining N−1 individuals;
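A minimal sketch of this selection operator is given below. The function name `next_generation`, the shifting of fitness values so that roulette weights are positive, and the stand-in fitness used in the example are assumptions; the model's real fitness is W_C(M_E) from formula (7).

```python
import random


def next_generation(population, fitness, rng):
    """Elitism plus roulette-wheel selection; returns a population of equal size."""
    scores = [fitness(ch) for ch in population]
    elite = population[scores.index(max(scores))]
    # Assumption: shift scores so every roulette weight is positive,
    # which also handles negative or all-equal fitness values.
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]
    rest = rng.choices(population, weights=weights, k=len(population) - 1)
    return [elite] + rest


rng = random.Random(1)
pop = [[0, 1, 2], [3, 4, 5], [7, 8, 9]]
new_pop = next_generation(pop, fitness=sum, rng=rng)  # sum() is a stand-in fitness
```

The elite individual is always placed first, so the best fitness in the population never decreases through selection alone.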

3.3 Setting a recombination operator:

A recombination operator based on a greedy strategy is adopted, with the following steps. First, given a parent chromosome E = {e₁, e₂, ..., e_K} (e_i = 1, 2, ..., n), where e_i represents a gene number (E is therefore also referred to as a gene set), a candidate gene set is determined from it. Then one gene is randomly deleted from the gene set E to obtain the gene set E′. Finally, the optimal gene is selected from the candidate set according to the greedy strategy and added to E′, generating the final new offspring;
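The greedy recombination step can be sketched as follows. Two points are assumptions made for illustration: the candidate set is taken to be all genes not in the parent, and the "optimal" gene is the one that maximizes the fitness of E′ plus that gene. The function name and the stand-in fitness are hypothetical.

```python
import random


def greedy_recombine(parent, n_genes, fitness, rng):
    """Drop one random gene from parent, then greedily add the best candidate."""
    # Assumption: candidates are all genes outside the parent chromosome
    candidates = [g for g in range(n_genes) if g not in parent]
    removed = rng.choice(parent)
    reduced = [g for g in parent if g != removed]          # gene set E'
    best_gene = max(candidates, key=lambda g: fitness(reduced + [g]))
    return reduced + [best_gene]


rng = random.Random(2)
parent = [0, 1, 2]
child = greedy_recombine(parent, n_genes=10, fitness=sum, rng=rng)  # sum() is a stand-in
```

With the stand-in fitness `sum`, the gene added is always the largest index outside the parent, so the child keeps two parental genes and gains gene 9.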

3.4 Setting parameters:

Input the augmented somatic mutation matrices A_r of the R cancers, the number of genes g_number, the parameter K, the population size N, the number of algorithm executions t, and the maximum number of generations maxg;

3.5 Constructing an initial population:

Chromosomes are encoded in decimal. One chromosome represents one individual and encodes a solution vector of the problem. In the single-parent genetic algorithm, a set of K genes is used as a solution, namely E = {e₁, e₂, ..., e_K} (e_i = 1, 2, ..., n). An individual in the population is initialized as follows: a random sequence of the natural numbers 1 to n is generated, each number representing one gene in the mutation matrix, and the n genes are grouped in order, giving n/K gene sets S₁, S₂, ..., S_{n/K}. The gene set S_max is determined among them, and its genes form an initial chromosome (the first K numbers are selected as an initial chromosome). N initial chromosomes generated in this way form the initial population pop₀. The fitness values of the chromosomes in pop₀ are calculated and compared, the best individual is stored in the variable best, and the initial iteration number is set to step = 0;

3.6 Performing an iterative operation:

3.6.1 If step > maxg, go to step 3.6.5) to obtain the common driver pathway of size K; otherwise go to step 3.6.2);

3.6.2 For the population pop_step, first put the chromosome best with the highest fitness value in pop_step into pop_{step+1}, and then execute the roulette selection operator to select the remaining N−1 chromosomes and put them into pop_{step+1};

3.6.3 If step < 0.7 × maxg or Fitness(E′) > Fitness(E), update the chromosome E = E′; otherwise do not update and retain E. Set step = step + 1;

3.6.4 Take the chromosome with the highest fitness value in pop_{step+1}; if its fitness value is larger than that of the chromosome best, update best to that chromosome;

3.6.5 Convert the chromosome best into a gene set to obtain the submatrix M and output it; the output M is the common driver pathway S of size K.

In this technical scheme, a generative adversarial network (GAN) is used to generate somatic mutation samples for cancers with few samples so that the sample sizes of the cancer types are balanced; the mathematical model CDP-HA, which aggregates the absolute weight values of the cancer types using the harmonic mean, is then applied; and finally a single-parent genetic algorithm is used to solve the model. For multiple cancer types, the method can effectively supplement cancers with small sample sizes and resolve the difference in sample numbers. Moreover, the gene sets identified with the proposed model are mutated in most samples of these cancers, and the proportions of mutated samples are very close across the individual cancers. In addition, the method detects biologically meaningful gene sets that are missed by other methods. Reducing the difference in sample size between cancer types with a generative adversarial network thus provides a new approach to identifying pan-cancer common driver pathways.

Compared with the prior art, the technical scheme has the following advantages:

(1) A nonlinear weight function that is maximized while minimizing dispersion is designed to measure the relative weights of multiple cancer types.

(2) The data augmentation method SNV-GANs, suited to cancer somatic mutation data, has high practical value; this is also the first application of GANs to cancer somatic mutation data.

(3) The pan-cancer common driver pathways found by the overall technical scheme contain more genes enriched in the same important signaling pathway, and the identified genes are enriched in more important signaling pathways.

The method serves as a useful tool both for augmenting cancer somatic mutation data and for identifying pan-cancer common driver pathways. It can provide more biological information, has strong extensibility and practicability, can identify more genes enriched in important signaling pathways, and can identify genes enriched in more important signaling pathways.

Drawings

FIG. 1 is a diagram showing an example of a common driver pathway in the embodiment;

FIG. 2 is an example diagram of the single-parent genetic algorithm in the embodiment;

FIG. 3 is the network model of SNV-GANs in the embodiment;

FIG. 4 is a schematic diagram of the training process of SNV-GANs in the embodiment;

FIG. 5 is the pseudocode of the SNV-GANs training process in the embodiment;

FIG. 6 is a diagram showing an example of the actual effect of the generated data in the embodiment.

Detailed Description

The invention will now be described in further detail with reference to the drawings and specific examples, which are not intended to limit the invention thereto.

Examples:

Step 1) of the experiment was run on a Linux server (Intel(R) Xeon(R) Gold 6230 2.10 GHz CPU, 256 GB memory, 32 GB video memory) in a Python 3.7.9 environment. Steps 2) and 3) were run on a computer (Intel(R) Core(TM) i5-6500 3.20 GHz CPU, 32 GB memory) under Windows 10, using Eclipse 4.23 as the development tool and Java 1.8.0 as the runtime environment.

This embodiment is described with respect to the problem of identifying pan-cancer common driver pathways.

Assume somatic mutation data A_r for three cancer types: A₁ (COADCORE, 95 samples, 211 genes), A₂ (BLCA, 95 samples, 211 genes), and A₃ (LUAD, 95 samples, 211 genes).

1.1 Setting up the generative adversarial network framework:

Taking the BLCA somatic mutation data A₂ as the example training set, the generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows, as shown in FIG. 3:

1.1.3 The output layer maps the tensor zn′ to a tensor gn of dimension (1, 95 × 211) through GFC5 and then reshapes gn into a tensor of dimension (95, 211), which is the output of the generator;

Both the input layer and the hidden layer use the dropout function to deactivate some neurons and adopt the activation function ReLU defined by formula (1); the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x). The input of the discriminator is either real data x ~ P_real or generated data, where x represents a set of somatic mutation data samples. The discriminator is defined as follows, as shown in FIG. 3:

1.1.4 The input layer uses DFC1 to map x into a tensor xn of dimension (95, 256);

1.1.5 The hidden layer maps the tensor xn of step 1.1.4) through DFC2, maps the result through DFC3, and then likewise through DFC4, finally yielding a tensor xn′ of dimension (95, 16);

1.1.6 The output layer maps the tensor xn′ to a tensor dn of dimension (95, 1) through DFC5, which is the output of the discriminator; both the input layer and the hidden layer use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);

ReLU(x) = max(0, x) (1),

Sigmoid(x) = 1 / (1 + e^(−x)) (2);

1.2 Training process of SNV-GANs:

1.2.1 Given the somatic mutation matrix A₂ (95 × 211) and the sampling ratio parameter ∝ = 0.7, randomly extract from A₂, according to ∝, a submatrix M₂ with m = 95 × 0.7 ≈ 67 samples; extract 64 submatrices M₂ in total to construct the training set X, and input X into the generative adversarial network, as shown in FIG. 4;

1.2.4 Use z of step 1.2.3) as the input of the generator to obtain generated data of size 67 × 211;

1.2.5 Calculate the generator loss value loss_G according to equation (3), and then update the parameter θ_g of the generator;

In equation (3), G(z⁽ⁱ⁾) denotes the data produced by the generator from the noise vector z⁽ⁱ⁾, and D(G(z⁽ⁱ⁾)) denotes the probability that the discriminator judges the generated data to be real data; the smaller loss_G is, the better;

1.2.6 Randomly extract a sample set x from the training set X;

1.2.7 Randomly generate a 1 × 100 Gaussian-distributed noise vector z;

1.2.8 Use z of step 1.2.7) as the input of the generator to obtain generated data of size 95 × 211;

1.2.9 Calculate the discriminator loss value loss_D according to equation (4), and then update the parameter θ_d of the discriminator;

1.2.10 Check whether the current round epoch has reached 10000; if so, stop training; otherwise return to step 1.2.3). The trained generator G(·) is finally obtained;
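The construction of the training set in step 1.2.1) can be sketched as below. The function name, sampling rows without replacement within each submatrix, and rounding 95 × 0.7 = 66.5 up to 67 with a ceiling are assumptions; the random binary matrix is only a stand-in for the real BLCA data.

```python
import math
import random


def build_training_set(matrix, ratio, n_submatrices=64, rng=None):
    """Draw n_submatrices random row subsets, each with ceil(len(matrix) * ratio) rows."""
    rng = rng or random.Random()
    m = math.ceil(len(matrix) * ratio)   # assumed rounding rule: 66.5 -> 67
    # Assumption: rows are sampled without replacement inside one submatrix
    return [rng.sample(matrix, m) for _ in range(n_submatrices)]


rng = random.Random(0)
# Stand-in for A_2: a random 95 x 211 binary matrix, not real mutation data
a2 = [[rng.randint(0, 1) for _ in range(211)] for _ in range(95)]
training_set = build_training_set(a2, ratio=0.7, rng=rng)
```

Each element of `training_set` plays the role of one submatrix M₂ of size 67 × 211.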

1.3 Data processing:

1.3.1 Randomly generate a 1 × 100 Gaussian-distributed noise vector z;

1.3.4 Take the largest sample size among the 3 cancer somatic mutation matrices, namely m_max = m₁ = 489, and insert the matrix A₂ into an augmented matrix; at this point the augmented matrix contains the 95 samples of A₂;

1.3.5 If the number of samples that still need to be added is greater than the sample number m₂ = 95 of matrix A₂, perform step 1.3.6); if it is smaller than the sample number m₂ = 95 of matrix A₂, perform step 1.3.8);

1.3.6 Randomly extract m₂ samples from A_fakedata of step 1.3.3) to form an extracted matrix, and calculate the mutation rate of each gene in the extracted matrix and in the matrix A₂, obtaining two corresponding mutation probability sets V and Q;

1.3.7 Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a divergence value; the smaller this value, the more similar the mutation rates of the extracted matrix are to those of matrix A₂. If the value is less than 0.09, insert the extracted matrix into the augmented matrix and update the sample count of the augmented matrix; otherwise repeat step 1.3.6);

1.3.8 Randomly extract one sample from A_fakedata of step 1.3.3), add it directly to the augmented matrix, and update the sample count of the augmented matrix;

1.3.9 If the sample count of the current augmented matrix equals the maximum sample size m_max = 489, sample supplementation is finished; otherwise return to step 1.3.5). A matrix whose sample count equals m_max = 489 is finally obtained; the mutation rates of the new augmented matrix are substantially identical to those of the original matrix A₂, and the augmented matrix is then used as A₂ in the subsequent steps. The pseudocode of the training process is shown in FIG. 5;
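The binarization of step 1.3.3) and the similarity check of steps 1.3.6) and 1.3.7) can be sketched as follows. The Jensen-Shannon divergence is written in its standard form with natural logarithms; normalizing the per-gene mutation rates into probability distributions and adding a small epsilon to avoid log(0) are assumptions, as are all function names.

```python
import math


def binarize(generated, threshold=0.85):
    """Step 1.3.3): values >= threshold become 1, all others 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in generated]


def mutation_rates(matrix):
    """Fraction of samples (rows) in which each gene (column) is mutated."""
    n_samples = len(matrix)
    return [sum(column) / n_samples for column in zip(*matrix)]


def _normalize(values, eps=1e-12):
    # Assumption: turn the rate vector into a probability distribution
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]


def _kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def js_divergence(v, q):
    """Standard Jensen-Shannon divergence between two rate vectors."""
    p1, p2 = _normalize(v), _normalize(q)
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * _kl(p1, mid) + 0.5 * _kl(p2, mid)


def accept_block(candidate, real, limit=0.09):
    """Step 1.3.7): accept the extracted matrix if its divergence is below the limit."""
    return js_divergence(mutation_rates(candidate), mutation_rates(real)) < limit


real = [[1, 0], [0, 1]]
same = accept_block(real, real)
```

A candidate block whose per-gene mutation rates match those of the real matrix has divergence 0 and is accepted; a block with very different rates exceeds 0.09 and is redrawn.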

2) Constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:

There are R = 3 cancer types. For each cancer type, a binary somatic mutation matrix A_r records whether each gene is mutated in each sample; it has m_r rows and n_r columns, the rows representing samples (patients) and the columns representing genes, r = 1, 2, ..., R. a_i· denotes the i-th sample in matrix A_r and a_·j denotes the j-th gene in matrix A_r. When the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, the corresponding entry a_ij is 1; otherwise it is 0. Given a gene set S of size k, the following are defined for each cancer: the submatrix of size m_r × k of the matrix A_r that corresponds to S; for each gene g in this submatrix, the set of samples in which g is mutated; the total number of samples covered by the submatrix, which measures the coverage of the gene set S; and the total of overlapping covered samples, which measures the mutual exclusivity of the gene set S;

According to these definitions, the nonlinear maximum-weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of R cancer types and a parameter K (0 < K ≤ 10), determine an m × K submatrix that maximizes the weight function W_C(S), which is defined by formula (6);

In formula (6), the per-cancer term represents the absolute weight value of the gene set S in the r-th cancer type. FIG. 1 shows the mutation matrices of three cancers with two submatrices S₁ and S₂ of size K = 3. According to the model proposed in this embodiment, the weight of S₂ is found to be higher, so S₂ better matches the commonality required of a common driver pathway;

3.1 Setting a fitness function:

Fitness(E)=W_C(M_E) (7);

3.2 Setting a selection operator:

3.3 Setting a recombination operator:

A recombination operator based on a greedy strategy is adopted, with the following steps. First, given a parent chromosome E = {e₁, e₂, ..., e_K} (e_i = 1, 2, ..., n), where e_i represents a gene number (E is therefore also referred to as a gene set), a candidate gene set is determined from it. Then one gene is randomly deleted from the gene set E to obtain the gene set E′. Finally, the optimal gene is selected from the candidate set according to the greedy strategy and added to E′, generating the final new offspring;

3.4 Setting parameters:

Input the augmented somatic mutation matrices A_r of the 3 cancers, with the number of genes g_number = 211, the parameter K = 3, the population size N = 20, the number of algorithm executions t = 10, and the maximum number of generations maxg = 1000;

3.5 Constructing an initial population:

Chromosomes are encoded in decimal. One chromosome represents one individual and encodes a solution vector of the problem. In the single-parent genetic algorithm, a set of K = 3 genes is used as a solution, namely E = {e₁, e₂, ..., e_K} (e_i = 1, 2, ..., n). An individual in the population is initialized as follows: a random sequence of the natural numbers 1 to 20 is generated, each number representing one gene in the mutation matrix, and the 20 genes are grouped in order, giving n/K = 20/3 ≈ 7 gene sets S₁, S₂, ..., S_{n/K}. The gene set S_max is determined among them, and its genes form an initial chromosome (the first K = 3 numbers are selected as an initial chromosome). N initial chromosomes generated in this way form the initial population pop₀ of size N. The fitness values of the chromosomes in pop₀ are calculated and compared, the best individual is stored in the variable best, and the initial iteration number is set to step = 0, as shown in FIG. 2;
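One possible reading of this initialization is sketched below: shuffle the gene numbers, cut the sequence into consecutive groups of K genes, and keep the group with the highest fitness as the initial chromosome. Interpreting S_max as the fittest group, dropping an incomplete trailing group so that every chromosome has exactly K genes, zero-based gene indices, and the stand-in fitness are all assumptions.

```python
import random


def initial_chromosome(n_genes, k, fitness, rng):
    """Shuffle gene indices, split into groups of k, return the fittest group."""
    order = list(range(n_genes))
    rng.shuffle(order)
    # Assumption: an incomplete trailing group (fewer than k genes) is dropped
    groups = [order[i:i + k] for i in range(0, n_genes - k + 1, k)]
    return max(groups, key=fitness)


def initial_population(n_genes, k, size, fitness, rng):
    return [initial_chromosome(n_genes, k, fitness, rng) for _ in range(size)]


rng = random.Random(3)
pop0 = initial_population(n_genes=20, k=3, size=20, fitness=sum, rng=rng)  # sum() is a stand-in
best = max(pop0, key=sum)
```

With n = 20 and K = 3 this yields six complete groups per shuffle, and `best` corresponds to the individual stored in the variable best before the first iteration.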

3.6 Performing an iterative operation:

3.6.2 For the population pop_step, first put the chromosome best with the highest fitness value in pop_step into pop_{step+1}, and then execute the roulette selection operator to select the remaining N−1 = 20−1 = 19 chromosomes and put them into pop_{step+1};

3.6.3 If step < 700 or Fitness(E′) > Fitness(E), update the chromosome E = E′; otherwise do not update and retain E. Set step = step + 1;

3.6.5 Convert the chromosome best into a gene set to obtain the submatrix M and output it; the output M is the common driver pathway S of size K = 3.

Claims

1. A pan-cancer common driver pathway identification method based on GAN sample balance, characterized by comprising the following steps:

1) Generate somatic mutation data corresponding to cancer that conforms to the real data distribution:

1.1) Set up the generative adversarial network framework:

Assume a training set with m_r samples and n_r genes. The generator network of SNV-GANs is defined as G(z), and its input is a noise vector z~norm(0,1). The generator is defined as follows:

1.1.1) The input layer uses GFC1 to map the noise vector z into a tensor zn of dimension (1,128);

1.1.2) The hidden layer passes the tensor zn from step 1.1.1) through GFC2, GFC3 and GFC4 in turn, finally mapping it into a tensor zn′ of dimension (1,1024);

1.1.3) The output layer maps the tensor zn′ to a tensor gn of dimension (1,m_r*n_r) through GFC5, and then reshapes gn into a tensor of dimension (m_r,n_r), which is the output of the generator;

The input layer and the hidden layer both use the dropout function to deactivate some neurons, and use the activation function ReLU defined by formula (1). The output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x). The input of the discriminator is either real data x~P_real or generated data, where x represents a group of somatic mutation data samples. The discriminator is defined as follows:

1.1.4) The input layer uses DFC1 to map x into a tensor xn of dimension (m_r,256);

1.1.5) The hidden layer passes the tensor xn from step 1.1.4) through DFC2, DFC3 and DFC4 in turn, finally mapping it into a tensor xn′ of dimension (m_r,16);

1.1.6) The output layer maps the tensor xn′ to a tensor dn of dimension (m_r,1) through DFC5, which is the output of the discriminator;

The input layer and the hidden layer both use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);

ReLU(x)=max(0,x) (1),

Sigmoid(x)=1/(1+e^(-x)) (2);
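The generator described in steps 1.1.1) to 1.1.3) can be sketched as a plain forward pass. This is only an illustrative sketch: the claims fix the widths 128 and 1024 and the output shape (m_r, n_r), while the intermediate widths 256 and 512, the weight initialization, and the function names `init_generator` and `generator_forward` are assumptions. The noise length 100 is taken from step 1.2.3). Dropout is omitted, as it would be when the trained generator is only used to produce data.

```python
import numpy as np

def relu(x):
    # Formula (1): ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def sigmoid(x):
    # Standard logistic function used by the output layer
    return 1.0 / (1.0 + np.exp(-x))

def init_generator(m_r, n_r, rng, hidden=(128, 256, 512, 1024), noise_dim=100):
    """Random weights for GFC1..GFC5. The widths 256 and 512 are assumed."""
    sizes = [noise_dim, *hidden, m_r * n_r]
    return [(rng.normal(0.0, 0.02, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def generator_forward(z, params, m_r, n_r):
    """Map a (1, noise_dim) noise vector to an (m_r, n_r) matrix with entries in (0, 1)."""
    h = z
    for w, b in params[:-1]:      # GFC1..GFC4, each followed by ReLU
        h = relu(h @ w + b)
    w, b = params[-1]             # GFC5 followed by Sigmoid
    gn = sigmoid(h @ w + b)       # shape (1, m_r * n_r)
    return gn.reshape(m_r, n_r)   # reshaped generator output
```

A discriminator sketch would follow the same pattern with the widths given in steps 1.1.4) to 1.1.6).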

1.2) Training process of SNV-GANs:

1.2.1) Given a somatic mutation matrix A_r (m_r×n_r) and a random sampling ratio parameter α, α<1, randomly extract from A_r a submatrix M_r of size m×n_r containing m samples, where m=m_r*α. A total of 64 submatrices M_r are extracted to construct the training set X, which is input into the generative adversarial network for training;

1.2.2) Initialize the parameters θ_d of the discriminator D(.) and the parameters θ_g of the generator G(.);

1.2.3) Set the current round epoch = 1, and randomly generate a 1×100 Gaussian-distributed noise vector z;

1.2.4) Take z from step 1.2.3) as the input of the generator to obtain generated data of size m×n_r;

1.2.5) Calculate the generator loss value loss_G according to formula (3), and then update the generator parameters θ_g:

Where G(z⁽ⁱ⁾) represents the data generated by the generator from the noise vector z⁽ⁱ⁾, D(G(z⁽ⁱ⁾)) represents the probability that the discriminator judges the generated data to be real data, and the smaller loss_G is, the better;

1.2.6) Randomly select a sample group x from the training set X;

1.2.7) Randomly generate a 1×100 Gaussian distributed noise vector z;

1.2.8) Take z from step 1.2.7) as the input of the generator to obtain generated data of size m×n_r;

1.2.9) Calculate the discriminator loss value loss_D according to formula (4), and then update the discriminator parameters θ_d:

Where D(x⁽ⁱ⁾) represents the probability that the discriminator judges the real data to be real data, 1-D(G(z⁽ⁱ⁾)) represents the probability that D judges the generated data to be generated data, and the larger loss_D is, the better;

1.2.10) Determine whether the current epoch has reached the set maximum epoch: if so, stop training and take the resulting network as the trained generator G(.); otherwise, return to step 1.2.3);
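The construction of the training set in step 1.2.1) and the two loss values can be sketched as follows. Formulas (3) and (4) are not reproduced in this text, so the loss functions below assume the standard GAN objective, which agrees with the statements that a smaller loss_G and a larger loss_D are better. Sampling rows without replacement, truncating m to an integer, and the function names are likewise assumptions.

```python
import numpy as np

def build_training_set(A_r, alpha, rng, n_groups=64):
    """Draw n_groups random row-submatrices of A_r, each with m = int(m_r * alpha) samples."""
    m_r = A_r.shape[0]
    m = int(m_r * alpha)
    return [A_r[rng.choice(m_r, size=m, replace=False)] for _ in range(n_groups)]

def generator_loss(d_fake):
    """Assumed form of formula (3): mean of log(1 - D(G(z))). Smaller is better."""
    return float(np.mean(np.log(1.0 - d_fake)))

def discriminator_loss(d_real, d_fake):
    """Assumed form of formula (4): mean of log D(x) + log(1 - D(G(z))). Larger is better."""
    return float(np.mean(np.log(d_real) + np.log(1.0 - d_fake)))
```

Here `d_real` and `d_fake` stand for the discriminator outputs on a real sample group and on generated data; both are arrays of probabilities of the same shape.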

1.3) Data processing:

1.3.1) Randomly generate a 1×100 Gaussian distributed noise vector z;

1.3.2) Input the vector z from step 1.3.1) into the trained generator G(.) to obtain the generated data G_data=G(z);

1.3.3) Set the values in G_data that are greater than or equal to 0.85 to 1 and the values less than 0.85 to 0, obtaining a new binary matrix A_fakedata;

1.3.4) Take the sample size of the cancer with the largest number of samples among the R cancer somatic mutation matrices as m_max, 0<max<R; then insert the matrix A_r into the amplification matrix. At this point, the sample size of the amplification matrix equals that of A_r;

1.3.5) If the number of samples still to be added to the amplification matrix is greater than the number of samples m_r in the matrix A_r, execute step 1.3.6); if it is less than m_r, execute step 1.3.8);

1.3.6) Randomly extract m_r samples from A_fakedata in step 1.3.3) to form an extracted matrix. Calculate the mutation rate of each gene in the extracted matrix and in the matrix A_r separately, obtaining two corresponding mutation probability sets V and Q;

1.3.7) Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a distribution value. The smaller the distribution value, the more similar the mutation rates of the extracted matrix are to those of the matrix A_r. If the distribution value is less than or equal to 0.09, insert the extracted matrix into the amplification matrix and update the sample size of the amplification matrix; otherwise, repeat step 1.3.6);
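The binarization of step 1.3.3) and the acceptance test of steps 1.3.6) and 1.3.7) can be sketched as follows. Formula (5) is not reproduced in this text, so the sketch uses the usual Jensen-Shannon divergence with natural logarithms. Normalizing the two mutation-rate vectors so that each sums to 1, the small constant `eps`, and the function names are assumptions, and the threshold 0.09 may correspond to a different logarithm base in the original formula.

```python
import numpy as np

def binarize(g_data, threshold=0.85):
    """Step 1.3.3): values >= threshold become 1, all others 0."""
    return (np.asarray(g_data) >= threshold).astype(int)

def mutation_rates(matrix):
    """Fraction of samples (rows) in which each gene (column) is mutated."""
    return np.asarray(matrix, dtype=float).mean(axis=0)

def kl_divergence(p, r):
    return float(np.sum(p * np.log(p / r)))

def js_divergence(v, q, eps=1e-12):
    """Jensen-Shannon divergence of two rate vectors after normalization (assumed)."""
    v = np.asarray(v, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    v, q = v / v.sum(), q / q.sum()
    mid = 0.5 * (v + q)
    return 0.5 * kl_divergence(v, mid) + 0.5 * kl_divergence(q, mid)

def accept_block(fake_block, A_r, max_js=0.09):
    """Step 1.3.7): accept the extracted matrix if its gene-wise rates are close to those of A_r."""
    return js_divergence(mutation_rates(fake_block), mutation_rates(A_r)) <= max_js
```

A block that reproduces the mutation rates of A_r exactly has divergence 0 and is always accepted, while completely disjoint rate profiles approach the upper bound ln 2.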

1.3.8) Randomly select a sample from A_fakedata in step 1.3.3), add it directly to the amplification matrix, and update the sample size of the amplification matrix;

1.3.9) If the current sample size of the amplification matrix equals the maximum sample size m_max, the sample supplementation is complete; otherwise, execute step 1.3.5). This finally yields an amplification matrix whose sample size equals m_max, which then replaces the matrix A_r;

2) Model CDP-HA that minimizes the dispersion between the total weights of each cancer:

Suppose there are R (R ≥ 2) cancer types. For each cancer type, a binary somatic mutation matrix A_r records whether each gene in each sample is mutated. It has m_r rows and n_r columns; the rows represent samples (patients) and the columns represent genes, r = 1, 2, 3, ..., R. a_i- denotes the i-th sample in the matrix A_r, and a_-j denotes the j-th gene in the matrix A_r. When the j-th gene of the i-th sample in the mutation matrix of the r-th cancer is mutated, the corresponding entry is 1; otherwise it is 0. Given a gene set S of size k, consider the corresponding submatrix of size m_r×k of A_r. For a gene a, the samples of this submatrix in which a is mutated form the set of samples covered by a. The total number of samples covered in the submatrix measures the coverage of gene set S, and the sum of samples with overlapping coverage measures the mutual exclusivity of gene set S;
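The two quantities named in this paragraph can be illustrated on a single binary matrix. In this sketch, coverage is the number of samples with at least one mutated gene in S, and the overlap is taken as the sum of per-gene mutated-sample counts minus the coverage. That reading of "the sum of samples with overlapping coverage" is an assumption, as are the function names; gene indices are 0-based here, whereas the text numbers genes from 1.

```python
import numpy as np

def coverage(A_r, genes):
    """Number of samples in which at least one gene of the set is mutated."""
    return int(A_r[:, genes].any(axis=1).sum())

def coverage_overlap(A_r, genes):
    """Extra hits caused by samples mutated in more than one gene of the set.

    A value of 0 means the genes are perfectly mutually exclusive.
    """
    return int(A_r[:, genes].sum()) - coverage(A_r, genes)

# Toy matrix: 4 samples (rows) x 3 genes (columns)
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 0]])
```

For the toy matrix, genes 0 and 1 cover three samples with one overlapping hit (the third sample), while genes 0 and 2 cover two samples with no overlap.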

According to the symbols and problem definitions in the previous paragraph, a nonlinear maximization weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r with m_r rows and n_r columns for the R cancer types and a parameter K, let W_C(S) be the maximum weight sum function, and determine an m×K submatrix that maximizes it. The specific formula (6) is as follows:

Where the per-cancer term in formula (6) represents the absolute weight value of gene set S in the r-th cancer type;

3) Introduce a parthenogenetic algorithm to solve the model CDP-HA:

3.1) Set the fitness function:

Given a chromosome E, let M_E denote the submatrix corresponding to the chromosome; the size of M_E is m×K. The individual fitness function Fitness(E) is defined by formula (7). The larger the value of the fitness function, the better the feasible solution;

Fitness(E)=W_C(M_E) (7);

3.2) Set the selection operator:

Roulette wheel selection and elite strategy are used to generate a new generation of population. The individuals with the highest fitness are directly inherited from the parent generation to the offspring, and then the roulette wheel selection operator is used to generate the remaining N-1 individuals.
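The selection operator can be sketched as follows. The elite individual is copied first and the remaining N-1 slots are filled by fitness-proportional sampling. Shifting the fitness values so that all roulette weights are positive is an assumption made here because the weight function may take negative values; the function name is hypothetical.

```python
import numpy as np

def select_next_generation(population, fitness_values, rng):
    """Elitism plus roulette wheel selection; returns a population of the same size."""
    fitness_values = np.asarray(fitness_values, dtype=float)
    elite = population[int(np.argmax(fitness_values))]
    # Shift so that every individual has a strictly positive selection weight (assumed).
    shifted = fitness_values - fitness_values.min() + 1e-9
    probs = shifted / shifted.sum()
    picks = rng.choice(len(population), size=len(population) - 1, p=probs)
    return [elite] + [population[i] for i in picks]
```

With this shift the worst individual is almost never drawn, which is one possible design choice; a different offset would give weaker selection pressure.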

3.3) Set the recombination operator:

A recombination operator based on a greedy strategy is used. The steps are as follows. First, given a parent chromosome E = {e₁, e₂, ..., e_k}, e_i = 1, 2, ..., n, where e_i represents a gene number (so E is also called a gene set), the candidate gene set is determined. Secondly, a gene is randomly deleted from gene set E to obtain gene set E′. Finally, the best gene is selected from the candidate set based on the greedy strategy and added to E′, which produces the new offspring.
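A minimal sketch of this greedy recombination is given below. The formula that defines the candidate gene set is not reproduced in this text, so the sketch assumes the candidates are all genes that do not appear in the parent E. The `fitness` argument is a hypothetical callable standing in for Fitness(E), and the function name is illustrative.

```python
import random

def greedy_recombine(parent, n_genes, fitness, rng):
    """Delete one random gene from the parent, then add the best candidate gene."""
    # Assumed candidate set: every gene number 1..n_genes not in the parent.
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    reduced = list(parent)
    reduced.remove(rng.choice(reduced))          # random deletion gives E'
    best_gene = max(candidates, key=lambda g: fitness(reduced + [g]))
    return reduced + [best_gene]                 # new offspring

# Toy usage with the sum of gene numbers as a stand-in fitness
child = greedy_recombine([1, 2, 3], 6, sum, random.Random(0))
```

With the toy fitness, the best candidate is always gene 6, so the child keeps two parent genes and gains gene 6.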

3.4) Set parameters:

Input the augmented somatic mutation matrices A_r of the R cancers, the number of genes g_number, the parameter k, the population size N, the number of algorithm executions t, and the maximum number of evolution generations maxg;

3.5) Construct the initial population:

Chromosomes are encoded in decimal. One chromosome represents one individual and encodes a solution vector of the problem. In the parthenogenetic algorithm, a set of K genes is used as a solution, that is, E = {e₁, e₂, ..., e_k}, e_i = 1, 2, ..., n. Individuals in the population are initialized as follows: randomly generate a set of natural numbers from 1 to n, each number representing a gene in the mutation matrix, and group the n genes in order to obtain n/k gene sets S₁, S₂, ..., S_{n/k}. Let S_max denote the gene set selected from these; the genes of S_max form an initial chromosome, and the initial population is generated by generating N initial chromosomes. Select the first K numbers as an initial chromosome to generate the initial population pop₀ with population size N, calculate the fitness values of the chromosomes in pop₀, compare them to find the best chromosome in pop₀, save the best individual to the variable best, and set the initial iteration number step = 0;
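One possible reading of this initialization is sketched below. The formula that defines S_max is not reproduced in this text, so the sketch assumes S_max is the group with the highest fitness among the consecutive groups of a random ordering. It also keeps only full groups of size k when n is not divisible by k, which is a further assumption. `fitness` is a hypothetical callable and the function names are illustrative.

```python
import random

def initial_chromosome(n_genes, k, fitness, rng):
    """Shuffle genes 1..n_genes, cut them into groups of k, return the fittest group."""
    order = list(range(1, n_genes + 1))
    rng.shuffle(order)
    # Full-size groups only; a trailing group shorter than k is dropped (assumed).
    groups = [order[i:i + k] for i in range(0, n_genes - k + 1, k)]
    return max(groups, key=fitness)              # assumed definition of S_max

def initial_population(n_genes, k, pop_size, fitness, rng):
    """Generate pop_size initial chromosomes, each from its own random ordering."""
    return [initial_chromosome(n_genes, k, fitness, rng) for _ in range(pop_size)]
```

With n = 20 and k = 3 this yields six full groups per ordering, so each chromosome always has exactly k distinct genes.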

3.6) Perform iterative operations:

3.6.1) If step>maxg, go to step 3.6.5) to obtain a common driver pathway of size K; otherwise go to step 3.6.2);

3.6.2) For the population pop_step, first put the best chromosome, which has the highest fitness value in pop_step, into pop_step+1, then execute the roulette wheel selection operator to select the remaining N-1 chromosomes and put them into pop_step+1;

3.6.3) If step<0.7*maxg or Fitness(E′)>Fitness(E), update chromosome E=E′; otherwise do not update and keep E; step=step+1;

3.6.4) Take the chromosome with the highest fitness value in pop_step+1. If the fitness value of this chromosome is greater than the fitness value of the best chromosome, update the best chromosome, that is, best = the optimal chromosome of pop_step+1;

3.6.5) The best chromosome is converted into a gene set, thereby obtaining a submatrix M, and the submatrix M is output. The output M is the common driver pathway S of size K.
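The iteration of steps 3.6.1) to 3.6.5) can be sketched end to end as follows. This is an illustrative sketch under several assumptions: `fitness` is a hypothetical callable standing in for Fitness(E), the initial chromosomes are simply the first k numbers of a random ordering, the candidate set for recombination is every gene not in the parent, the offspring E′ is produced for every chromosome after selection, and roulette weights are shifted to be positive. The function names are illustrative.

```python
import random

def greedy_child(parent, n_genes, fitness, rng):
    """Delete one random gene, then add the candidate gene that maximizes fitness."""
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    reduced = list(parent)
    reduced.remove(rng.choice(reduced))
    return reduced + [max(candidates, key=lambda g: fitness(reduced + [g]))]

def run_pga(n_genes, k, pop_size, maxg, fitness, seed=0):
    rng = random.Random(seed)
    pop = [rng.sample(range(1, n_genes + 1), k) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    step = 0
    while step <= maxg:                                   # 3.6.1) stop once step > maxg
        scores = [fitness(e) for e in pop]
        elite = pop[scores.index(max(scores))]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]        # positive roulette weights
        pop = [elite] + rng.choices(pop, weights=weights, k=pop_size - 1)  # 3.6.2)
        next_pop = []
        for e in pop:                                     # 3.6.3) acceptance rule
            child = greedy_child(e, n_genes, fitness, rng)
            accept = step < 0.7 * maxg or fitness(child) > fitness(e)
            next_pop.append(child if accept else e)
        pop = next_pop
        step += 1
        top = max(pop, key=fitness)                       # 3.6.4) update best
        if fitness(top) > fitness(best):
            best = top
    return sorted(best)                                   # 3.6.5) gene set of size k

result = run_pga(n_genes=8, k=3, pop_size=6, maxg=20, fitness=sum)
```

During the first 70% of the generations every offspring is accepted, which keeps the search exploratory; afterwards only improving offspring replace their parents, so the population settles on good gene sets.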