CN117877744A

CN117877744A - Construction method and system of auxiliary reproductive children tumor onset risk prediction model

Info

Publication number: CN117877744A
Application number: CN202311832304.XA
Authority: CN
Inventors: 虞慧婷; 王春芳; 蔡任之; 夏天; 崔欣; 陈蕾; 臧嘉捷; 钱耐思; 刘星航; 晋珊; 李琦; 张�诚; 道理; 钱晨嗣; 高雅
Original assignee: Shanghai Municipal Center For Disease Control & Prevention
Current assignee: Shanghai Municipal Center For Disease Control & Prevention
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-04-12

Abstract

The invention discloses a construction method and a construction system of an auxiliary reproductive children tumor morbidity risk prediction model, which relate to the technical field of model training and comprise the following steps: collecting original related data, and carrying out regression analysis to screen influence data; preprocessing the influence data to obtain various characteristic data sets; carrying out importance analysis on each type of characteristic data set by adopting a convolutional neural network model, and dividing characteristic variables in each type of characteristic data set according to importance; and constructing a hybrid neural network model, taking main characteristic variables and composite characteristic variables of various characteristic data sets in a training set together as input characteristic variables to perform model training on the hybrid neural network model, and using the trained hybrid neural network model. The method and the system predict the tumor onset risk of children conceived through assisted reproduction from the related data of both parents and the child, which supports prevention, reduces the pressure on families and protects the health of the children.

Description

Construction method and system of auxiliary reproductive children tumor onset risk prediction model

Technical Field

The invention relates to the technical field of model training, in particular to a method and a system for constructing an auxiliary reproductive children tumor onset risk prediction model.

Background

For children conceived through assisted reproduction, tumors not only threaten the children's lives but also impose an economic burden and mental stress on society and families. Childhood tumors are, however, treatable. Using appropriate screening tools to screen, prevent and monitor tumors in these children, so as to achieve early detection, early diagnosis and early treatment, is therefore of great public health significance.

Therefore, how to predict the tumor onset of a neonate after an assisted reproduction mother completes pregnancy is an urgent problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the invention provides a method and a system for constructing an auxiliary reproduction child tumor onset risk prediction model, which comprehensively predict the risk of child tumor onset from father side data, mother side data, child data and child disease data, achieve the technical effect of early prevention and early treatment, and reduce the economic pressure and mental pressure on families.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the construction method of the auxiliary reproduction children tumor onset risk prediction model comprises the following steps:

step 1: collecting original related data of the auxiliary reproduction children, and carrying out regression analysis to screen influence data;

step 2: preprocessing the influence data according to the data type to obtain various characteristic data sets;

step 3: carrying out importance analysis on each type of characteristic data set by adopting a convolutional neural network model, dividing characteristic variables in each type of characteristic data set into main characteristic variables and secondary characteristic variables according to importance, selecting the secondary characteristic variables to construct a composite characteristic variable, and dividing a training set and a verification set;

step 4, constructing a hybrid neural network model, and carrying out model training on the hybrid neural network model by taking main characteristic variables and composite characteristic variables of various characteristic data sets in the training set together as input characteristic variables;

step 5: feeding the verification set into the trained hybrid neural network model to verify the prediction accuracy of the model; if the accuracy meets the standard, using the hybrid neural network model, and otherwise adjusting the hybrid neural network model.

Preferably, the influence data comprises father side data, mother side data, child birth data, child postnatal disease data and auxiliary reproduction mode data.

Preferably, the step 1 specifically includes:

step 1.1: carrying out single-factor logistic regression analysis on the original related data, and screening all variables with probability values p smaller than a set threshold value;

step 1.2: performing collinearity analysis on all the screened variables, deleting the variables that exhibit collinearity to obtain the remaining variables, and taking the remaining variables as the influence data.

Preferably, the step 2 specifically includes:

step 2.1: performing data cleaning, data filling and standardization processing on the father side data and the mother side data, and combining the father side data and the mother side data with child tumor data to obtain a first type of characteristic data set;

step 2.2: carrying out data cleaning, data filling and standardization processing on the child birth data and the child postnatal disease data, and combining the child birth data and the child postnatal disease data with the child tumor data to obtain a second class characteristic data set;

step 2.3: and carrying out data cleaning, data filling and standardization treatment on the auxiliary reproduction mode data, and combining the auxiliary reproduction mode data with the child tumor data to obtain a third type of characteristic data set.

Preferably, the step 3 specifically includes:

step 3.1: constructing a convolutional neural network model, setting hyperparameters, and training and testing the convolutional neural network model by utilizing various characteristic data sets to obtain the model parameters when the iteration of the convolutional neural network model is finished;

step 3.2: calculating an importance characteristic value of each characteristic variable according to the model parameters;

step 3.3: dividing all feature variables into main feature variables and secondary feature variables according to the importance feature values;

step 3.4: processing the second type of feature data set and the third type of feature data set by adopting the methods from the step 3.1 to the step 3.3;

step 3.5: and dividing the training set and the verification set from the processed first type of feature data set, the processed second type of feature data set and the processed third type of feature data set.

Preferably, the adjusting the hybrid neural network model specifically includes:

when the model is overfitting, reducing the number of hidden layer units to reduce the complexity of the model, and when the model is underfitting, increasing the number of hidden layer units to improve the expressive capacity of the model;

regularizing the model to reduce the risk of overfitting, and comparing the performance and prediction accuracy of the adjusted model using a new data set.

Preferably, the father side data includes age at childbearing, education level, sperm quality, and reproductive system tumor history; the mother side data includes age at childbearing, education level, number of pregnancies, number of abortions, habitual abortion, reproductive system tumor history of infertility cause and gestational diseases; the child birth data includes gender, premature birth, low birth weight, birth defects, macrosomia, small for gestational age, and large for gestational age; the child postnatal disease data includes asthma, time of first asthma onset, bronchitis, digestive system diseases and number of radiotherapy sessions; the auxiliary reproduction mode data includes natural conception, artificial insemination, in vitro fertilization (test-tube baby), intracytoplasmic sperm injection and embryo freezing.

The system for constructing the auxiliary reproduction children tumor onset risk prediction model comprises the following components:

the influence data screening module is used for collecting the original related data of the assisted reproduction children and carrying out regression analysis and screening on the influence data;

the characteristic data set processing module is used for preprocessing the influence data according to the data type to obtain various characteristic data sets;

the importance dividing module is used for carrying out importance analysis on each type of characteristic data set by adopting a convolutional neural network model, dividing characteristic variables in each type of characteristic data set into main characteristic variables and secondary characteristic variables according to importance, selecting the secondary characteristic variables to construct a composite characteristic variable, and dividing a training set and a verification set;

the model training module is used for constructing a hybrid neural network model, and taking main characteristic variables and composite characteristic variables of various characteristic data sets in the training set together as input characteristic variables to carry out model training on the hybrid neural network model;

and the model adjustment module is used for feeding the verification set into the trained hybrid neural network model to verify the prediction accuracy of the model; if the accuracy meets the standard, the hybrid neural network model is used, and otherwise the hybrid neural network model is adjusted.

Compared with the prior art, the invention discloses a method and a system for constructing an auxiliary reproduction children tumor onset risk prediction model. The importance of the father side data, mother side data, child data and child disease data for comprehensively predicting the children's tumor onset risk is screened, which accelerates model training and reduces the complexity of data processing. The model is adjusted in real time, which guarantees its prediction accuracy and performance. The established model can accurately predict the tumor onset risk of children conceived through assisted reproductive technology (ART), which helps couples preparing for pregnancy make fertility decisions. The obtained representation has practical medical significance and can be used to identify the risk factors of tumors in ART children, which facilitates the prevention of childhood tumors and promotes the health of the children.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart provided by the invention.

Fig. 2 is a schematic view of ROC index provided by the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

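The embodiment of the invention discloses a method and a system for constructing an auxiliary reproductive children tumor onset risk prediction model. As shown in FIG. 1, the method comprises the following steps:

step 1: collecting original related data of the auxiliary reproduction children, and carrying out regression analysis to screen influence data;

step 2: preprocessing the influence data according to data types to obtain various characteristic data sets;

step 3: carrying out importance analysis on each type of characteristic data set by adopting a convolutional neural network model, dividing characteristic variables in each type of characteristic data set into main characteristic variables and secondary characteristic variables according to importance, selecting the secondary characteristic variables to construct a composite characteristic variable, and dividing a training set and a verification set;

step 4: constructing a hybrid neural network model, and carrying out model training on the hybrid neural network model by taking main characteristic variables and composite characteristic variables of various characteristic data sets in the training set together as input characteristic variables;

step 5: feeding the verification set into the trained hybrid neural network model to verify the prediction accuracy of the model; if the accuracy meets the standard, using the hybrid neural network model, and otherwise adjusting the hybrid neural network model.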

In a specific embodiment, the influence data comprises father side data, mother side data, child birth data, child postnatal illness data, assisted reproduction mode data.

In a specific embodiment, step 1 specifically includes:

step 1.1: carrying out single-factor logistic regression analysis on the original related data, and screening all variables with probability value p smaller than a set threshold value;

step 1.2: performing collinearity analysis on all the screened variables, deleting the variables that exhibit collinearity to obtain the remaining variables, and taking the remaining variables as influence data.

The method has the beneficial effects that the original related data is analyzed through the step 1, so that the influence data related to the risk of the child suffering from the tumor is screened, and the interference of other irrelevant factors is eliminated.
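The screening in step 1 can be illustrated with a short sketch. It is only a sketch under stated assumptions: the source does not give the p threshold or the collinearity criterion, so the values `p_threshold=0.05` and `r_threshold=0.9`, the use of a Wald test, the use of Pearson correlation as the collinearity measure, and all function and variable names are hypothetical.

```python
import math

def wald_p_value(x, y, iters=25):
    """Fit y ~ sigmoid(b0 + b1*x) by Newton-Raphson; return the two-sided Wald p-value of b1."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step (2x2 solve)
        b1 += (h00 * g1 - h01 * g0) / det
    se = math.sqrt(h00 / det)         # slope standard error from the last information matrix
    return math.erfc(abs(b1 / se) / math.sqrt(2.0))

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def screen(columns, y, p_threshold=0.05, r_threshold=0.9):
    """Step 1.1: keep variables with p < threshold. Step 1.2: drop collinear ones."""
    p_values = {name: wald_p_value(x, y) for name, x in columns.items()}
    significant = sorted((n for n in p_values if p_values[n] < p_threshold), key=p_values.get)
    kept = []
    for name in significant:
        if all(abs(pearson(columns[name], columns[k])) < r_threshold for k in kept):
            kept.append(name)
    return kept, p_values

# Toy data (hypothetical): x1 is associated with the outcome, x1_scaled is collinear
# with x1, and noise is unrelated to the outcome.
x1 = [i / 10 for i in range(-20, 20)]
y = [1 if i >= 20 else 0 for i in range(40)]
for i in (5, 15):
    y[i] = 1
for i in (25, 35):
    y[i] = 0
columns = {"x1": x1, "x1_scaled": [2 * v + 1 for v in x1],
           "noise": [(-1) ** i for i in range(40)]}
kept, p_values = screen(columns, y)
```

In this toy run only one of the two collinear variables should survive, and the unrelated variable should be screened out by the p threshold. In practice a statistics package would normally replace the hand-written fit.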

In a specific embodiment, step 2 specifically includes:

step 2.1: performing data cleaning, data filling and standardization processing on father side data and mother side data, and combining the father side data and the mother side data with child tumor data to obtain a first type of characteristic data set;

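step 2.2: carrying out data cleaning, data filling and standardization processing on the child birth data and the child postnatal illness data, and combining the child birth data and the child postnatal illness data with the child tumor data to obtain a second type of characteristic data set;

step 2.3: carrying out data cleaning, data filling and standardization processing on the auxiliary reproduction mode data, and combining the auxiliary reproduction mode data with the child tumor data to obtain a third type of characteristic data set.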

In a specific embodiment, step 3 specifically includes:

step 3.1: constructing a convolutional neural network model, setting hyperparameters, and training and testing the convolutional neural network model by utilizing various characteristic data sets to obtain the model parameters when the iteration of the convolutional neural network model is finished;

step 3.3: dividing all feature variables into main feature variables and secondary feature variables according to the importance feature values, and constructing a composite feature variable according to the secondary feature variables;

In a specific embodiment, the model parameters include: input layer-hidden layer connection weight, hidden layer-output layer connection weight, value of input layer neuron, output value of hidden layer neuron, output value of output layer neuron.

Wherein, step 3.2 specifically includes:

selecting a first quantity of convolutional neural network models with optimal performance from test results of the convolutional neural network models;

calculating the total value of the connection weights corresponding to each characteristic variable according to the input layer-hidden layer connection weights and the hidden layer-output layer connection weights corresponding to each characteristic variable;

calculating the connection weight product corresponding to each characteristic variable according to the input layer-hidden layer connection weights and the hidden layer-output layer connection weights corresponding to each characteristic variable;

calculating the influence value of the i-th neuron of the input layer on the j-th neuron of the hidden layer according to the input layer-hidden layer connection weight $w_{i\times j}$ corresponding to each characteristic variable, the value $x_i$ of the input layer neuron and the output value $h_j$ of the hidden layer neuron;

calculating the influence value of the j-th neuron of the hidden layer on the output layer neuron according to the hidden layer-output layer connection weight $v_j$ corresponding to each characteristic variable, the output value $h_j$ of the hidden layer neuron and the output value $y^{pred}$ of the output layer neuron;

Calculating the influence value of the ith neuron of the input layer corresponding to each characteristic variable on the output value by adopting the following formula:

wherein M is the number of hidden layer neurons;

according to the absolute value $|w_{i\times j}|$ of the input layer-hidden layer connection weight and the absolute value $|v_j|$ of the hidden layer-output layer connection weight corresponding to each characteristic variable, calculating their product $P_{i\times j} = |w_{i\times j}| \times |v_j|$;

calculating the importance of each characteristic variable to each neuron of the hidden layer based on the weight absolute value by adopting the following formula:

Wherein N is the number of neurons of the input layer;

the importance characteristic value of each characteristic variable to the output layer based on the weight absolute value is calculated by adopting the following formula:

wherein M is the number of hidden layer neurons.
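The weight-based measures of step 3.2 can be sketched as follows. The formulas themselves are not preserved in the text above, so this sketch is an assumption: it uses the product $P_{i\times j} = |w_{i\times j}| \times |v_j|$ that the text does state, normalizes it over the N input neurons for each hidden neuron, and sums over the M hidden neurons, which is the form of the well-known Garson algorithm. The signed connection weight product summed over hidden neurons is likewise an assumed reading. All function names are hypothetical.

```python
def garson_importance(W, v):
    """W[i][j]: weight from input i to hidden j; v[j]: weight from hidden j to output.

    Returns one importance share per input variable (shares sum to 1).
    """
    n_in, n_hid = len(W), len(v)
    # P_ij = |w_ij| * |v_j|, as stated in the text
    P = [[abs(W[i][j]) * abs(v[j]) for j in range(n_hid)] for i in range(n_in)]
    # Assumed: share of input i within hidden neuron j (normalize over the N inputs)
    col = [sum(P[i][j] for i in range(n_in)) for j in range(n_hid)]
    Q = [[P[i][j] / col[j] for j in range(n_hid)] for i in range(n_in)]
    # Assumed: sum over the M hidden neurons, then normalize across inputs
    S = [sum(row) for row in Q]
    total = sum(S)
    return [s / total for s in S]

def weight_product(W, v):
    """Assumed form of the signed connection weight product: sum_j w_ij * v_j."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

# Toy weights (hypothetical): two input variables, two hidden neurons
W = [[1.0, -2.0], [3.0, 2.0]]
v = [0.5, -1.0]
importance = garson_importance(W, v)
products = weight_product(W, v)
```

The absolute-value measure ignores the sign of a variable's effect, while the signed product keeps it, which is one reason to rank variables under several measures rather than one.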

In a specific embodiment, step 3.3 specifically includes:

calculating, according to the calculation results of step 3.2, the mean value of each of the four importance measures over all characteristic variables;

The relative importance feature value of each feature variable is calculated using the following formula:

wherein n is the characteristic variable sequence number.

sorting all characteristic variables four times by the size of their relative importance feature values, generating four relative importance ranking tables;

dividing the characteristic variables into main characteristic variables and secondary characteristic variables according to thresholds applied to the four relative importance ranking tables;

calculating, for each secondary characteristic variable, the mean of its four relative importance feature values and the proportion of that mean among all secondary characteristic variables, and carrying out a weighted summation of all secondary characteristic variables to construct the composite characteristic variable.

All variables are classified into a primary characteristic variable (i.e., a higher importance variable) and a secondary characteristic variable (i.e., a lower importance variable) according to a certain standard.

Because the output variable value is the result of the joint influence of all the characteristic variables, the secondary characteristic variables contribute less than the main characteristic variables but are not entirely useless. Therefore, the invention does not discard all secondary characteristic variables, but weakens their connection weights, so that the influence of each input characteristic variable on the output variable can be analyzed more clearly.

Two percentage thresholds are set, one larger and one smaller. In the four rankings, all characteristic variables whose relative importance exceeds the larger threshold are taken out to form four sets $D_1, D_2, D_3, D_4$, and all variables in the intersection $D_1 \cap D_2 \cap D_3 \cap D_4$ are defined as main characteristic variables. All characteristic variables whose relative importance is lower than the smaller threshold are taken out to form four sets $D_5, D_6, D_7, D_8$, and all characteristic variables in the intersection $D_5 \cap D_6 \cap D_7 \cap D_8$ are deleted. The remaining characteristic variables are all defined as secondary characteristic variables.
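A minimal sketch of this two-threshold partition, assuming each of the four rankings is available as a dictionary mapping a variable name to its relative importance. The threshold values, variable names and the function name `partition` are hypothetical.

```python
def partition(rankings, high, low):
    """Split variables into main, secondary and deleted sets.

    rankings: list of four dicts {variable: relative importance}.
    high, low: the larger and the smaller threshold (high > low).
    """
    variables = set(rankings[0])
    # Main: above the larger threshold in all four rankings
    main = set.intersection(*[{f for f in variables if r[f] > high} for r in rankings])
    # Deleted: below the smaller threshold in all four rankings
    deleted = set.intersection(*[{f for f in variables if r[f] < low} for r in rankings])
    # Everything else is secondary
    secondary = variables - main - deleted
    return main, secondary, deleted

# Toy rankings (hypothetical values)
rankings = [
    {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05},
    {"a": 0.45, "b": 0.35, "c": 0.18, "d": 0.02},
    {"a": 0.60, "b": 0.18, "c": 0.20, "d": 0.02},
    {"a": 0.40, "b": 0.30, "c": 0.27, "d": 0.03},
]
main, secondary, deleted = partition(rankings, high=0.25, low=0.06)
```

Variable `b` exceeds the larger threshold in three rankings but not in the third, so it lands in the secondary set rather than the main set; requiring agreement across all four rankings makes both the main and the deleted sets conservative.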

For each secondary characteristic variable, the mean of its four relative importance values is calculated, together with the proportion of that mean in the sum of the means over all secondary characteristic variables, where L is the number of secondary characteristic variables. According to these weight proportions, the secondary characteristic variables are weighted and summed into a new composite characteristic variable. The weight proportion of each secondary characteristic variable in the composite characteristic variable is fixed, and these weights are not trained in the subsequent training of the hybrid neural network model. The composite characteristic variable and the main characteristic variables are used as the input characteristic variables of the hybrid neural network model.

Too many input characteristic variables complicate the model and increase the number of parameters of the neural network. Constructing the composite characteristic variable simplifies the model while preserving prediction precision and making effective use of more data types: all collected characteristic data are taken into account and each characteristic variable contributes, while the connections between low-importance characteristic variables and the neural network are reduced, which reduces the number of parameters during model training and optimizes the model structure.
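The construction of the composite characteristic variable can be sketched as follows. The exact formulas are not preserved in the text, so the weight of a secondary variable is assumed here to be its mean relative importance divided by the sum of those means over all L secondary variables. The function and variable names are hypothetical.

```python
def composite_weights(rankings, secondary):
    """Fixed weight of each secondary variable: its mean relative importance
    across the rankings, divided by the sum of those means (assumed form)."""
    means = {f: sum(r[f] for r in rankings) / len(rankings) for f in secondary}
    total = sum(means.values())
    return {f: m / total for f, m in means.items()}

def composite_feature(row, weights):
    """Weighted sum of the secondary variables of one sample."""
    return sum(weights[f] * row[f] for f in weights)

# Toy example (hypothetical values): two secondary variables, four rankings
rankings = [{"b": 0.3, "c": 0.1}] * 4
weights = composite_weights(rankings, ["b", "c"])
value = composite_feature({"b": 2.0, "c": 4.0}, weights)
```

Because the weights are computed once and then held fixed, the composite variable adds a single input to the downstream network regardless of how many secondary variables it summarizes.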

In a specific embodiment, the hybrid neural network model specifically includes:

the device comprises a parallel feature extraction channel formed by a first feature extraction channel, a second feature extraction channel and a third feature extraction channel, and a fully-connected neural network connected with the parallel feature extraction channel;

the first feature extraction channel is used for inputting main feature variables and compound feature variables of a first type of feature data set and extracting numerical values of father side data and mother side data;

the second feature extraction channel is used for inputting main feature variables and compound feature variables of the second class feature data set and extracting feature vectors of child birth data and child postnatal illness data;

the third feature extraction channel is used for inputting a third type of feature data set and extracting feature vectors of auxiliary reproduction mode data;

the fully-connected neural network is used for cascading the feature vectors extracted by the parallel feature extraction channels into an input feature vector, and predicting the probability of tumor onset in the auxiliary reproduction children.
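The data flow of the parallel channels and the fully-connected head can be sketched in a few lines. This is a structural illustration only, and it makes a loud simplification: each channel is a single dense layer with a tanh activation, standing in for the channel networks of the invention, and the weights are random rather than trained. The class name `HybridSketch`, the layer sizes and the input values are hypothetical.

```python
import math
import random

def dense(x, W, b, act):
    """One dense layer: act(W @ x + b), written with plain lists."""
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def init(n_out, n_in, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class HybridSketch:
    def __init__(self, dims, hidden=4, seed=0):
        rng = random.Random(seed)
        # One feature extraction channel per type of characteristic data set
        self.channels = [init(hidden, d, rng) for d in dims]
        # Fully-connected head over the cascaded (concatenated) channel outputs
        self.head = init(1, hidden * len(dims), rng)

    def predict(self, inputs):
        cascaded = []
        for (W, b), x in zip(self.channels, inputs):
            cascaded.extend(dense(x, W, b, math.tanh))
        W, b = self.head
        return dense(cascaded, W, b, sigmoid)[0]   # value in (0, 1)

# Three inputs: parental, child, and reproduction-mode features (hypothetical values)
model = HybridSketch(dims=[3, 2, 2])
risk = model.predict([[0.2, -1.0, 0.5], [1.0, 0.0], [0.0, 1.0]])
```

The point of the sketch is the shape of the computation: each data type is encoded separately, the encodings are concatenated, and a single head maps the concatenation to one probability.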

The hybrid neural network has a complex structure and many parameters. The multi-source, multi-type data are therefore classified and processed separately, and importance analysis and feature selection are performed within each type of characteristic variable. A fully connected neural network is used to predict the target value and obtain the weights under the optimal result; each variable is then comprehensively analyzed using the total value of its connection weights, the product of its connection weights, its influence value on the output value, and its weight-based importance to the output layer. Variables of high relative importance are retained, variables of lower relative importance are weakened and integrated, and variables of the lowest relative importance are removed, which further improves prediction precision.

In a specific embodiment, the adjusting the hybrid neural network model specifically includes:

In a specific embodiment, an elastic net regularization method, which combines the L1 regularization method and the L2 regularization method, is used to regularize the model, with the following formula:

optimization objective = loss function + $\lambda_1 \sum |w| + \lambda_2 \sum w^2$, where $\lambda_1$ and $\lambda_2$ are regularization parameters that control the weights of the L1 and L2 penalties, respectively;

the L1 regularization method and the L2 regularization method are as follows:

L1 regularization: the sum of the absolute values of the weight parameters is added to the loss function to penalize the model, with the following formula:

optimization objective = loss function + $\lambda \sum |w|$, where $\lambda$ is the regularization parameter that controls the strength of regularization;

L1 regularization tends to generate a sparse weight matrix, i.e., the weights of some unimportant features are driven to 0, thereby achieving the effect of feature selection;

L2 regularization: the sum of squares of the weight parameters is added to the loss function to penalize the model, with the following formula:

optimization objective = loss function + $\lambda \sum w^2$, where $\lambda$ is the regularization parameter;

L2 regularization encourages the weight parameters to take smaller values without forcing them to be exactly 0, and therefore does not have the sparsity effect of L1 regularization.
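The three objectives above differ only in which penalty coefficients are non-zero, which a short sketch makes concrete. The function name and the numeric values are hypothetical.

```python
def regularized_objective(loss, weights, lam1=0.0, lam2=0.0):
    """loss + lam1 * sum|w| + lam2 * sum(w^2).

    lam2 = 0 gives the L1 objective, lam1 = 0 gives the L2 objective,
    and both non-zero gives the elastic net objective.
    """
    l1_penalty = sum(abs(w) for w in weights)
    l2_penalty = sum(w * w for w in weights)
    return loss + lam1 * l1_penalty + lam2 * l2_penalty

# Toy values (hypothetical)
weights = [1.0, -2.0, 0.5]
elastic = regularized_objective(1.0, weights, lam1=0.1, lam2=0.01)
l1_only = regularized_objective(1.0, weights, lam1=0.1)
l2_only = regularized_objective(1.0, weights, lam2=0.01)
```

Tuning the two coefficients trades the sparsity of the L1 term against the smooth shrinkage of the L2 term.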

The father side data and mother side data, the child birth data and child postnatal illness data, and the auxiliary reproduction mode data are taken as input and converted into a feature vector A, a feature vector B and a feature vector C, respectively. These are input into nonlinear mapping equations for feature learning, which are respectively expressed as:

where the three mappings are, respectively, the FCNN, the gated recurrent unit (GRU) network and the recurrent (cyclic) convolutional neural network; $X_A$, $X_B$ and $X_C$ are, respectively, the features of the father side data and mother side data input into the FCNN, the features of the child birth data and child postnatal illness data input into the GRU, and the features of the auxiliary reproduction mode data input into the recurrent convolutional neural network.

All the learned features are cascaded (concatenated) to obtain an integrated feature. Several cascading operations are available, and the choice depends on the specific data types to be combined. The integrated feature is then used as the input of the final prediction model, expressed as

$Y = g_V(X_A, X_B, X_C)$;

where Y represents the tumor onset probability of the assisted reproduction children obtained by the final model, and $g_V$ is a fully connected network layer.

Based on the above feature extraction, the hybrid neural network (HDNN) model provided by the invention realizes an end-to-end deep learning model expressed as

$Y = F(A, B, C)$;

where F represents the nonlinear mapping from the multi-source, multi-class inputs to the target value; that is, the HDNN model provided by the invention is a combined model formed by connecting the FCNN, the GRU and the recurrent (cyclic) convolutional neural network in parallel and then connecting them in series with $g_V$. When the HDNN model is trained as a whole, a neural network training algorithm continuously optimizes the weights W, the weights V and the thresholds of the model until the model with the best performance is obtained.

In one embodiment, the father side data includes age at childbearing, education level, sperm quality and reproductive system tumor history; the mother side data includes age at childbearing, education level, number of pregnancies, number of abortions, habitual abortion, reproductive system tumor history due to infertility, and gestational diseases; the child birth data includes gender, premature birth, low birth weight, birth defects, macrosomia, small for gestational age, and large for gestational age; the child postnatal disease data includes asthma, time of first asthma onset, bronchitis, digestive system diseases and number of radiotherapy sessions; the auxiliary reproduction mode data includes natural conception, artificial insemination, in vitro fertilization (test-tube baby), intracytoplasmic sperm injection and embryo freezing.

the importance dividing module is used for carrying out importance analysis on each type of characteristic data set by adopting a convolutional neural network model, dividing characteristic variables in each type of characteristic data set into main characteristic variables and secondary characteristic variables according to importance, and dividing a training set and a verification set;

the model training module is used for constructing a hybrid neural network model, and taking main characteristic variables and composite characteristic variables of various characteristic data sets in a training set as input characteristic variables together to carry out model training on the hybrid neural network model;

and the model adjustment module is used for feeding the verification set into the trained hybrid neural network model to verify the prediction accuracy of the model; if the accuracy meets the standard, the hybrid neural network model is used, and otherwise the hybrid neural network model is adjusted.

In one specific embodiment, the original related data of the assisted reproduction children comprises label data of the child, the father and the mother in seven dimensions, obtained from 7 major data sources, as shown in Table 1.

TABLE 1 tag data information

In a specific embodiment, in step 2, the distribution of each variable and of its missing values is explored using the frequency, number of missing values, maximum, minimum, median, mode, mean, 25th percentile, 75th percentile, skewness, kurtosis, variance and standard deviation, and each variable is processed separately according to its characteristics:

1) If the missing rate of the variable exceeds 99.9%, determining whether to reject the variable according to the business meaning of the variable. If not, filling is carried out according to the meaning of the variable.

2) Missing values of disease indicators are uniformly filled with 0, where 0 indicates that the individual with the corresponding ID does not have the disease. For example, for the variable indicating whether the child has a birth defect, missing values are uniformly filled with 0, so that the variable takes the values 0 and 1.

3) Diagnosis of outliers: the data are diagnosed by methods such as trend comparison, data distribution analysis and business logic judgment;

4) Processing and correction of abnormal data: abnormal data are corrected by direct elimination or multiple imputation.
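The missing-value rules above, together with the standardization mentioned in step 2, can be sketched as follows. Missing entries are represented by `None`. The z-score form of standardization is an assumption, since the text does not name a standardization method, and all function names are hypothetical.

```python
def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def needs_review(values, threshold=0.999):
    """Rule 1: flag a variable whose missing rate exceeds 99.9% so that its
    business meaning can be checked before it is rejected or filled."""
    return missing_rate(values) > threshold

def fill_disease_flag(values):
    """Rule 2: a missing disease indicator is treated as 'no disease' (0)."""
    return [0 if v is None else v for v in values]

def standardize(values):
    """Z-score standardization (assumed form); constant columns map to 0."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

# Toy column (hypothetical): birth-defect indicator with two missing entries
birth_defect = [1, None, None, 0]
rate = missing_rate(birth_defect)
filled = fill_disease_flag(birth_defect)
scaled = standardize([1.0, 2.0, 3.0])
```

Filling with 0 is appropriate only for indicators where absence of a record plausibly means absence of the condition; other variables need a fill chosen from their own meaning, as rule 1 notes.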

The top 15 factors obtained from the exploration of the risk factors for tumor onset in the children are shown in Table 2.

Table 2 importance information

In a specific embodiment, the construction of the model can be realized by adopting a LightGBM algorithm under the GBDT framework, so as to replace a hybrid neural network model to predict the onset risk of the tumor of the children.

(1) Histogram algorithm, which accelerates the search for split points: the original feature data are first binned (bucketed) into discrete intervals, which simplifies the data representation; the discretized data are then traversed to find the optimal split point.

(2) Leaf-wise growth strategy: at each step, Leaf-wise growth finds, among all current leaves, the leaf with the greatest splitting gain (typically also the one with the most data) and splits it. A maximum depth limit is applied to prevent overfitting while maintaining efficiency.

(3) Gradient-based one-side sampling: the samples whose gradient magnitudes are in the top a% are kept first, and then b% of the samples are randomly drawn from the remaining small-gradient samples. When computing the gain, LightGBM multiplies the sampled small-gradient samples by the constant (1-a)/b (with a and b expressed as fractions), which avoids distorting the sample distribution.
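The sampling rule in item (3) can be illustrated with a small sketch. It is an illustration of the idea rather than LightGBM's internal implementation; `a` and `b` are given as fractions, and the function name and toy gradients are hypothetical.

```python
import random

def goss_sample(gradients, a, b, seed=0):
    """Keep the top a fraction by |gradient|, randomly draw a b fraction of all
    samples from the rest, and up-weight the drawn ones by (1 - a) / b."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n, rand_n = int(a * n), int(b * n)
    top = order[:top_n]
    drawn = rng.sample(order[top_n:], rand_n)
    amplify = (1.0 - a) / b
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in drawn})
    return top, drawn, weights

# Toy gradients (hypothetical): magnitude grows with the index
gradients = [0.1 * i * (-1) ** i for i in range(20)]
top, drawn, weights = goss_sample(gradients, a=0.2, b=0.1)
```

With 20 samples, `a=0.2` and `b=0.1`, four large-gradient samples are kept and two small-gradient samples are drawn, each of the latter counting eight times so that the small-gradient population is still represented in the gain computation.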

Compared with the conventional machine learning algorithms XGBoost, GBDT and the like, the training speed is faster by introducing the LightGBM algorithm. Compared with the conventional logistic regression and neural network algorithm, the training precision is higher.

1. Faster training. LightGBM uses histogram subtraction for acceleration: throughout tree construction, the histogram of a leaf can be obtained as the difference between the histogram of its parent node and that of its sibling node, which improves speed.

2. Lower memory usage. LightGBM uses the histogram algorithm, which occupies less memory and lowers the complexity of splitting the data.

3. Higher accuracy. LightGBM adopts a leaf-wise growth strategy: at each step it finds the leaf with the largest split gain (generally also the one with the most data) among all current leaves and splits it. Repeating this cycle reduces more error and yields better accuracy.

4. Support for parallel learning. LightGBM natively supports parallel learning; feature parallelism and data parallelism are currently supported, and LightGBM optimizes both. Feature parallelism: each machine searches for the optimal split point on its own subset of features, and the optimal split points are then synchronized among the machines. Data parallelism: each machine builds histograms locally, the histograms are merged globally, and the optimal split point is finally searched on the merged histogram.

5. Ability to process large-scale data. GBDT must traverse the entire training data several times in each iteration. If the entire training data is loaded into memory at once, the size of the training data is limited by the available memory; if it is not loaded into memory, repeatedly reading and writing the training data takes a very long time. For industrial-scale data, the ordinary GBDT algorithm cannot meet the requirements. One of the main motivations for LightGBM was to solve the GBDT training problem on large data, so that large data volumes can be supported efficiently in industrial practice.

6. Direct support for categorical features. LightGBM optimizes its support for categorical features: they can be input directly, without additional encoding or one-hot 0/1 expansion, and a decision rule for categorical features is added to the decision tree algorithm. LightGBM avoids the problems caused by one-hot encoding by adopting a many-vs-many splitting scheme, which achieves optimal splitting of categorical features.
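The histogram difference in item 1 and the global histogram merge used by data parallelism in item 4 can be illustrated together. The sketch below assumes a histogram is a list of [gradient sum, sample count] pairs per bin; the helper names and the toy data are hypothetical and do not reflect LightGBM's internal data structures.

```python
def build_histogram(bin_ids, gradients, n_bins):
    """Accumulate [gradient sum, sample count] for each bin."""
    hist = [[0.0, 0] for _ in range(n_bins)]
    for b, g in zip(bin_ids, gradients):
        hist[b][0] += g
        hist[b][1] += 1
    return hist


def subtract(parent, sibling):
    """Histogram of one child = parent histogram minus sibling histogram."""
    return [[pg - sg, pc - sc] for (pg, pc), (sg, sc) in zip(parent, sibling)]


def merge(hist_a, hist_b):
    """Combine histograms built on disjoint parts of the data."""
    return [[ag + bg, ac + bc] for (ag, ac), (bg, bc) in zip(hist_a, hist_b)]


bins = [0, 1, 1, 2, 2, 2, 0, 1]
grads = [0.5, -1.0, 0.25, 1.5, -0.5, 2.0, -0.75, 1.0]
go_left = [True, True, False, False, True, False, True, False]

parent = build_histogram(bins, grads, 3)
left = build_histogram(
    [b for b, l in zip(bins, go_left) if l],
    [g for g, l in zip(grads, go_left) if l], 3)
right_direct = build_histogram(
    [b for b, l in zip(bins, go_left) if not l],
    [g for g, l in zip(grads, go_left) if not l], 3)
right_by_difference = subtract(parent, left)  # no second pass over the data
```

Only the left child is scanned; the right child's histogram comes from a per-bin subtraction, whose cost depends on the number of bins rather than the number of samples.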

Compared with other research methods, this research mainly applies a machine learning algorithm, which improves the accuracy and precision of model prediction. LightGBM is an implementation of gradient boosting trees: by combining many weak learners (decision trees), it gradually reduces the residuals of the model and thereby improves overall performance. Such an ensemble learning approach generally performs well on complex nonlinear relationships.
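The way successive weak learners shrink the residuals can be seen in a toy gradient-boosting loop. This is a minimal sketch under simplifying assumptions (squared-error loss, one feature, two-leaf trees, a hypothetical learning rate of 0.5); it is not the LightGBM implementation, and the data are invented for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every threshold leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds, lr=0.5):
    """Each round fits a stump to the current residuals and adds it to the model."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean label
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
preds, history = boost(xs, ys, n_rounds=10)  # history holds the MSE per round
```

The mean squared error recorded in `history` never increases from one round to the next, which is the residual reduction the paragraph describes.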

LightGBM can be trained efficiently using data parallelism. It also supports parallel feature splitting, which means that in each iteration it can consider splits on multiple features at the same time and thus build the tree faster. LightGBM uses a histogram algorithm to process the data, dividing the data into discrete bins during training, thereby reducing memory usage and speeding up the training process. Logistic regression typically requires linear combinations of raw features, which can lead to greater memory usage and computational complexity.
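A histogram-based split search of the kind described above might look like the following. The sketch assumes equal-width bins and a squared-error gain with unit hessians, both simplifications chosen for brevity; the function names and toy values are hypothetical.

```python
def to_bins(values, n_bins):
    """Map continuous values to equal-width bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def best_split(bin_ids, gradients, n_bins):
    """Scan bin boundaries once; return (bin index, gain) for 'bin <= index'."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bin_ids, gradients):
        grad_sum[b] += g
        count[b] += 1
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_gain = None, 0.0
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
bin_ids = to_bins(values, n_bins=4)
split_bin, split_gain = best_split(bin_ids, grads, n_bins=4)
```

The search visits at most `n_bins - 1` candidate thresholds instead of one per distinct feature value, which is where the speed and memory savings come from.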

LightGBM can handle categorical features directly, without the one-hot encoding that logistic regression requires. This helps reduce dimensionality and may in some cases improve model performance. In addition, the machine learning algorithm is more tolerant of missing values and outliers, so that subsequent applications need not spend a great deal of effort on preprocessing abnormal data.
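One common way to split a categorical feature into two groups of categories without one-hot encoding is to order the categories by their mean gradient and scan that ordering for the best cut. The sketch below illustrates this idea under simplifying assumptions (unit hessians, no regularization); the category labels and gradients are invented, and the function is not LightGBM's API.

```python
from collections import defaultdict


def best_category_partition(categories, gradients):
    """Return (set of categories sent left, gain) for a many-vs-many split."""
    stats = defaultdict(lambda: [0.0, 0])  # category -> [gradient sum, count]
    for c, g in zip(categories, gradients):
        stats[c][0] += g
        stats[c][1] += 1
    # Order categories by mean gradient, then treat them like ordered bins.
    ordered = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    total_g = sum(s[0] for s in stats.values())
    total_n = sum(s[1] for s in stats.values())
    best_left, best_gain = set(), 0.0
    left_g, left_n = 0.0, 0
    for i, c in enumerate(ordered[:-1]):
        left_g += stats[c][0]
        left_n += stats[c][1]
        right_g, right_n = total_g - left_g, total_n - left_n
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_left, best_gain = set(ordered[: i + 1]), gain
    return best_left, best_gain


cats = ["A", "A", "B", "B", "C", "C", "D", "D"]
grads = [-1.0, -1.0, -0.5, -0.5, 1.0, 1.0, 0.5, 0.5]
left_group, gain = best_category_partition(cats, grads)
```

In this toy case the best partition groups A with B against C with D, a split that a one-vs-rest rule on a single one-hot column could not express in one step.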

LightGBM automatically selects important features during tree construction, which helps to reduce the risk of overfitting and improves the generalization capability of the model.

In a specific embodiment, the modeling dataset is randomly divided into a training set, a test set and a verification set. The ROC index of each dataset after model training is shown in Table 3:

Table 3 ROC index information

Measurement index	Test set	Verification set	Training set
ROC index	0.9203	0.9242	0.9291

As the indexes show, the ROC values of the training set, test set and verification set are all greater than 0.92, indicating that the model has strong risk discrimination; the ROC values of the three sets differ by less than 0.009, indicating that the model is stable, as shown in Fig. 2.
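Assuming the "ROC index" denotes the area under the ROC curve (AUC), which the specification does not state explicitly, the metric and the stability check can be sketched as follows. The function `roc_auc` is a plain pairwise implementation written for illustration; the three reported values are copied from Table 3.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy check: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Stability check on the values reported in Table 3.
reported = {"test": 0.9203, "verification": 0.9242, "training": 0.9291}
gap = max(reported.values()) - min(reported.values())
```

The largest gap, between the training set and the test set, is 0.0088, which is within the 0.009 bound stated above.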

Tumor cases account for 3.77% of the test set overall. Among the top 5% highest-risk IDs screened by the model, the proportion with tumors is 51.06%, which is 13.56 times the overall 3.77%; the corresponding cumulative tumor capture rate is 67.80%. This indicates that the model captures IDs at risk of tumor well and can be used to screen children at risk of tumor and to promote the prevention of childhood tumors. Table 4 shows the top 5% high-risk lift table for the test data set:

Table 4 Top 5% high-risk lift table
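The quantities in the lift table (proportion of tumors in the top group, lift over the overall rate, and cumulative capture rate) can be computed as sketched below. The function name and the toy labels and scores are hypothetical; only the two percentages in `reported_ratio` come from the text. Dividing the rounded figures 51.06% by 3.77% gives about 13.5, in line with the reported 13.56 once rounding of the inputs is taken into account, and lift multiplied by the screened fraction (13.56 × 0.05 ≈ 0.678) matches the reported capture rate of 67.80%.

```python
def top_fraction_lift(labels, scores, fraction):
    """Precision, lift and capture rate for the highest-scoring fraction."""
    n = len(labels)
    k = max(1, int(round(fraction * n)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    total_pos = sum(labels)
    precision = hits / k              # share of positives in the top group
    base_rate = total_pos / n         # share of positives overall
    return {
        "precision": precision,
        "lift": precision / base_rate,
        "capture": hits / total_pos,  # share of all positives that were caught
    }


# Toy data: 20 samples, 4 positives, 3 of them among the 5 highest scores.
scores = [20 - i for i in range(20)]
labels = [1 if i in (0, 1, 3, 10) else 0 for i in range(20)]
result = top_fraction_lift(labels, scores, fraction=0.25)

# Ratio of the two rounded percentages reported for the test set.
reported_ratio = 51.06 / 3.77
```

In the toy example the top 25% has a precision of 0.6 against a base rate of 0.2, so the lift is 3 and the capture rate is 0.75, which equals lift times the screened fraction.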

The invention effectively solves the problem that traditional statistical models have difficulty processing massive, high-dimensional and sparse data and classifying and predicting effectively. It develops an ensemble-learning-based childhood tumor risk prediction model that accurately predicts the risk of childhood tumor onset and explores its important risk factors, for guiding clinical practice and realizing early prevention of disease.

In this specification, the embodiments are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The construction method of the auxiliary reproduction children tumor onset risk prediction model is characterized by comprising the following steps:

2. The method for constructing a model for predicting tumor onset risk of assisted reproduction children according to claim 1, wherein the influence data includes father side data, mother side data, child birth data, child postnatal disease data, and assisted reproduction mode data.

3. The method for constructing a model for predicting the risk of developing a tumor in an assisted reproduction child according to claim 1, wherein the step 1 specifically comprises:

4. The method for constructing a model for predicting the risk of developing a tumor in an assisted reproduction child according to claim 2, wherein the step 2 specifically comprises:

5. The method for constructing a model for predicting the risk of developing a tumor in an assisted reproduction child according to claim 4, wherein the step 3 specifically comprises:

6. The method for constructing an assisted reproductive childhood tumor onset risk prediction model according to claim 1, wherein the adjusting the hybrid neural network model specifically comprises:

7. The method for constructing a model for predicting the risk of developing a tumor in an assisted reproduction child according to claim 2, wherein the father side data comprise age at childbearing, education level, sperm quality and history of reproductive system tumors; the mother side data comprise age at childbearing, education level, number of pregnancies, number of abortions, habitual abortion, cause of infertility, history of reproductive system tumors and gestational diseases; the child birth data comprise gender, premature birth, low birth weight, birth defects, macrosomia, small for gestational age and large for gestational age; the child postnatal disease data comprise asthma, time of first asthma onset, bronchitis, digestive system diseases and number of radiotherapy sessions; the assisted reproduction mode data comprise natural conception, artificial insemination, in vitro fertilization, intracytoplasmic sperm injection and embryo freezing.

8. The system for constructing the auxiliary reproduction children tumor onset risk prediction model is characterized by comprising the following steps: