Disclosure of Invention
The present application aims to overcome the defects of the prior art by providing a disease risk prediction method, device, computer equipment and computer-readable storage medium based on the ruminococcus microbiota, so as to solve at least the problems in the related art that invasive detection is inefficient, has low accuracy and causes traumatic pain to the person tested. To achieve the above object, the present application adopts the following technical scheme:
In a first aspect, the present invention provides a method for predicting disease risk based on the ruminococcus microbiota, comprising:
acquiring relative abundance information from stool sample metagenomic data of a disease population and a healthy population;
determining feature data of the intestinal flora according to the relative abundance information and pre-screened biomarkers of the disease, wherein the biomarkers are pre-screened according to a literature review and historical relative abundance information of differential bacteria, and the historical relative abundance information of the differential bacteria is obtained by performing difference analysis on historical relative abundance information of the disease population and of the healthy population;
determining classification variables of the disease population and the healthy population;
inputting the feature data of the intestinal flora and the classification variables into a pre-established machine learning model for training to obtain a disease risk prediction model;
predicting the disease risk by using the disease risk prediction model;
wherein the disease includes inflammation, atherosclerosis, tumors, hypertension, diabetes, infection;
wherein the biomarker is a ruminococcus microbiota;
wherein the classification variables include gender, age, antibiotic usage, smoking status, smoking history and country;
the machine learning model comprises a random forest model, a decision tree model and an Adaboost model.
In some of these embodiments, the machine learning model is a random forest model.
In some of these embodiments, the inflammation comprises bronchitis, cystitis, otitis, pneumonia.
In some embodiments, obtaining relative abundance information of stool sample metagenomic data for a disease population and a healthy population comprises:
acquiring stool sample metagenomic data of the disease population and the healthy population;
and performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data to obtain the relative abundance information of the disease population and the healthy population.
In some of these embodiments, obtaining stool sample metagenomic data for the diseased and healthy population comprises:
obtaining taxonomic classification and microbiota metagenomic information of human microbiome samples from the metagenomic data repository in the ExperimentHub R library by using the curatedMetagenomicData package;
screening and downloading sample metagenomic data and general sample information for samples derived from stool, wherein the sample metagenomic data comprise a microbiota taxonomic profile and microbiota relative abundances, and the general sample information comprises the experimental protocol, disease state, age, gender, antibiotic usage, region (or country), smoking status and smoking history.
In some of these embodiments, performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data comprises:
standardizing the names of the ruminococcus microbiota in the stool sample metagenomic data according to the taxonomy of the National Center for Biotechnology Information (NCBI).
In some embodiments, after the standardized naming is performed, the method further comprises:
pooling the abundances of the ruminococcus microbiota from different studies.
In some of these embodiments, further comprising:
standardizing the names of the disease categories and merging them by using Medical Subject Headings (MeSH).
In some embodiments, before inputting the characteristic data of the intestinal flora into a pre-established machine learning model for training to obtain a disease risk prediction model, the method further includes:
screening the characteristic data of the intestinal flora to identify sample data in which the abundances of all ruminococcus microbiota are 0;
deleting, from the characteristic data of the intestinal flora, the sample data in which the abundances of all ruminococcus microbiota are 0;
and storing the characteristic data of the intestinal flora after the deletion.
In some embodiments, before inputting the characteristic data of the intestinal flora into a pre-established machine learning model for training to obtain a disease risk prediction model, the method further includes:
performing dummy variable transformation on the classification variables by using IBM SPSS Statistics 23.0;
and storing the classification variables after the dummy variable transformation.
In some of these embodiments, predicting a disease risk using a disease risk prediction model comprises:
adjusting parameters of the disease risk prediction model by using a grid search algorithm;
testing the disease risk prediction model after parameter adjustment by using the test data;
according to the test result, performing performance evaluation on the disease risk prediction model by using a confusion matrix;
and performing disease risk prediction by using the disease risk prediction model that has passed the performance evaluation.
In some of these embodiments, the gender variable has 3 categories: NA, male and female.
In some of these embodiments, the age variable has 4 categories: newborn, child, adult and elderly.
In some of these embodiments, the antibiotic usage variable has 3 categories: NA, used and not used.
In some of these embodiments, the smoking status variable has 3 categories: NA, present and absent.
In some of these embodiments, the smoking history variable has 3 categories: NA, present and absent.
In some of these embodiments, the country variable has 9 categories: NA, Canada, China, the Netherlands, Israel, Finland, Russia, Sweden and the USA.
In some of these embodiments, IBM SPSS Statistics 23.0 does not perform dummy variable transformation on the predicted variable, wherein the predicted variable is the disease category.
In some of these embodiments, the disease category variable has 8 categories: healthy, inflammation, atherosclerosis, tumor, hypertension, diabetes, infection and others.
In some of these embodiments, the pre-screened biomarkers of disease comprise:
Ruminococcus_gnavus;
Ruminococcus_obeum;
Ruminococcus_torques;
Ruminococcus_albus;
Ruminococcus_bromii;
Ruminococcus_callidus;
Ruminococcus_champanellensis;
Ruminococcus_flavefaciens;
Ruminococcus_lactaris;
Ruminococcaceae_Faecalibacterium_prausnitzii;
Ruminococcaceae_bacterium_D16;
Ruminococcaceae;
Ruminococcus.
In a second aspect, the present invention provides a ruminococcus microbiota-based disease risk prediction device comprising:
the data acquisition module is used for acquiring the relative abundance information of the metagenome data of the stool samples of the disease population and the health population;
the characteristic data determining module is used for determining the characteristic data of the intestinal flora according to the relative abundance information and the pre-screened biomarkers of the diseases, wherein the biomarkers of the diseases are pre-screened according to literature review and the historical information of the relative abundance of the differential bacteria, and the historical information of the relative abundance of the differential bacteria is obtained by performing difference analysis on the historical information of the relative abundance of the disease population and the historical information of the relative abundance of the healthy population;
the classification variable determination module is used for determining classification variables of the disease population and the healthy population;
the model training module is used for inputting the intestinal flora characteristic data and the classification variables into a pre-established machine learning model for training to obtain a disease risk prediction model;
the risk prediction module is used for predicting the disease risk by utilizing the disease risk prediction model;
wherein the disease includes inflammation, atherosclerosis, tumors, hypertension, diabetes, infection;
wherein the biomarker is a ruminococcus microbiota;
wherein the classification variables include gender, age, antibiotic usage, smoking status, smoking history and country;
the machine learning model comprises a random forest model, a decision tree model and an Adaboost model.
In some of these embodiments, further comprising:
the parameter adjusting module is used for adjusting parameters of the disease risk prediction model by utilizing a grid search algorithm;
the test module is used for testing the disease risk prediction model after parameter adjustment by using test data;
the performance evaluation module is used for carrying out performance evaluation on the disease risk prediction model by using the confusion matrix according to the test result;
and the risk prediction module is also used for predicting the disease risk by using the disease risk prediction model qualified by the performance evaluation.
In some of these embodiments, further comprising:
and the data cleaning module is used for screening the characteristic data of the intestinal flora to identify sample data in which the abundances of all ruminococcus microbiota are 0, deleting that sample data from the characteristic data of the intestinal flora, and storing the characteristic data of the intestinal flora after the deletion.
In some of these embodiments, further comprising:
and the variable transformation module is used for performing dummy variable transformation on the classification variables by using IBM SPSS Statistics 23.0 and storing the classification variables after the dummy variable transformation.
In some of these embodiments, the data acquisition module comprises:
the metagenomic data acquisition submodule is used for acquiring stool sample metagenomic data of the disease population and the healthy population;
and the annotation analysis submodule is used for performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data to obtain the relative abundance information of the disease population and the healthy population.
In some of these embodiments, the pre-screened biomarkers of disease comprise:
Ruminococcus_gnavus;
Ruminococcus_obeum;
Ruminococcus_torques;
Ruminococcus_albus;
Ruminococcus_bromii;
Ruminococcus_callidus;
Ruminococcus_champanellensis;
Ruminococcus_flavefaciens;
Ruminococcus_lactaris;
Ruminococcaceae_Faecalibacterium_prausnitzii;
Ruminococcaceae_bacterium_D16;
Ruminococcaceae;
Ruminococcus.
In a third aspect, the present invention provides a use of the ruminococcus microbiota in disease risk prediction.
In some embodiments thereof, the ruminococcus microbiota comprises:
Ruminococcus_gnavus;
Ruminococcus_obeum;
Ruminococcus_torques;
Ruminococcus_albus;
Ruminococcus_bromii;
Ruminococcus_callidus;
Ruminococcus_champanellensis;
Ruminococcus_flavefaciens;
Ruminococcus_lactaris;
Ruminococcaceae_Faecalibacterium_prausnitzii;
Ruminococcaceae_bacterium_D16;
Ruminococcaceae;
Ruminococcus.
In some embodiments, the disease comprises inflammation, atherosclerosis, tumor, hypertension, diabetes and infection.
In a fourth aspect, the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the disease risk prediction method as described above when executing the computer program.
In a fifth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a disease risk prediction method as described above.
Compared with the related art, the disease risk prediction method, device, computer equipment and computer-readable storage medium based on the ruminococcus microbiota provided by the embodiments of the application can predict the risk of multiple diseases from the ruminococcus microbiota. The sample is simple to obtain, and the detection is non-invasive and causes no trauma to the person tested. A random forest model is used to screen, from complex and varied biological big data, non-invasive biomarkers for predicting the risk of multiple diseases, which improves prediction accuracy and fills a gap in the clinical early warning of different diseases. The prediction method is simple, fast and efficient, and can quickly guide or assist the person tested in taking follow-up measures.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words used in this application do not denote a limitation of quantity, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like used in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first," "second," "third," and the like used herein merely distinguish similar objects and do not denote a particular ordering of the objects.
Example 1
The invention provides a use of the ruminococcus microbiota in disease risk prediction.
In some embodiments thereof, the ruminococcus microbiota comprises:
Ruminococcus_gnavus;
Ruminococcus_obeum;
Ruminococcus_torques;
Ruminococcus_albus;
Ruminococcus_bromii;
Ruminococcus_callidus;
Ruminococcus_champanellensis;
Ruminococcus_flavefaciens;
Ruminococcus_lactaris;
Ruminococcaceae_Faecalibacterium_prausnitzii;
Ruminococcaceae_bacterium_D16;
Ruminococcaceae;
Ruminococcus.
In some embodiments, the disease comprises inflammation, atherosclerosis, tumor, hypertension, diabetes and infection.
Fig. 1 is a first flowchart of a disease risk prediction method according to an embodiment of the present invention. As shown in Fig. 1, a method for predicting disease risk based on the ruminococcus microbiota comprises:
step S102, acquiring relative abundance information from stool sample metagenomic data of the disease population and the healthy population;
step S104, determining feature data of the intestinal flora according to the relative abundance information and pre-screened biomarkers of the diseases, wherein the biomarkers of the diseases are pre-screened according to literature review and the historical information of the relative abundance of the differential bacteria, and the historical information of the relative abundance of the differential bacteria is obtained by performing difference analysis on the historical information of the relative abundance of the disease population and the historical information of the relative abundance of the health population;
step S106, determining classification variables of the disease population and the healthy population;
step S108, inputting the characteristic data and the classification variables of the intestinal flora into a pre-established machine learning model for training to obtain a disease risk prediction model;
step S110, predicting the disease risk by using a disease risk prediction model;
wherein the disease includes inflammation, atherosclerosis, tumor, hypertension, diabetes, infection;
wherein the intestinal flora is the ruminococcus microbiota;
wherein the classification variables comprise sex, age, antibiotic usage, smoking status, smoking history, country;
the machine learning model comprises a random forest model, a decision tree model and an Adaboost model.
Wherein the pre-screened biomarkers of the disease comprise:
Ruminococcus_gnavus;
Ruminococcus_obeum;
Ruminococcus_torques;
Ruminococcus_albus;
Ruminococcus_bromii;
Ruminococcus_callidus;
Ruminococcus_champanellensis;
Ruminococcus_flavefaciens;
Ruminococcus_lactaris;
Ruminococcaceae_Faecalibacterium_prausnitzii;
Ruminococcaceae_bacterium_D16;
Ruminococcaceae;
Ruminococcus.
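The feature determination of step S104 can be illustrated with a minimal sketch. It assumes that the relative abundance information is available as one dictionary per sample mapping taxon name to relative abundance, and that a biomarker missing from a sample counts as 0; the function name `extract_features` and the sample data are hypothetical and are not taken from the embodiment.

```python
# Biomarker names as listed in this embodiment, in a fixed column order.
BIOMARKERS = [
    "Ruminococcus_gnavus",
    "Ruminococcus_obeum",
    "Ruminococcus_torques",
    "Ruminococcus_albus",
    "Ruminococcus_bromii",
    "Ruminococcus_callidus",
    "Ruminococcus_champanellensis",
    "Ruminococcus_flavefaciens",
    "Ruminococcus_lactaris",
    "Ruminococcaceae_Faecalibacterium_prausnitzii",
    "Ruminococcaceae_bacterium_D16",
    "Ruminococcaceae",
    "Ruminococcus",
]


def extract_features(abundance_by_sample):
    """Return {sample_id: [abundance of each biomarker, in BIOMARKERS order]}.

    Assumption: a biomarker that is absent from a profile is treated as 0.0.
    """
    return {
        sample_id: [float(profile.get(name, 0.0)) for name in BIOMARKERS]
        for sample_id, profile in abundance_by_sample.items()
    }


# Hypothetical relative-abundance profiles for two samples.
profiles = {
    "sample_A": {"Ruminococcus_gnavus": 1.2, "Bacteroides_vulgatus": 30.5},
    "sample_B": {"Ruminococcus_bromii": 0.8, "Ruminococcus": 2.0},
}
features = extract_features(profiles)
```

Taxa outside the biomarker list (here `Bacteroides_vulgatus`) are simply ignored, so every sample yields a vector of the same length.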
In some embodiments, obtaining relative abundance information of stool sample metagenomic data for a disease population and a healthy population comprises:
acquiring stool sample metagenomic data of the disease population and the healthy population;
and performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data to obtain the relative abundance information of the disease population and the healthy population.
In some of these embodiments, obtaining stool sample metagenomic data for the diseased and healthy population comprises:
obtaining taxonomic classification and microbiota metagenomic information of human microbiome samples from the metagenomic data repository in the ExperimentHub R library by using the curatedMetagenomicData package;
screening and downloading sample metagenomic data and general sample information for samples derived from stool, wherein the sample metagenomic data comprise a microbiota taxonomic profile and microbiota relative abundances, and the general sample information comprises the experimental protocol, disease state, age, gender, antibiotic usage, region (or country), smoking status and smoking history.
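The screening of stool-derived samples can be sketched as follows. The sketch assumes the general sample information has already been exported from R as a list of records, and that the sampling site is stored in a field named `body_site` with the value `stool`; both the field name and the example records are assumptions made for illustration.

```python
def select_stool_samples(metadata_rows, site_field="body_site", stool_value="stool"):
    """Keep only the metadata records whose sampling site is stool.

    The comparison is case-insensitive; records without the site field are dropped.
    """
    return [
        row
        for row in metadata_rows
        if str(row.get(site_field, "")).lower() == stool_value
    ]


# Hypothetical metadata records; field names and values are illustrative only.
rows = [
    {"sample_id": "s1", "body_site": "stool", "disease": "healthy"},
    {"sample_id": "s2", "body_site": "oralcavity", "disease": "healthy"},
    {"sample_id": "s3", "body_site": "Stool", "disease": "diabetes"},
]
stool_rows = select_stool_samples(rows)
```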
In some of these embodiments, performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data comprises:
standardizing the names of the ruminococcus microbiota in the stool sample metagenomic data according to the taxonomy of the National Center for Biotechnology Information (NCBI);
pooling the abundances of the ruminococcus microbiota from different studies.
In some of these embodiments, further comprising:
standardizing the names of the disease categories and merging them by using Medical Subject Headings (MeSH).
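The name standardization and pooling described above can be sketched as below. The synonym table is hypothetical and covers only a few names; in practice it would be derived from the NCBI taxonomy. The sketch also assumes that, when one sample reports the same taxon under two names, the two abundances are summed, which the embodiment does not specify.

```python
from collections import defaultdict

# Hypothetical synonym table mapping study-specific names to standardized names.
TAXON_SYNONYMS = {
    "[Ruminococcus] gnavus": "Ruminococcus_gnavus",
    "Ruminococcus gnavus": "Ruminococcus_gnavus",
    "Ruminococcus bromii": "Ruminococcus_bromii",
}


def standardize(name):
    """Map a raw taxon name to its standardized form (spaces become underscores)."""
    return TAXON_SYNONYMS.get(name, name.replace(" ", "_"))


def pool_studies(studies):
    """Merge per-study tables {sample_id: {taxon: abundance}} into one table.

    Samples are keyed by (study_name, sample_id) so that identical sample ids
    from different studies do not collide.
    """
    pooled = {}
    for study_name, table in studies.items():
        for sample_id, profile in table.items():
            merged = defaultdict(float)
            for taxon, abundance in profile.items():
                merged[standardize(taxon)] += abundance
            pooled[(study_name, sample_id)] = dict(merged)
    return pooled


# Hypothetical input from two studies that name the same species differently.
studies = {
    "study_1": {"a": {"[Ruminococcus] gnavus": 1.0, "Ruminococcus gnavus": 0.5}},
    "study_2": {"a": {"Ruminococcus bromii": 2.0}},
}
pooled = pool_studies(studies)
```

The same mapping approach could be applied to disease category names, with a table of MeSH terms in place of the taxon synonyms.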
Through the above steps, the risk of multiple diseases can be predicted from the ruminococcus microbiota. The sample is simple to obtain, and the detection causes no trauma to the person tested. A random forest model is used to screen, from complex and varied biological big data, non-invasive biomarkers for predicting the risk of multiple diseases, which improves prediction accuracy and fills a gap in the clinical early warning of different diseases. The prediction method is simple, fast and efficient, and can quickly guide or assist the person tested in taking follow-up measures.
Fig. 2 is a second flowchart of a disease risk prediction method according to an embodiment of the present invention. As shown in Fig. 2, before inputting the characteristic data of the intestinal flora into a pre-established machine learning model for training to obtain a disease risk prediction model, the method further includes:
step S202, screening the characteristic data of the intestinal flora to identify sample data in which the abundances of all ruminococcus microbiota are 0;
step S204, deleting, from the characteristic data of the intestinal flora, the sample data in which the abundances of all ruminococcus microbiota are 0;
and step S206, storing the characteristic data of the intestinal flora after the deletion.
Specifically, because an abundance value of 0 may reflect either a systematic error or a true absence, only the samples in which the abundances of all ruminococcus microbiota are simultaneously 0 are deleted.
Through the above steps, the intestinal flora characteristic data are cleaned, which addresses the problems of data redundancy, missing values and abnormal values.
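Steps S202 to S206 can be illustrated with a short sketch. It assumes the characteristic data are held as one list of biomarker abundances per sample; the function name `drop_all_zero_samples` and the example values are hypothetical.

```python
def drop_all_zero_samples(features):
    """Split samples into those kept and those removed.

    A sample is removed only when every biomarker abundance is 0; a sample
    with at least one non-zero abundance is kept unchanged.
    """
    kept = {
        sample_id: values
        for sample_id, values in features.items()
        if any(value != 0 for value in values)
    }
    removed = sorted(set(features) - set(kept))
    return kept, removed


# Hypothetical feature data: three biomarker abundances per sample.
raw_features = {
    "s1": [0.0, 1.3, 0.0],
    "s2": [0.0, 0.0, 0.0],  # all zero, so it is removed
    "s3": [0.2, 0.0, 0.0],
}
cleaned, removed_ids = drop_all_zero_samples(raw_features)
```

Returning the removed sample ids alongside the cleaned data makes it easy to report how many samples the cleaning step discarded.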
Fig. 3 is a third flowchart of a disease risk prediction method according to an embodiment of the present invention. As shown in Fig. 3, before inputting the characteristic data of the intestinal flora into a pre-established machine learning model for training to obtain a disease risk prediction model, the method further includes:
step S302, performing dummy variable transformation on the classification variables by using IBM SPSS Statistics 23.0;
and step S304, storing the classification variables after the dummy variable transformation.
In some of these embodiments, IBM SPSS Statistics 23.0 does not perform dummy variable transformation on the predicted variable, wherein the predicted variable is the disease category.
In some of these embodiments, the gender variable has 3 categories: NA, male and female.
In some of these embodiments, the age variable has 4 categories: newborn, child, adult and elderly.
In some of these embodiments, the antibiotic usage variable has 3 categories: NA, used and not used.
In some of these embodiments, the smoking status variable has 3 categories: NA, present and absent.
In some of these embodiments, the smoking history variable has 3 categories: NA, present and absent.
In some of these embodiments, the country variable has 9 categories: NA, Canada, China, the Netherlands, Israel, Finland, Russia, Sweden and the USA.
In some of these embodiments, the disease category variable has 8 categories: healthy, inflammation, atherosclerosis, tumor, hypertension, diabetes, infection and others.
Some fields in the data are unordered binary or multi-category variables, and the numeric codes assigned to their categories would otherwise influence the model. Through the above steps, IBM SPSS Statistics 23.0 is used to perform dummy variable transformation on the classification variables other than the predicted variable, which removes this influence and improves the goodness of fit of the model.
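The embodiment performs this transformation in IBM SPSS Statistics 23.0. As an illustration only, the same idea can be sketched in plain Python. The sketch assumes one 0/1 indicator per category with no reference category dropped, which may differ from the SPSS output; the field names such as `antibiotic_use` are hypothetical.

```python
# Category levels as enumerated in this embodiment; dictionary keys are hypothetical.
CATEGORIES = {
    "gender": ["NA", "male", "female"],
    "age": ["newborn", "child", "adult", "elderly"],
    "antibiotic_use": ["NA", "used", "not used"],
    "smoking_status": ["NA", "present", "absent"],
    "smoking_history": ["NA", "present", "absent"],
    "country": ["NA", "Canada", "China", "Netherlands", "Israel",
                "Finland", "Russia", "Sweden", "USA"],
}


def dummy_encode(record, categories=CATEGORIES):
    """Turn each classification variable into 0/1 indicator columns.

    The predicted variable (disease category) is deliberately not part of
    `categories`, so it is left untouched.
    """
    encoded = {}
    for variable, levels in categories.items():
        value = record[variable]
        if value not in levels:
            raise ValueError(f"unknown level {value!r} for {variable}")
        for level in levels:
            encoded[f"{variable}={level}"] = 1 if value == level else 0
    return encoded


# Hypothetical sample record.
record = {
    "gender": "male",
    "age": "adult",
    "antibiotic_use": "not used",
    "smoking_status": "NA",
    "smoking_history": "absent",
    "country": "China",
}
encoded = dummy_encode(record)
```

With the category counts listed above (3, 4, 3, 3, 3 and 9), each sample yields 25 indicator columns, exactly one of which is set per variable.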
Fig. 4 is a fourth flowchart of a disease risk prediction method according to an embodiment of the present invention. As shown in Fig. 4, the disease risk prediction using the disease risk prediction model includes:
step S402, adjusting parameters of the disease risk prediction model by using a grid search algorithm;
step S404, testing the disease risk prediction model after parameter adjustment by using the test data;
step S406, according to the test result, performing performance evaluation on the disease risk prediction model by using a confusion matrix;
and step S408, predicting the disease risk by using the disease risk prediction model qualified in performance evaluation.
Through the steps, the disease risk prediction model is optimized, and the prediction accuracy is improved.
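Steps S402 to S406 can be sketched generically as follows. The sketch does not train a random forest; the `evaluate` callback stands in for "train with these parameters and score on held-out data", and the parameter names `n_trees` and `max_depth`, the toy scoring function and the labels are all assumptions made for illustration.

```python
from itertools import product


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical test labels and predictions for two classes.
labels = ["healthy", "inflammation"]
y_true = ["healthy", "healthy", "inflammation", "inflammation", "inflammation"]
y_pred = ["healthy", "inflammation", "inflammation", "inflammation", "healthy"]
matrix = confusion_matrix(y_true, y_pred, labels)


def toy_score(params):
    # Toy stand-in for a validation score; peaks at n_trees=200, max_depth=8.
    return -abs(params["n_trees"] - 200) - abs(params["max_depth"] - 8)


best_params, best_score = grid_search(
    {"n_trees": [100, 200, 500], "max_depth": [4, 8]}, toy_score
)
```

In a real pipeline the callback would fit the model on training data and return a metric derived from the confusion matrix, such as the accuracy computed above.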
Fig. 5 is a block diagram of a disease risk prediction apparatus according to an embodiment of the present invention. As shown in Fig. 5, a ruminococcus microbiota-based disease risk prediction apparatus 500 includes:
the data acquisition module 501 is used for acquiring the relative abundance information of the metagenome data of the stool samples of the disease population and the healthy population;
a characteristic data determination module 502, configured to determine characteristic data of the intestinal flora according to the relative abundance information and pre-screened biomarkers of the disease, where the biomarkers of the disease are pre-screened according to review of literature and historical information of relative abundance of differential bacteria, and the historical information of relative abundance of differential bacteria is obtained by performing difference analysis on the historical information of relative abundance of the disease population and the historical information of relative abundance of the healthy population;
a classification variable determination module 503 for determining classification variables of the disease population and the healthy population;
the model training module 504 is used for inputting the characteristic data of the intestinal flora and the classification variables into a pre-established machine learning model for training to obtain a disease risk prediction model;
a risk prediction module 508 for predicting a disease risk using the disease risk prediction model;
wherein the disease includes inflammation, atherosclerosis, tumor, hypertension, diabetes, infection;
wherein the intestinal flora is the ruminococcus microbiota;
wherein the classification variables comprise sex, age, antibiotic usage, smoking status, smoking history, country;
the machine learning model comprises a random forest model, a decision tree model and an Adaboost model.
In some of these embodiments, the data acquisition module 501 includes:
the metagenomic data acquisition submodule is used for acquiring stool sample metagenomic data of the disease population and the healthy population;
and the annotation analysis submodule is used for performing species annotation analysis and functional annotation analysis on the stool sample metagenomic data to obtain the relative abundance information of the disease population and the healthy population.
In some of these embodiments, the disease risk prediction device 500 further comprises:
a parameter adjusting module 505, configured to perform parameter adjustment on the disease risk prediction model by using a grid search algorithm;
a testing module 506, configured to test the disease risk prediction model after parameter adjustment by using the test data;
the performance evaluation module 507 is used for evaluating the performance of the disease risk prediction model by using the confusion matrix according to the test result;
the risk prediction module 508 is further configured to perform disease risk prediction using the disease risk prediction model qualified by performance evaluation.
The data cleaning module 509 is configured to screen the characteristic data of the intestinal flora to obtain sample data with an abundance of all ruminococcus microbiota of 0, delete the sample data with an abundance of all ruminococcus microbiota of 0 from the characteristic data of the intestinal flora, and store the characteristic data of the intestinal flora after data deletion.
In some of these embodiments, the disease risk prediction device 500 further comprises:
and the variable transformation module 510 is configured to perform dummy variable transformation on classification variables in the intestinal flora feature data by using IBM SPSS statistics 23.0, and store the intestinal flora feature data after the dummy variable transformation, wherein the classification variables include gender, age, antibiotic usage, smoking condition, smoking history, and country.
In addition, the disease risk prediction method of the embodiment of the present application may be implemented by a computer device. Components of the computer device may include, but are not limited to, a processor and a memory storing computer program instructions.
In some embodiments, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
In some embodiments, the memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a non-volatile memory. In particular embodiments, the memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Out Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the disease risk prediction methods in the above embodiments.
In some of these embodiments, the computer device may also include a communication interface and a bus. The processor, the memory and the communication interface are connected through a bus and complete mutual communication.
The communication interface is used for realizing communication among the modules, devices, units and/or equipment in the embodiments of the application. The communication interface may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
A bus comprises hardware, software, or both that couple components of a computer device to one another. Buses include, but are not limited to, at least one of the following: data Bus (Data Bus), Address Bus (Address Bus), Control Bus (Control Bus), Expansion Bus (Expansion Bus), and Local Bus (Local Bus). By way of example, and not limitation, a Bus may include an Accelerated Graphics Port (AGP) or other Graphics Bus, an Enhanced Industry Standard Architecture (EISA) Bus, a Front-Side Bus (FSB), a Hyper Transport (HT) Interconnect, an ISA (ISA) Bus, an InfiniBand (InfiniBand) Interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a microchannel Architecture (MCA) Bus, a PCI-Express (PCI-X) Bus, a Serial Advanced Technology Attachment (SATA) Bus, abbreviated VLB) bus or other suitable bus or a combination of two or more of these. A bus may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The computer device may perform the disease risk prediction method in the embodiments of the present application.
In addition, in combination with the disease risk prediction method in the above embodiments, the embodiments of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium stores computer program instructions which, when executed by a processor, implement any of the disease risk prediction methods of the above embodiments.
Example 2
This embodiment is a specific application example of the present invention.
A method for predicting risk of disease based on ruminococcus microbiota comprising:
Step S501, acquiring the classification and flora metagenome information of human microbiome samples from the curated metagenomic data hosted in the ExperimentHub R library by using the curatedMetagenomicData package.
These comprise 10199 samples covering more than 30 health and disease types from 52 different studies.
Step S502, screening and downloading the metagenome data (flora taxonomic profile, flora relative abundance) and the general sample information (experimental scheme, disease state, age, sex, antibiotic usage, region (or country), smoking status and the like) of the stool samples.
The number of stool samples is 8799.
Step S503, determining feature data of the intestinal flora according to the relative abundance information and pre-screened biomarkers closely related to diseases and disease states, wherein the biomarkers of the diseases are pre-screened according to literature review and historical information on the relative abundance of differential bacteria, and the historical information on the relative abundance of differential bacteria is obtained by performing difference analysis on the historical relative abundance information of disease populations and healthy populations.
Step S504, standardizing the naming of the ruminococcus microbiota with reference to the National Center for Biotechnology Information (NCBI) taxonomy and Wikipedia, and merging the ruminococcus microbiota abundances from different studies (see table 1). Medical Subject Headings (MeSH) disease categories were used to standardize the naming and merging of diseases.
TABLE 1 standardized nomenclature of ruminococcus microbiota
TABLE 2 disease Classification Table
Serial number | Index | Encoding
1 | Healthy | 1
2 | Inflammation (bronchitis, cystitis, otitis, pneumonia) | 2
3 | Atherosclerosis | 3
4 | Tumor | 4
5 | Hypertension | 5
6 | Diabetes mellitus | 6
7 | Infection | 7
8 | Others | 0
Step S505, performing data cleaning to resolve data redundancy, missing values and abnormal values. Considering that a zero abundance value may reflect either a systematic error or a real absence, samples in which the abundances of all ruminococcus microbiota are 0 are deleted.
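The zero-abundance filter in this step can be sketched as follows. This is a minimal illustration using pandas, not the cleaning code of the embodiment; the table layout, the sample identifiers and the taxon column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical relative-abundance table: one row per sample, one column per taxon.
abundance = pd.DataFrame(
    {
        "Ruminococcus_bromii": [0.8, 0.0, 0.0, 1.2],
        "Ruminococcus_gnavus": [0.1, 0.0, 0.3, 0.0],
        "Ruminococcus_torques": [0.0, 0.0, 0.0, 0.5],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Keep a sample only if at least one taxon has a non-zero abundance.
has_signal = (abundance != 0).any(axis=1)
cleaned = abundance.loc[has_signal]

print(cleaned.index.tolist())  # s2 is all zeros and is dropped
```

Samples with some zero values are retained, since an individual zero may be a real absence; only samples that are zero across every column are removed.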
Step S506, considering that some fields are unordered binary or multi-class categorical variables, and that the numeric values assigned to them could otherwise influence the model, IBM SPSS Statistics 23.0 is used to convert the categorical variables other than the predicted variable (disease category), such as gender, antibiotic usage, disease classification, smoking status and country, into dummy variables, so as to improve the goodness of fit of the model (as shown in table 3).
Table 3 dummy variable handling table
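The embodiment performs this conversion in IBM SPSS Statistics. As an illustration only, the same kind of dummy-variable expansion can be sketched in Python with `pandas.get_dummies`; the column names and values below are hypothetical and do not reproduce table 3.

```python
import pandas as pd

# Hypothetical sample metadata; the predicted variable (disease_code) is left untouched.
meta = pd.DataFrame(
    {
        "gender": ["female", "male", "female"],
        "smoking_status": ["never", "current", "former"],
        "age": [34, 61, 47],
        "disease_code": [1, 6, 3],
    }
)

# Expand each unordered categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(meta, columns=["gender", "smoking_status"], dtype=int)

print(sorted(encoded.columns))
```

Each category becomes its own indicator column, so no artificial ordering (for example never < former < current) is imposed on the model.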
Step S507, training on the processed intestinal flora feature data (13 ruminococcus microbiota and 6 classification indexes (25 variables)) in the training set to obtain a disease type prediction model, and calculating the prediction accuracy of the different models by 10-fold cross-validation. The trained models are a random forest model (Random Forest), a decision tree model (Decision Tree) and an AdaBoost model (AdaBoost); the accuracy is shown in table 4, and the importance of each variable is shown in fig. 6.
In the invention, the RandomForestClassifier class in sklearn.ensemble is used for random forest training, the decision tree classifier in sklearn.tree is used for decision tree analysis, and the AdaBoostClassifier class in sklearn.ensemble is used for AdaBoost analysis.
TABLE 4 accuracy
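The comparison in step S507 can be sketched as below. This is a minimal example under stated assumptions: the data are synthetic (generated with `make_classification` to mimic 25 variables and 8 disease codes), and the hyperparameters are illustrative defaults rather than the settings behind table 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 25-variable feature matrix and 8 disease codes.
X, y = make_classification(
    n_samples=400, n_features=25, n_informative=10,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Mean accuracy over 10 cross-validation folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The scores printed for synthetic data say nothing about the real cohort; the sketch only shows how the three classifiers can be evaluated on an equal footing.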
Step S508, selecting the random forest model, and performing parameter optimization on the machine learning model by using a grid search algorithm.
Adjusting parameters with the grid search algorithm means stepping through the specified parameter range at a given step length, training a learner with each candidate value, and selecting, from all candidates, the value with the highest accuracy on the validation set.
Specifically, as shown in fig. 7, the grid search flow is as follows:
determining the parameter n_estimators, with a range of 0-200 and an initial step length of 10;
calculating the accuracy of the model corresponding to each parameter value by using a cross-validation method;
judging whether the search is finished, and returning to the previous step if it is not finished; executing the next step if it is finished;
and outputting the optimal parameters.
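The flow above can be sketched with scikit-learn's `GridSearchCV`. The helper name `search_n_estimators` and the synthetic data are assumptions for illustration. Because a random forest needs at least one tree, the grid in this sketch starts at 10 rather than 0, and the demonstration call uses a coarser step and fewer folds than the text so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (25 variables, 8 classes); not the patent's data set.
X, y = make_classification(
    n_samples=300, n_features=25, n_informative=10,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)


def search_n_estimators(X, y, start=10, stop=200, step=10, cv=10):
    """Return (best n_estimators, best mean CV accuracy) over an evenly spaced grid."""
    grid = {"n_estimators": list(range(start, stop + 1, step))}
    search = GridSearchCV(
        RandomForestClassifier(random_state=0), grid, cv=cv, scoring="accuracy"
    )
    search.fit(X, y)  # trains and cross-validates one model per grid value
    return search.best_params_["n_estimators"], search.best_score_


# Coarse demonstration run: candidates are 10, 60, 110 and 160.
best_n, best_score = search_n_estimators(X, y, step=50, cv=5)
print(best_n, round(best_score, 3))
```

A second pass with a smaller step around the coarse optimum would be one way to reach a value that does not lie on the initial grid.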
The framework parameters of the random forest model include n_estimators and oob_score, where n_estimators is the maximum number of decision trees in the random forest and is the parameter of main concern. The model score was evaluated using cross_val_score from sklearn.model_selection in Python, and grid search was used for model parameter adjustment; the best model, with a score of 0.853, is obtained when n_estimators is 121, as shown in table 5 and fig. 8.
TABLE 5 grid search results
Step S509, testing the machine learning model after parameter adjustment by using the test set and the external validation data.
Step S510, evaluating the performance of the machine learning model by using a confusion matrix according to the external validation result.
Specifically, the classification model is evaluated using the confusion matrix in sklearn, and the model is evaluated by calculating the Accuracy, Precision, Recall and F1 value (F1 score).
As shown in table 7 and fig. 9, in this matrix (8x8), the rows represent the true values of the samples and the columns represent the values predicted by the algorithm, so that the position in the ith row and jth column gives the number of samples whose true value is i and whose predicted value is j. The fewer prediction errors the classification algorithm makes, the better the model performs, and the more samples have both a true value and a predicted value of i (the positions in the ith row and ith column form the diagonal of the confusion matrix).
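A small worked example may help fix the row and column convention. The sketch below uses `sklearn.metrics.confusion_matrix`, which places true labels on the rows and predicted labels on the columns; the three-class labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted class codes for seven samples.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# cm[i, j] counts samples whose true class is i and predicted class is j.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]

correct = cm.trace()  # samples on the diagonal are classified correctly
print(correct)        # 5
```

Here `cm[2, 0]` is 1 because one sample of true class 2 was predicted as class 0, and the five correctly classified samples lie on the diagonal.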
Table 7 schematic representation of confusion matrix
T00 | F01 | F02 | F03 | F04 | F05 | F06 | F07
F10 | T11 | F12 | F13 | F14 | F15 | F16 | F17
F20 | F21 | T22 | F23 | F24 | F25 | F26 | F27
F30 | F31 | F32 | T33 | F34 | F35 | F36 | F37
F40 | F41 | F42 | F43 | T44 | F45 | F46 | F47
F50 | F51 | F52 | F53 | F54 | T55 | F56 | F57
F60 | F61 | F62 | F63 | F64 | F65 | T66 | F67
F70 | F71 | F72 | F73 | F74 | F75 | F76 | T77
In table 7, T denotes a correct (true) prediction and F denotes an incorrect (false) prediction; the first digit is the row index of the cell and the second digit is the column index of the cell.
The Accuracy is calculated as follows:
Accuracy: the proportion of correctly classified samples in the total number of samples.
A=(T00+T11+…+T77)/N=(121+118+111+94+125+76+121+131)/987=0.909.
The Precision is calculated as follows:
Precision: the proportion of the samples predicted as a given class that actually belong to that class.
P0=T00/(T00+F01+F02+…+F07)=121/(121+8+2+3+0+1+0)=0.896;
By the same calculation:
P1=0.983; P2=0.917; P3=0.879; P4=0.839; P5=0.974; P6=0.931; P7=0.891;
P=(P0+P1+…+P7)/8=0.914.
The Recall is calculated as follows:
Recall: the proportion of the samples actually belonging to a given class that are correctly predicted.
R0=T00/(T00+F10+F20+…+F70)=121/135=0.896;
By the same calculation:
R1=0.648; R2=0.974; R3=0.989; R4=0.969; R5=1.000; R6=0.992; R7=0.978;
R=(R0+R1+…+R7)/8=0.931.
The F1 value (F1 score) is calculated as follows:
F1 value: the harmonic mean of the precision and the recall.
F1=2*P*R/(P+R)=2*0.914*0.931/(0.914+0.931)=0.922.
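The arithmetic above can be checked with a few lines of plain Python. The sketch takes the diagonal counts, the sample total and the per-class precision and recall values quoted in the text as inputs; the variable names are chosen for the example.

```python
# Figures quoted in the text above.
diagonal = [121, 118, 111, 94, 125, 76, 121, 131]  # T00 ... T77
total_samples = 987                                 # N
precision = [0.896, 0.983, 0.917, 0.879, 0.839, 0.974, 0.931, 0.891]  # P0 ... P7
recall = [0.896, 0.648, 0.974, 0.989, 0.969, 1.000, 0.992, 0.978]     # R0 ... R7

# Accuracy: correctly classified samples over all samples.
accuracy = sum(diagonal) / total_samples

# Macro averages: unweighted means over the eight classes.
macro_p = sum(precision) / len(precision)
macro_r = sum(recall) / len(recall)

# F1 as defined in the text: harmonic mean of the averaged P and R.
f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"A={accuracy:.4f} P={macro_p:.4f} R={macro_r:.4f} F1={f1:.4f}")
```

To three decimal places these agree with the values 0.909, 0.914, 0.931 and 0.922 stated above. Note that this F1 is the harmonic mean of the averaged precision and recall, which differs from the macro F1 reported by `sklearn.metrics.f1_score(average="macro")`, where F1 is computed per class and then averaged.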
Specific results are shown in table 7.
TABLE 7 external verification results Table
A grayscale image is drawn from the error matrix (error_matrix). As shown in fig. 10, the types and numbers of classification errors made by the algorithm can be judged from the brightness of the grayscale image. In this embodiment, the brightness along the diagonal of the matrix is generally high, which indicates that the model predicts well.
Step S511, predicting the disease type by using the disease prediction model that has passed the performance evaluation.
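One way to turn a confusion matrix into gray levels is sketched below with NumPy. This is an assumption about how such an image could be produced, not the plotting code behind fig. 10; the 3-class matrix is hypothetical, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm = np.array([[45, 3, 2],
               [4, 40, 6],
               [1, 2, 47]])

# Divide each row by its total so every cell becomes a per-class rate in [0, 1].
row_totals = cm.sum(axis=1, keepdims=True)
rates = cm / row_totals

# Scale to 8-bit gray levels: 255 is white (bright), 0 is black (dark).
gray = np.rint(rates * 255).astype(np.uint8)
print(gray)
```

Row normalization keeps classes with many samples from looking brighter merely because of their size. The resulting array could then be displayed, for example with matplotlib's `matshow(gray, cmap="gray")`, where a bright diagonal and dark off-diagonal cells correspond to few misclassifications.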
A more specific embodiment of the present invention is as follows:
collecting fresh or properly frozen human feces, placing the feces on dry ice for preservation within 30 minutes, and transferring them to a -80 °C freezer as soon as possible for storage until intestinal metagenome sequencing is carried out;
extracting DNA, and performing quality control on the extracted nucleic acid by agarose gel electrophoresis, wherein the total amount of DNA is greater than or equal to 1 μg and the DNA concentration is greater than or equal to 20 ng/μL;
constructing a library for the samples of qualified quality, and carrying out paired-end sequencing on an Illumina HiSeq 4000;
after obtaining the raw metagenome paired-end sequencing data, performing quality control on the data by using Trimmomatic software to remove low-quality sequences and adapters, and evaluating the data after quality control by using FastQC software;
performing metagenome species annotation analysis on the data after quality control by using MetaPhlAn2 software;
acquiring the species abundance information of the intestinal flora of the population;
modeling with a machine learning method and a ten-fold cross-validation method, randomly dividing the data into a training set and a test set, adjusting parameters by grid search, and selecting the optimal parameters;
and acquiring a batch of external data that has never participated in modeling, using the constructed model to predict this batch of data, and judging the quality of the prediction model through a confusion matrix.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.