CN119360969A

CN119360969A - Biomarker screening method and system based on genomics

Info

Publication number: CN119360969A
Application number: CN202411946317.4A
Authority: CN
Inventors: 侯艳
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2024-12-27
Filing date: 2024-12-27
Publication date: 2025-01-24

Abstract

The present application discloses a biomarker screening method and system based on genomics, which relates to the field of biological detection technology. The method includes: determining biomarker expression data related to significant genes based on transcriptome sequencing data of cancer disease biological samples, and estimating the posterior distribution of the association between biomarkers and cancer diseases through a logistic regression model; determining significant interaction genes based on the posterior distribution and the gene interaction network; and screening to obtain a list of biomarkers related to cancer diseases based on the significant interaction genes. Since the present application screens the posterior distribution and significant interaction genes through transcriptome sequencing data of cancer disease biological samples, it avoids the situation of missed detection due to low specificity of biomarkers in the traditional biomarker screening process, thereby obtaining a list of biomarkers with high correlation with cancer diseases, and improving the sensitivity of the biomarker screening process.

Description

Biomarker screening method and system based on genomics

Technical Field

The application relates to the technical field of biological detection, in particular to a biomarker screening method and a biomarker screening system based on genomics.

Background

Cancer (malignant tumor) is a major disease threatening human life and health. Its incidence and mortality are currently rising year by year worldwide, and the situation is serious. Cancer biomarkers have strong disease specificity and are useful for early diagnosis and for evaluating treatment efficacy. Researchers have used a variety of techniques to discover cancer biomarkers, including immunological, molecular biological, and genomic techniques.

However, several factors lower the sensitivity of biomarker screening: biomarkers interact with one another, disease samples differ, biomarker levels vary between individuals, and the expression levels of some biomarkers change only slightly in early or mild disease. At the same time, some biomarkers cross-react with other diseases, which reduces their specificity. As a result, missed detections occur easily, and the diagnosis of patients is missed or delayed.

Disclosure of Invention

The main purpose of the application is to provide a genomics-based biomarker screening method and system, aiming to solve the technical problem that low biomarker specificity and low screening sensitivity in traditional biomarker screening easily lead to missed detections, so that the diagnosis of patients is missed or delayed.

To achieve the above object, the present application provides a method for screening biomarkers based on genomics, the method comprising:

determining biomarker expression data related to significant genes according to transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics;

estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes;

determining significant interacting genes in the cancer disease biological sample according to the posterior distribution and an inter-gene interaction network;

screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.

In one embodiment, the step of determining biomarker expression data associated with a significant gene based on transcriptome sequencing data of a cancer disease biological sample comprises:

obtaining multiple types of cancer disease biological samples;

performing transcription amplification on the cancer disease biological samples, and sequencing them with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data;

performing cluster screening on the transcriptome sequencing data to obtain the biomarker expression data related to the significant genes.

In one embodiment, the step of cluster screening the transcriptome sequencing data to obtain biomarker expression data associated with a significant gene comprises:

performing preliminary screening on the transcriptome sequencing data according to gene expression levels to obtain gene difference data;

performing an association test on the gene difference data to obtain association-tested gene difference data;

performing feature screening on the association-tested gene difference data through a hierarchical clustering algorithm to obtain a hierarchical association result between genes and the cancer disease;

determining the biomarker expression data related to the significant genes according to the hierarchical association result.

In one embodiment, the step of estimating a posterior distribution of biomarkers associated with cancer disease by a logistic regression model based on biomarker expression data of the significant gene comprises:

using a logistic regression model to describe the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes;

processing the unnormalized association between the biomarkers and the cancer disease by a Bayesian method to obtain the posterior distribution of the association between the biomarkers and the cancer disease.

In one embodiment, the step of determining significant interacting genes in a biological sample for a cancer disease based on the posterior distribution and the intergenic interaction network comprises:

determining the significant interactivity of genes in the cancer disease state and in the normal state based on the posterior distribution and the inter-gene interaction network;

determining the significant interacting genes in the cancer disease biological sample based on the significant interactivity.

In one embodiment, the step of screening for a list of biomarkers associated with a cancer disease based on the significant interacting genes comprises:

performing feature importance assessment on the significant interacting genes to obtain importance scores;

sorting the importance scores to obtain a score list;

screening the significant interacting genes based on the score list to obtain the list of biomarkers associated with the cancer disease.

In one embodiment, the step of performing a feature importance assessment on the significant interaction gene to obtain an importance score comprises:

initializing a gradient boosting decision tree model;

training the gradient boosting decision tree model on a cancer gene database;

performing feature importance assessment on the significant interacting genes according to the trained gradient boosting decision tree model to obtain the importance scores.

In addition, to achieve the above object, the present application also proposes a genomics-based biomarker screening system, the system comprising:

a sequencing data module for determining biomarker expression data related to significant genes according to transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics;

a posterior distribution module for estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes;

an interaction module for determining significant interacting genes in the cancer disease biological sample according to the posterior distribution and an inter-gene interaction network;

a marker screening module for screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.

In one embodiment, the sequencing data module is further configured to obtain multiple types of cancer disease biological samples;

The sequencing data module is also used for carrying out transcription amplification on the cancer disease biological sample, and sequencing to obtain transcriptome sequencing data by taking functional genomics and clinical genomics as references;

The sequencing data module is further configured to perform cluster screening on the transcriptome sequencing data to obtain the biomarker expression data related to the significant genes.

In one embodiment, the sequencing data module is further configured to perform preliminary screening on the transcriptome sequencing data according to a gene expression level to obtain gene difference data;

the sequencing data module is also used for carrying out association test according to the gene difference data to obtain the gene difference data after the association test;

The sequencing data module is also used for carrying out feature screening on the gene difference data after the association test through a hierarchical clustering algorithm to obtain a hierarchical association result of genes and cancer diseases;

The sequencing data module is also used for determining biomarker expression data related to the significant genes according to the hierarchical association result.

The application provides one or more technical solutions, which have at least the following technical effects. First, biomarker expression data related to significant genes is determined according to transcriptome sequencing data of cancer disease biological samples, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics. Next, the posterior distribution of the association between biomarkers and the cancer disease is estimated through a logistic regression model based on the biomarker expression data of the significant genes. Then, significant interacting genes in the cancer disease biological samples are determined according to the posterior distribution and the inter-gene interaction network. Finally, a list of biomarkers associated with the cancer disease is obtained by screening based on the significant interacting genes. Because the application derives the posterior distribution of the biomarker-cancer association and the significant interacting genes from transcriptome sequencing data of cancer disease biological samples, it avoids the missed detections caused by low biomarker specificity in traditional biomarker screening, thereby obtaining a list of biomarkers highly correlated with the cancer disease and improving the sensitivity of the biomarker screening process.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of an embodiment of the genomics-based biomarker screening method according to the present application;

FIG. 2 is a schematic flow chart of a second embodiment of the genomics-based biomarker screening method according to the present application;

FIG. 3 is a schematic flow chart of a third embodiment of the genomics-based biomarker screening method according to the present application;

FIG. 4 is a schematic diagram of the functional modules of a genomics-based biomarker screening system according to an embodiment of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.

For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.

The execution subject of this embodiment may be a computing service device with transcriptome sequencing, posterior distribution calculation, and marker screening functions, such as a personal computer or a server, or it may be an electronic device or a genomics-based biomarker screening system (hereinafter, the screening system) capable of implementing these functions and of performing the genomics-based biomarker screening method of the present application; this embodiment does not limit it. This embodiment and the following embodiments are described with the screening system as the example.

Based on the above, an embodiment of the application provides a genomics-based biomarker screening method. Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of the genomics-based biomarker screening method.

In this embodiment, the genomics-based biomarker screening method includes steps S10 to S40:

Step S10: determining biomarker expression data related to significant genes according to transcriptome sequencing data of a cancer disease biological sample, wherein the transcriptome sequencing data is constructed on the basis of functional genomics and clinical genomics.

A cancer disease biological sample is any of the various biological materials related to cancer.

By way of example, a biological sample of a cancer disease may include tissue specimens (e.g., tumor tissue, paracancerous tissue), blood (e.g., whole blood, plasma, serum), body fluids (e.g., urine, cerebrospinal fluid, hydrothorax, etc.), cells (e.g., circulating tumor cells, etc.), and the like, obtained from a cancer patient.

Transcriptome sequencing data is data obtained by sequencing RNA transcribed from a biological sample of a cancer disease under a specific condition.

For example, the transcripts may include messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and other non-coding RNAs. Transcriptome sequencing allows gene expression in cancer disease biological samples to be analyzed comprehensively and quantitatively.

Biomarker expression data is the gene data within the transcriptome sequencing data that significantly expresses biomarkers closely related to cancer diseases.

Functional genomics is the study of gene functions, including gene expression control, gene product functions, gene-to-gene interactions, and the like.

Clinical genomics is the application of genomic techniques and research results to clinical practice involving the use of genomic information to diagnose disease, predict disease prognosis, etc.

Constructing the transcriptome sequencing data according to functional genomics and clinical genomics standards gives a better understanding of the molecular mechanisms and individual differences of cancer diseases, supports the medical application of cancer biomarker screening, and enables more accurate and effective biomarker screening.

In one embodiment, the cancer disease biological sample may first undergo transcription amplification, after which transcriptome sequencing data is acquired according to functional genomics and clinical genomics standards. The screening system then screens this data for features related to the cancer disease to obtain biomarker expression data of significant genes closely associated with the cancer disease.

Step S20: estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes.

It should be noted that a logistic regression model is a statistical model for binary (or multi-class) classification problems. A logistic regression model can be used to describe, and thereby determine, the association between a biomarker and a cancer disease.

Biomarkers are biochemical molecules that can mark changes in the structure of systems, organs, tissues, cells, subcellular components, and genes. Biomarkers help diagnose cancer diseases more accurately.

It is noted that a posterior distribution is a probability distribution of the association between a biomarker and a cancer disease. The role of biomarkers in cancer genesis and its uncertainty can be deeply understood by posterior distribution.

In one embodiment, the biomarker expression data may be used as input variables and combined with information related to the cancer disease; a logistic regression model then gives a probabilistic description of the degree of association between the biomarkers and the cancer disease, yielding the posterior distribution of that association.
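As a minimal sketch of how such a posterior distribution could be approximated for a single biomarker, the following Python example fits a logistic regression with a Gaussian prior on the coefficients and uses a Laplace (Gaussian) approximation around the most probable coefficients. The prior, the Laplace approximation, the function names (`laplace_posterior`, `prob_positive`), and the toy data are assumptions for illustration only; the application does not specify them.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_hess(b0, b1, x, y, lam):
    """Gradient and Hessian of the negative log posterior at (b0, b1)."""
    g0, g1 = lam * b0, lam * b1        # contribution of the Gaussian prior
    h00, h01, h11 = lam, 0.0, lam
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += p - yi
        g1 += (p - yi) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def laplace_posterior(x, y, prior_sd=2.0, max_iter=100):
    """Return (posterior mean, posterior sd) of the slope b1 in
    logit P(cancer) = b0 + b1 * expression, assuming N(0, prior_sd^2) priors."""
    lam = 1.0 / prior_sd ** 2
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = grad_hess(b0, b1, x, y, lam)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^-1 g
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 - d0, b1 - d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    _, _, h00, h01, h11 = grad_hess(b0, b1, x, y, lam)
    var_b1 = h00 / (h00 * h11 - h01 * h01)  # entry (1, 1) of H^-1
    return b1, math.sqrt(var_b1)

def prob_positive(mean, sd):
    """P(b1 > 0) under the Gaussian approximation N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))

# Hypothetical expression levels of one biomarker and cancer labels (1 = cancer).
expression = [0.2, 0.5, 1.1, 1.4, 2.0, 2.3, 2.9, 3.1]
has_cancer = [0, 0, 0, 1, 0, 1, 1, 1]
slope_mean, slope_sd = laplace_posterior(expression, has_cancer)
association_prob = prob_positive(slope_mean, slope_sd)
```

Here `association_prob` summarizes the posterior as the probability that higher expression is associated with the cancer disease, while `slope_sd` expresses the remaining uncertainty. A full implementation could instead sample the posterior, for example by Markov chain Monte Carlo.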

Step S30: determining significant interacting genes in the cancer disease biological sample according to the posterior distribution and the inter-gene interaction network.

The inter-gene interaction network is a complex network structure formed by various interaction relationships between a plurality of genes.

In organisms, genes do not function in isolation. The expression and function of one gene may be affected by other genes, and several modes of interaction, such as synergy and antagonism, may exist. These interactions interweave to form a vast network.

It should be noted that a significant interacting gene is a gene whose interaction relationships are particularly prominent in cancer disease biological samples and which may play a key role in the development and progression of cancer.

For example, in a cancer sample, the interaction of certain genes may be significantly altered, such as an increase or decrease in synergy or antagonism between them in the cancer state, and the genes may be considered to have significant interactions in the cancer. Through the network of interactions between genes, the interactions between genes can be better understood, revealing the association and synergy pattern between genes that play a key role in the development and progression of cancer.

In one embodiment, based on the posterior distribution of the association between biomarkers and the cancer disease and in combination with the inter-gene interaction network, genes whose interaction relationships are particularly prominent and which play a key role in the development and progression of cancer can be found. This provides an important basis for accurately screening biomarkers related to the cancer disease and helps discover new biomarkers.

Step S40: screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.

It should be noted that the biomarker list is a list of biomarkers related to the cancer disease. The biomarker list gives a better understanding of the association between biomarkers and the cancer disease and facilitates diagnosis of the disease.

In one embodiment, the significant interacting genes obtained above may be arranged in order of degree of interaction or classification of different physiological functions, and screened to obtain a biomarker list associated with cancer disease.

In this embodiment, the cancer disease biological sample may first undergo transcription amplification, after which transcriptome sequencing data is acquired according to functional genomics and clinical genomics standards, and the screening system screens this data for features related to the cancer disease to obtain biomarker expression data of significant genes closely associated with the cancer disease. The biomarker expression data may then be used as input variables and combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between the biomarkers and the cancer disease, yielding the posterior distribution of that association. Next, based on this posterior distribution and in combination with the inter-gene interaction network, genes whose interaction relationships are particularly prominent and which play a key role in the development and progression of cancer can be found. This provides an important basis for accurately screening biomarkers related to the cancer disease and helps discover new biomarkers. Finally, the significant interacting genes obtained may be arranged by degree of interaction or classified by physiological function, and screened to obtain a list of biomarkers associated with the cancer disease.

In a possible implementation manner, the step S40 of the embodiment may include the steps of evaluating the feature importance of the significant interaction genes, obtaining importance scores, sorting the importance scores to obtain a score list, and screening the significant interaction genes based on the score list to obtain a biomarker list related to cancer diseases.

The importance score is a score obtained by evaluating the respective importance levels of the significant interacting genes based on the degree of association with cancer diseases, and expressing the importance levels in the form of a numerical value.

In this embodiment, the feature importance of the significant interacting genes may be evaluated according to their degree of association with the cancer disease to obtain importance scores, and the importance scores are then sorted to obtain a score list. Finally, the biological molecules corresponding to the significant interacting genes at the top of the score list (for example, the first ten or the first several tens) are taken as biomarkers, yielding the list of biomarkers associated with the cancer disease. Using the higher-scoring genes as biomarkers helps focus on key genes.
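The sorting and top-of-list selection described above can be sketched in a few lines of Python. The function name `top_biomarkers`, the gene identifiers, and the scores are hypothetical and serve only as an illustration.

```python
def top_biomarkers(importance_scores, k=10):
    """Rank genes by descending importance score (ties broken by gene name)
    and keep the first k entries of the resulting score list."""
    ranked = sorted(importance_scores.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]

# Hypothetical importance scores for four significant interacting genes.
scores = {"GENE_A": 0.42, "GENE_B": 0.07, "GENE_C": 0.31, "GENE_D": 0.20}
biomarker_list = top_biomarkers(scores, k=3)
```

The cutoff `k` corresponds to the "first ten or first several tens" mentioned in this embodiment and would be chosen per application.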

In a possible implementation, the step of evaluating the feature importance of the significant interacting genes to obtain importance scores includes: initializing a gradient boosting decision tree model; training the gradient boosting decision tree model on a cancer gene database; and evaluating the feature importance of the significant interacting genes according to the trained gradient boosting decision tree model to obtain the importance scores.

It should be noted that the gradient boosting decision tree model is an ensemble learning model that combines multiple decision trees to predict and evaluate the feature importance of the significant interacting genes. During construction, different decision trees can be generated through random feature selection, sample sampling, and similar means, and together these decision trees form a forest with which the significant interacting genes can be evaluated and classified.

The cancer gene database is a database that collects, organizes, and stores gene information related to cancer, such as gene sequences, mutation status, expression levels, associations with the onset and development of cancer, and related clinical information.

In this embodiment, the gradient boosting decision tree model may first be initialized. The model is then trained on the cancer gene database so that it learns a wide range of cancer-related gene information. Finally, the trained gradient boosting decision tree model is used to predict and evaluate the feature importance of the significant interacting genes, yielding the importance scores. In this way, the cancer gene database provides an important basis for finding biomarkers related to the cancer disease, improving the accuracy of the biomarker list.
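To illustrate how a gradient boosting model can yield feature importance scores, the following Python sketch boosts depth-one decision trees (stumps) on a squared-error loss and credits each feature with the error reduction of the splits that use it. This is a simplified illustration under stated assumptions, not the model of the application: the squared-error loss, the stump depth, the function names, and the toy matrix (rows are samples, columns are expression values of two hypothetical genes) are all chosen for brevity. In practice an established library implementation would normally be used.

```python
def best_stump(X, residual):
    """Find the single-feature threshold split that most reduces squared error.
    Returns (gain, feature, threshold, left_mean, right_mean) or None."""
    n = len(X)
    mean_r = sum(residual) / n
    base = sum((r - mean_r) ** 2 for r in residual)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residual) if row[j] <= t]
            right = [r for row, r in zip(X, residual) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            gain = base - sse
            if best is None or gain > best[0]:
                best = (gain, j, t, ml, mr)
    return best

def gbdt_importance(X, y, n_trees=20, learning_rate=0.3):
    """Boost stumps on the residuals (the negative gradient of squared error)
    and return normalized per-feature importance scores."""
    n, d = len(X), len(X[0])
    pred = [sum(y) / n] * n            # initial model: the mean label
    importance = [0.0] * d
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = best_stump(X, residual)
        if stump is None or stump[0] <= 1e-12:
            break                      # nothing left to improve
        gain, j, t, ml, mr = stump
        importance[j] += gain
        pred = [p + learning_rate * (ml if row[j] <= t else mr)
                for row, p in zip(X, pred)]
    total = sum(importance)
    return [v / total if total else 0.0 for v in importance]

# Hypothetical training data: column 0 separates the classes, column 1 does not.
X = [[0.1, 5], [0.2, 1], [0.3, 4], [0.4, 2], [0.9, 3], [1.0, 5], [1.1, 1], [1.2, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importance_scores = gbdt_importance(X, y)
```

In this toy case the first gene receives all of the importance because only its expression separates the two classes.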

This embodiment provides a genomics-based biomarker screening method. The cancer disease biological sample may first undergo transcription amplification, after which transcriptome sequencing data is acquired according to functional genomics and clinical genomics standards, and the screening system screens this data for features related to the cancer disease to obtain biomarker expression data of significant genes closely associated with the cancer disease. The biomarker expression data may then be used as input variables and combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between the biomarkers and the cancer disease, yielding the posterior distribution of that association. Next, based on this posterior distribution and in combination with the inter-gene interaction network, genes whose interaction relationships are particularly prominent and which play a key role in the development and progression of cancer can be found. This provides an important basis for accurately screening biomarkers related to the cancer disease and helps discover new biomarkers. Finally, the significant interacting genes obtained may be arranged by degree of interaction or classified by physiological function, and screened to obtain a list of biomarkers associated with the cancer disease.
Because the posterior distribution of the biomarker-cancer association and the significant interacting genes are derived from transcriptome sequencing data of cancer disease biological samples, the missed detections caused by low biomarker specificity in traditional biomarker screening are avoided, so that a list of biomarkers highly correlated with the cancer disease can be obtained and the sensitivity of the biomarker screening process is improved.

Based on the first embodiment of the present application, a second embodiment is provided. Content that is the same as or similar to the first embodiment can be found in the description above and is not repeated here. On this basis, referring to FIG. 1 and FIG. 2, FIG. 2 is a schematic flow chart of the second embodiment of the genomics-based biomarker screening method according to the present application.

Step S10 of this embodiment includes steps S11 to S13:

Step S11: obtaining multiple types of cancer disease biological samples.

Step S12: performing transcription amplification on the cancer disease biological samples, and sequencing them with functional genomics and clinical genomics as references to obtain transcriptome sequencing data.

Transcription amplification increases the number of RNA transcripts in a sample by specific technical means. The amplified sample is then subjected to sequencing analysis, which determines the sequence information of the RNA molecules, and the sequence information is screened with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data.

Step S13: performing cluster screening on the transcriptome sequencing data to obtain biomarker expression data related to the significant genes.

In this embodiment, a biological sample involved in a cancer disease (for example, cancer cells or cancer tissue) may undergo transcription amplification. The amplified sample is then subjected to sequencing analysis to determine the sequence information of the RNA molecules in the sample, and the result is screened with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data. Applying functional genomics and clinical genomics standards in this way gives a better understanding of the molecular mechanisms and individual differences of cancer diseases, supports the medical application of cancer biomarker screening, and allows biomarkers to be screened more accurately and effectively.

In a possible implementation manner, the step S13 of the embodiment can include the steps of performing preliminary screening on the transcriptome sequencing data according to the gene expression level to obtain gene difference data, performing association test on the gene difference data to obtain gene difference data after association test, performing feature screening on the gene difference data after association test through a hierarchical clustering algorithm to obtain a hierarchical association result of genes and cancer diseases, and determining biomarker expression data related to significant genes according to the hierarchical association result.

The gene difference data is data obtained by performing preliminary differential screening based on gene expression levels (e.g., comparing diseased and healthy gene expression levels).

Specifically, genes whose expression levels are significantly changed (increased or decreased) are selected by comparing the differences in the expression levels of the genes under different conditions (such as normal tissue and cancer tissue), and these genes form gene difference data.
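A preliminary differential screen of this kind can be sketched as follows. The use of a log2 fold change with a pseudocount, the threshold of 1.0, the function name `differential_genes`, and the toy expression values are assumptions for illustration; the application does not prescribe a particular statistic.

```python
import math

def mean(values):
    return sum(values) / len(values)

def differential_genes(cancer, normal, min_abs_log2fc=1.0, pseudocount=1.0):
    """Keep genes whose mean expression changes by at least the given
    absolute log2 fold change between cancer and normal samples."""
    selected = {}
    for gene in cancer:
        ratio = (mean(cancer[gene]) + pseudocount) / (mean(normal[gene]) + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            selected[gene] = log2fc    # positive: increased, negative: decreased
    return selected

# Hypothetical expression values (three samples per condition).
cancer_expr = {"GENE_A": [30, 34, 29], "GENE_B": [10, 10, 10], "GENE_C": [1, 1, 1]}
normal_expr = {"GENE_A": [6, 7, 8], "GENE_B": [10, 11, 9], "GENE_C": [7, 7, 7]}
gene_difference_data = differential_genes(cancer_expr, normal_expr)
```

The pseudocount keeps the ratio finite when a gene is not expressed in one of the conditions.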

The association test checks, after the preliminary differential screening, whether the gene difference data meets the criterion of being significantly associated with the cancer disease.
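One possible form of such an association test is a permutation test on the difference in mean expression between cancer and normal samples, sketched below. The choice of a permutation test, the significance level of 0.05, the function names, and the toy data are assumptions for illustration only.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(cancer, normal, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    observed = abs(mean(cancer) - mean(normal))
    pooled = list(cancer) + list(normal)
    n_cancer = len(cancer)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_cancer]) - mean(pooled[n_cancer:]))
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

def association_test(gene_difference_data, alpha=0.05):
    """Keep genes whose expression is associated with disease status at level alpha."""
    kept = {}
    for gene, (cancer, normal) in gene_difference_data.items():
        p = permutation_p_value(cancer, normal)
        if p < alpha:
            kept[gene] = p
    return kept

# Hypothetical per-gene expression: (cancer samples, normal samples).
data = {
    "GENE_A": ([9.1, 8.7, 9.4, 8.9, 9.0, 9.3], [3.0, 3.4, 2.8, 3.1, 3.3, 2.9]),
    "GENE_B": ([1.0, 3.0, 5.0, 7.0, 9.0, 11.0], [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]),
}
tested = association_test(data)
```

With many genes, a multiple-testing correction would normally be applied to the resulting p-values before thresholding.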

It should be noted that, the hierarchical clustering algorithm is a clustering analysis method, and the basic idea is to gradually merge or split samples or data points to form a hierarchical clustering result. The data which are strongly related to the cancer disease state in the gene difference data after the association test can be subjected to further hierarchical aggregation through a hierarchical clustering algorithm, a hierarchical clustering tree structure is constructed, the hierarchical relationship between the sample and the gene in the gene difference data after the association test is intuitively displayed, and a hierarchical association result is obtained.
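The bottom-up (merging) variant of hierarchical clustering can be sketched as follows. Average linkage, Euclidean distance, the function name `agglomerative`, and the toy expression profiles are assumptions for illustration; the recorded merge history plays the role of the hierarchical clustering tree.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def agglomerative(profiles, n_clusters):
    """Average-linkage agglomerative clustering of gene expression profiles.
    Returns the final clusters and the merge history (the tree, bottom up)."""
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters], merges

# Hypothetical expression profiles of four genes across three samples.
profiles = {
    "GENE_A": [1.0, 1.0, 1.0],
    "GENE_B": [1.1, 0.9, 1.0],
    "GENE_C": [5.0, 5.0, 5.0],
    "GENE_D": [5.2, 4.9, 5.1],
}
clusters, merges = agglomerative(profiles, n_clusters=2)
```

Calling the function with `n_clusters=1` records every merge and therefore the complete tree.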

In this embodiment, the transcriptome sequencing data may be initially screened for differences based on gene expression levels (e.g., comparing diseased to healthy gene expression levels) to obtain gene difference data. And then carrying out association test on the cancer disease with the standard which has obvious association with the cancer disease after the primary difference screening, and obtaining the gene difference data after the association test. And finally, carrying out feature screening on the gene difference data after the association test by a hierarchical clustering algorithm, carrying out further hierarchical aggregation on the data which are strongly associated with the cancer disease state in the gene difference data after the association test, constructing a hierarchical clustering tree structure, visually displaying the hierarchical relationship between the sample and the gene in the gene difference data after the association test, and obtaining a hierarchical association result so as to determine biomarker expression data related to the obvious gene. Therefore, the distribution and structural characteristics of the hierarchical relationship between the cancer sample and the gene are better understood through a hierarchical clustering algorithm, so that the accuracy of acquiring the biomarker expression data is improved.

In another possible embodiment, step S30 of this embodiment may include the following steps: determining the significant interactivity of genes between the cancer disease state and the normal state based on the posterior distribution and the inter-gene interaction network; and determining the significant interaction genes in the cancer disease biological sample based on the significant interactivity.

It should be noted that significant interactivity refers to a significant mutual association between genes that is present in a disease state, such as cancer, as compared with the normal state.

Specifically, this step determines how the relationships between genes in cancer differ from those under normal conditions. For example, some genes may cooperate with or restrain one another moderately under normal conditions, whereas in a cancerous state their interactions may become abnormally active or suppressed; such a significant change in interactivity may play a critical role in the development of cancer. Determining this significant interactivity allows the molecular mechanisms of cancer to be understood in depth and provides an important basis for screening biomarkers.

In this embodiment, the significant interactivity of genes between the cancer disease state and the normal state can be determined based on the posterior distribution of the association between biomarkers and the cancer disease, in combination with the inter-gene interaction network. Based on this significant interactivity, genes are then identified whose interaction relationships are particularly prominent and that play a critical role in the development and progression of cancer. Thus, by determining the significant interactivity, the molecular mechanism of cancer can be understood in depth, providing an important basis for screening biomarkers.
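One way to picture this step is a differential co-expression check on the edges of the interaction network. The sketch below is an assumption-laden illustration, not the procedure of the application: it compares Pearson correlations between the cancer and normal groups with a Fisher z test, and it uses a hypothetical per-gene summary `posterior_prob` (for example, the posterior probability that a gene is associated with the disease) to decide which edges to examine. The thresholds `min_prob` and `z_cut` and all gene names are likewise hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation (assumes neither vector is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r):
    """Fisher z transform, clipped so that |r| = 1 does not raise an error."""
    r = max(min(r, 0.999999), -0.999999)
    return math.atanh(r)

def differential_interaction_z(cancer, normal, g1, g2):
    """z statistic for the change in correlation between the two states."""
    r_c = pearson(cancer[g1], cancer[g2])
    r_n = pearson(normal[g1], normal[g2])
    n_c, n_n = len(cancer[g1]), len(normal[g1])  # each group needs more than 3 samples
    se = math.sqrt(1.0 / (n_c - 3) + 1.0 / (n_n - 3))
    return (fisher_z(r_c) - fisher_z(r_n)) / se

def significant_interaction_genes(cancer, normal, edges, posterior_prob,
                                  min_prob=0.9, z_cut=1.96):
    selected = set()
    for g1, g2 in edges:
        # Only examine edges touching a gene that the posterior supports.
        if max(posterior_prob[g1], posterior_prob[g2]) < min_prob:
            continue
        if abs(differential_interaction_z(cancer, normal, g1, g2)) >= z_cut:
            selected.update((g1, g2))
    return sorted(selected)

# Toy data: G1 and G2 are positively correlated in cancer, negatively in normal.
cancer = {
    "G1": [1, 2, 3, 4, 5, 6],
    "G2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "G3": [2.0, 1.0, 2.5, 1.5, 2.2, 1.8],
}
normal = {
    "G1": [1, 2, 3, 4, 5, 6],
    "G2": [6.0, 5.1, 3.9, 3.0, 2.1, 0.9],
    "G3": [2.1, 0.9, 2.4, 1.6, 2.1, 1.9],
}
edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G3")]
posterior_prob = {"G1": 0.97, "G2": 0.55, "G3": 0.20}
result = significant_interaction_genes(cancer, normal, edges, posterior_prob)
```

Here only the G1-G2 edge changes markedly between the two states, so those two genes are returned as significant interaction genes.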

In this embodiment, a biological sample related to a cancer disease (such as cancer cells or cancer tissue) can be subjected to transcription amplification. The amplified sample is then sequenced to determine the sequence information of the RNA molecules in the sample, and the results are screened with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data. Using the standards of functional genomics and clinical genomics in this way gives a better understanding of the molecular mechanisms and individual differences of cancer diseases and provides important support for the medical application of cancer biomarker screening, so that biomarkers can be screened more accurately and effectively. Further, the significant interactivity of genes between the cancer disease state and the normal state can be determined based on the posterior distribution of the association between biomarkers and the cancer disease, in combination with the inter-gene interaction network. Based on this significant interactivity, genes are then identified whose interaction relationships are particularly prominent and that play a critical role in the development and progression of cancer. Thus, by determining the significant interactivity, the molecular mechanism of cancer can be understood in depth, providing an important basis for screening biomarkers.

In the third embodiment of the present application, contents that are the same as or similar to those of the first and second embodiments can be found in the description above and are not repeated here. On this basis, please refer to fig. 3, which is a schematic flow chart of the third embodiment of the genomics-based biomarker screening method of the present application.

Step S20 of this embodiment includes steps S21 to S22:

Step S21, based on the biomarker expression data of the significant genes, a logistic regression model is used to describe the association between the biomarkers and the cancer disease.

Step S22, a non-normalized joint posterior of the association between the biomarkers and the cancer disease is formed by a Bayesian method, so as to obtain the posterior distribution of the association between the biomarkers and the cancer disease.

The Bayesian method is a statistical inference method based on Bayes' theorem, which describes how the estimate of an event's probability is updated with newly observed data, given some prior information. That is, the posterior distribution of the association between the biomarker and the cancer disease is calculated from the prior information on that association together with the observed data.
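As a brief illustration, Bayes' theorem for a generic parameter and generic data can be written as follows. The symbols $\theta$ (unknown parameter) and $D$ (observed data) are introduced here only for illustration.

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta)
```

The right-hand side, the likelihood multiplied by the prior, is the non-normalized posterior; the normalizing constant $p(D)$ does not depend on $\theta$ and is not needed by sampling-based estimation methods.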

Specifically, for ease of understanding, the processing procedure is as follows:

1. First, a model of the binary outcome is built.

Let $s$ denote the index of a study contributing to the aggregation (i.e., a source of biomarker expression data). For individual $i$ in study $s$, $Y_{si}$ represents the binary disease outcome, $X_{si}$ represents the biomarker exposure measurement from a reference laboratory (reference measurement), $W_{si}$ represents the biomarker measurement from the study-specific local laboratory (local measurement), and $Z_{si}$ represents the vector of other covariates. Assume that the local measurement $W_{si}$ is available for all samples in a study, but that only a portion of the samples have a reference measurement. For individual $i$ in study $s$, $R_{si} = 1$ indicates that the reference measurement is available; otherwise $R_{si} = 0$.

Assume that the distribution of the reference measurement $X_{si}$ is consistent across studies, while the distribution of the local measurement $W_{si}$ is heterogeneous across studies. In addition, the outcome $Y_{si}$ and the local measurement $W_{si}$ can be assumed to be conditionally independent given the reference measurement $X_{si}$ (Assumption 1), which means:

$$P(Y_{si} \mid X_{si}, W_{si}, Z_{si}) = P(Y_{si} \mid X_{si}, Z_{si});$$

For convenience, it is assumed that the covariates have no effect on the study-specific measurement bias, i.e., $f(W_{si} \mid X_{si}, Z_{si}) = f(W_{si} \mid X_{si})$. The likelihood function containing all samples can then be expressed as follows:

$$L(\theta) = \prod_{s} \prod_{i=1}^{n_s} L_{si}(\theta);$$

wherein,

$$L_{si}(\theta) = \left[ f(Y_{si}, W_{si}, X_{si} \mid Z_{si}) \right]^{R_{si}} \left[ \int f(Y_{si}, W_{si}, x \mid Z_{si})\, dx \right]^{1 - R_{si}};$$

For convenience, $\theta$ represents the set of all parameters, described in turn below, and $n_s$ is the total number of samples in study $s$. Further applying the conditional independence Assumption 1, the following equation can be derived:

$$f(Y_{si}, W_{si}, X_{si} \mid Z_{si}) = f(Y_{si} \mid X_{si}, Z_{si})\, f(W_{si} \mid X_{si})\, f(X_{si} \mid Z_{si});$$

It can be observed that the likelihood function comprises three components, namely the biomarker-disease association $f(Y_{si} \mid X_{si}, Z_{si})$, the reference-local measurement correlation $f(W_{si} \mid X_{si})$, and the reference prior $f(X_{si} \mid Z_{si})$.

A logistic regression model with a random intercept term is used to describe the association between the biomarker and the disease, as follows:

$$\operatorname{logit} P(Y_{si} = 1 \mid X_{si}, Z_{si}) = \beta_{0s} + \beta_x X_{si} + \beta_z^{\top} Z_{si};$$

wherein $\beta_{0s}$ is the study-specific intercept and $\operatorname{logit}(p) = \log\{p / (1 - p)\}$ is the inverse of the logistic function. The primary purpose is to estimate $\beta_x$, the logarithm of the odds ratio (OR), which describes the relationship between the biomarker and the disease.

The model used to describe the reference-local measurement correlation is called the calibration model. Assuming that there is a linear correlation between the reference measurement $X_{si}$ and the local measurement $W_{si}$, and that $X_{si}$ and the measurement error $\varepsilon_{si}$ both obey normal distributions, then:

$$W_{si} = a_s + b_s X_{si} + \varepsilon_{si}, \quad \varepsilon_{si} \sim N(0, \sigma_s^2);$$

and

$$X_{si} \sim N(\mu_x, \sigma_x^2);$$

$\beta_{0s}$, $\beta_x$, $\beta_z$, $a_s$, $b_s$, $\sigma_s^2$, $\mu_x$ and $\sigma_x^2$ can be regarded as elements of $\theta$. The likelihood function can now be fully expressed.

Since the likelihood function involves the unobserved reference measurements $X_{si}$ (those with $R_{si} = 0$), maximum likelihood estimation introduces challenging integration calculations. Furthermore, the hierarchical study-biological specimen structure is not taken into account in the estimation of the parameters. Thus, a Bayesian approach can be chosen to obtain the posterior distribution of $\theta$. In the Bayesian method, an unobserved $X_{si}$ is treated as a latent variable and can be converted into a quantity to be estimated by constructing an appropriate prior distribution. Likewise, the hierarchical structure can be reflected in the prior specification of parameters such as $\beta_{0s}$, which improves estimation efficiency.
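The structure of a single sample's likelihood contribution can be sketched numerically. The sketch is illustrative only: the parameter names in `params` (`beta0`, `beta_x`, `beta_z`, `a`, `b`, `sigma`, `mu_x`, `sigma_x`) and their values are assumptions, as is the linear-normal calibration form, and the integral over an unobserved reference measurement is approximated with a simple trapezoidal rule.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def joint_density(y, w, x, z, p):
    """Product of the three components for one sample at reference value x."""
    eta = p["beta0"] + p["beta_x"] * x + sum(b * v for b, v in zip(p["beta_z"], z))
    prob = expit(eta)
    f_y = prob if y == 1 else 1.0 - prob                   # biomarker-disease association
    f_w = normal_pdf(w, p["a"] + p["b"] * x, p["sigma"])   # calibration model
    f_x = normal_pdf(x, p["mu_x"], p["sigma_x"])           # reference prior
    return f_y * f_w * f_x

def likelihood_contribution(y, w, z, p, x_ref=None, n_grid=2001, width=8.0):
    """Use x_ref when it was measured; otherwise integrate it out numerically."""
    if x_ref is not None:
        return joint_density(y, w, x_ref, z, p)
    lo = p["mu_x"] - width * p["sigma_x"]
    hi = p["mu_x"] + width * p["sigma_x"]
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        weight = 0.5 if k in (0, n_grid - 1) else 1.0  # trapezoidal end points
        total += weight * joint_density(y, w, lo + k * h, z, p)
    return total * h

# Hypothetical parameter values for one study.
params = {"beta0": -0.5, "beta_x": 0.8, "beta_z": [0.3],
          "a": 0.2, "b": 1.1, "sigma": 0.5, "mu_x": 0.0, "sigma_x": 1.0}

with_reference = likelihood_contribution(1, 0.7, [1.0], params, x_ref=0.4)
without_reference = likelihood_contribution(1, 0.7, [1.0], params)
```

Every sample without a reference measurement requires such an integral at every parameter value tried, which is the computational burden that motivates treating the unobserved reference measurements as latent variables instead.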

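2. Then, the prior distributions are specified.

Briefly, assume that the joint prior distribution of the log odds ratio $\beta_x$ and the covariate coefficients $\beta_z$ factorizes and is independent of the remaining parameters, denoted $\theta_{-\beta}$, that is:

$$p(\beta_x, \beta_z, \theta_{-\beta}) = p(\beta_x)\, p(\beta_z)\, p(\theta_{-\beta});$$

Since the reference measurements in each study are assumed to follow the same distribution, the reference measurements $X_{si}$ and $X_{s'i'}$ of any two samples should be identically distributed. Thus, $X_{si}$, its mean $\mu_x$ and its variance $\sigma_x^2$ are not independent. Based on the assumption of a normal distribution, the prior distribution of these three variables can be expressed as:

$$p(X_{si}, \mu_x, \sigma_x^2) = N(X_{si} \mid \mu_x, \sigma_x^2)\, p(\mu_x, \sigma_x^2);$$

The prior for $\mu_x$ and $\sigma_x^2$ may be set based on reference measurements in the actual scenario, or conjugate priors (normal and inverse gamma distributions) may be used. In this study, a non-informative prior is placed on $\mu_x$ and $\sigma_x^2$.

For $\beta_x$ and $\beta_z$, it is suggested to use informative or weakly informative priors to regularize the estimates and avoid severe instability. For example, a standard normal distribution may be used, as follows:

$$\beta_x \sim N(0, 1), \quad \beta_z \sim N_p(0, I_p);$$

wherein $I_p$ is the $p \times p$ identity matrix and $p$ is the number of covariates.

For the study-specific parameters, namely the intercept $\beta_{0s}$, the calibration intercept $a_s$, the calibration slope $b_s$ and the error variance $\sigma_s^2$, each is assumed to be sampled from a common distribution across studies. This corresponds to a "random effects" model. However, in the Bayesian framework it is not necessary to assume that the studies are sampled from a super-population. Instead, this is replaced by the qualitative assumption of "exchangeability", which means that there is no reason to believe a priori that the studies differ systematically. Here, the simplest assumption is retained, namely that each study-specific parameter is an independent draw from a population distribution governed by unknown hyperparameters. For $\beta_{0s}$, $a_s$ and $b_s$, the population distribution is normal, as follows:

$$\beta_{0s} \sim N(\mu_{\beta_0}, \tau_{\beta_0}^2), \quad a_s \sim N(\mu_a, \tau_a^2), \quad b_s \sim N(\mu_b, \tau_b^2);$$

and for $\sigma_s^2$ it is an inverse gamma distribution, as follows:

$$\sigma_s^2 \sim \mathrm{IG}(\alpha, \lambda);$$

wherein $\mu_{\beta_0}$, $\tau_{\beta_0}$, $\mu_a$, $\tau_a$, $\mu_b$, $\tau_b$, $\alpha$ and $\lambda$ are the corresponding hyperparameters, also considered part of the unknown parameters, and $x \sim \mathrm{IG}(\alpha, \lambda)$ denotes that $x$ follows the inverse gamma distribution with density:

$$f(x \mid \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha - 1} \exp\!\left(-\frac{\lambda}{x}\right), \quad x > 0;$$

As with $\beta_x$ and $\beta_z$, it is proposed to assign suitable weakly informative priors to these hyperparameters. For the mean hyperparameters, a standard normal distribution may be employed, as follows:

$$\mu_{\beta_0},\ \mu_a,\ \mu_b \sim N(0, 1);$$

for the positive hyperparameters, a half-Cauchy distribution may be used.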
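The prior families named above (normal, inverse gamma, half-Cauchy) can be written as log-density functions, which is the form a sampler typically consumes. This is a minimal sketch: the function names are hypothetical, the half-Cauchy scale of 1.0 is an assumed default that the text does not state, and only the study-specific intercepts are shown in the hierarchical example.

```python
import math

def log_normal_pdf(x, mean=0.0, sd=1.0):
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)

def log_inv_gamma_pdf(x, shape, scale):
    """Inverse gamma log density with shape and scale parameters."""
    if x <= 0:
        return -math.inf
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(x) - scale / x)

def log_half_cauchy_pdf(x, scale=1.0):
    """Half-Cauchy log density on x >= 0 (the scale value is an assumption)."""
    if x < 0:
        return -math.inf
    return math.log(2.0) - math.log(math.pi * scale) - math.log1p((x / scale) ** 2)

def log_prior_intercepts(intercepts, mu, tau):
    """Hierarchical prior: study intercepts share a normal population distribution."""
    if tau <= 0:
        return -math.inf
    lp = log_normal_pdf(mu)            # weakly informative prior on the mean
    lp += log_half_cauchy_pdf(tau)     # weakly informative prior on the scale
    lp += sum(log_normal_pdf(b, mu, tau) for b in intercepts)
    return lp

example = log_prior_intercepts([-0.2, 0.1, 0.4], mu=0.0, tau=0.5)
```

Summing such terms over all parameters, together with the log-likelihood, gives the logarithm of the non-normalized posterior.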

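3. Finally, the parameters are estimated.

According to Bayes' formula, the non-normalized joint posterior distribution of the unknown parameters and the unobserved reference measurements can be obtained as follows:

$$p(\theta, X_{\mathrm{mis}} \mid D) \propto f(D \mid X_{\mathrm{mis}}, \theta)\, f(X_{\mathrm{mis}} \mid \theta)\, p(\theta);$$

wherein $\theta$ denotes the unknown parameters (including the hyperparameters) with prior $p(\theta)$, $X_{\mathrm{mis}}$ denotes the unobserved reference measurements, and $D$ denotes the observed data (outcomes, local measurements, covariates and available reference measurements).

Samples can be drawn from this distribution using a Markov chain Monte Carlo (MCMC) method, such as the No-U-Turn Sampler (NUTS). Based on the MCMC samples, the sample mean can be calculated as the point estimate of each parameter and the 95% highest posterior density interval as the interval estimate, thereby obtaining the posterior distribution of the association between the biomarkers and the cancer disease.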
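The estimation step can be illustrated on a deliberately reduced problem. The sketch below is not NUTS: it swaps in a plain random-walk Metropolis sampler, which is easier to show in a few lines, and fits only a two-parameter logistic model with standard normal priors to toy data. The function names, the step size, the number of draws and the data are all assumptions for illustration.

```python
import math
import random

def log_posterior(beta0, beta_x, xs, ys):
    """Non-normalized log posterior: logistic likelihood plus N(0, 1) priors."""
    lp = -0.5 * (beta0 ** 2 + beta_x ** 2)
    for x, y in zip(xs, ys):
        eta = beta0 + beta_x * x
        # Bernoulli log-likelihood, using a numerically stable log(1 + exp(eta)).
        lp += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def metropolis(xs, ys, n_draws=4000, burn_in=1000, step=0.5, seed=0):
    """Random-walk Metropolis sampler (a simple stand-in for NUTS)."""
    rng = random.Random(seed)
    current = (0.0, 0.0)
    current_lp = log_posterior(current[0], current[1], xs, ys)
    draws = []
    for it in range(n_draws + burn_in):
        proposal = (current[0] + rng.gauss(0.0, step),
                    current[1] + rng.gauss(0.0, step))
        proposal_lp = log_posterior(proposal[0], proposal[1], xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if it >= burn_in:
            draws.append(current)
    return draws

def hpd_interval(draws, prob=0.95):
    """Narrowest interval containing about prob of the sorted draws."""
    ordered = sorted(draws)
    n = len(ordered)
    k = max(1, int(round(prob * n)))  # number of draws inside the interval
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]

# Toy data: larger biomarker values tend to go with disease (y = 1).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

draws = metropolis(xs, ys)
beta_x_draws = [d[1] for d in draws]
point_estimate = sum(beta_x_draws) / len(beta_x_draws)  # posterior mean
lower, upper = hpd_interval(beta_x_draws, 0.95)         # 95% HPD interval
```

The posterior mean plays the role of the point estimate of the log odds ratio and the HPD interval the role of the interval estimate; a gradient-based sampler such as NUTS would be preferable for the full hierarchical model with many latent reference measurements.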

This embodiment uses Bayesian techniques to aggregate biomarker data from multiple study sources, treating the reference measurements of biological specimens that were not re-assayed as unobserved latent variables. A two-level study-biological specimen model is established to describe the relationship among the reference measurements, the local measurements and the outcomes. From this model, the posterior distribution of the biomarker-disease association is estimated, providing a powerful framework for integrating biomarker data for cancer diseases.

Referring to fig. 4, the genomics-based biomarker screening method of the present application is implemented by a genomics-based biomarker screening system; fig. 4 is a functional block diagram of the genomics-based biomarker screening system according to an embodiment of the present application. The system comprises:

A sequencing data module 10 for determining biomarker expression data related to significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed based on functional genomics and clinical genomics.

A posterior distribution module 20 for estimating, by a logistic regression model, the posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes.

An interaction module 30 for determining significant interacting genes in a biological sample of a cancer disease based on the posterior distribution and the inter-gene interaction network.

A marker screening module 40 for screening for a list of biomarkers associated with a cancer disease based on the significant interacting genes.

This embodiment provides a genomics-based biomarker screening system. The system can first perform transcription amplification on a cancer disease biological sample, then acquire transcriptome sequencing data according to the standards of functional genomics and clinical genomics, and screen the data for features related to the cancer disease to obtain biomarker expression data of significant genes closely related to the cancer disease. The biomarker expression data can then be used as input variables, combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between the biomarkers and the cancer disease, yielding the posterior distribution of that association. Next, based on this posterior distribution and in combination with the inter-gene interaction network, genes can be identified whose interaction relationships are particularly prominent and that play a critical role in the development and progression of cancer, which provides an important basis for accurately screening biomarkers related to the cancer disease and helps to discover new biomarkers. Finally, the obtained significant interaction genes can be arranged in order of interaction degree or classified by physiological function, and screened to obtain a list of biomarkers related to the cancer disease.

Because the posterior distribution of the association between biomarkers and the cancer disease and the significant interaction genes are both derived from the transcriptome sequencing data of cancer disease biological samples, missed detection caused by low biomarker specificity in the traditional biomarker screening process is avoided, so that a list of biomarkers highly correlated with the cancer disease can be obtained and the sensitivity of the biomarker screening process is improved.
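The final arrangement step, ordering by interaction degree or grouping by physiological function, can be sketched as follows. The function name, the score values, the category labels and the gene names are hypothetical and serve only to illustrate the two arrangements.

```python
from collections import defaultdict

def build_biomarker_list(scores, categories=None, top_k=None):
    """Rank genes by interaction score; optionally group them by function."""
    # Sort by descending score, breaking ties alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_k is not None:
        ranked = ranked[:top_k]
    if categories is None:
        return [gene for gene, _ in ranked]
    grouped = defaultdict(list)
    for gene, _ in ranked:
        grouped[categories.get(gene, "unclassified")].append(gene)
    return dict(grouped)

# Hypothetical interaction scores and functional categories.
scores = {"GENE_A": 0.91, "GENE_B": 0.42, "GENE_C": 0.77, "GENE_D": 0.77}
categories = {"GENE_A": "cell cycle", "GENE_C": "apoptosis", "GENE_D": "cell cycle"}

by_score = build_biomarker_list(scores, top_k=3)
by_function = build_biomarker_list(scores, categories)
```

Within each functional group the genes remain in descending score order, so both arrangements described above are available from the same ranking.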

Based on the first embodiment of the system of the present application, a second embodiment of the system is presented. In the second embodiment, contents that are the same as or similar to those of the first embodiment can be found in the description above and are not repeated here.

The sequencing data module 10 of this embodiment is further used for obtaining multiple types of cancer disease biological samples.

The sequencing data module 10 is further used for performing transcription amplification on the cancer disease biological samples and sequencing them to obtain transcriptome sequencing data based on functional genomics and clinical genomics.

The sequencing data module 10 is further used for performing cluster screening on the transcriptome sequencing data to obtain biomarker expression data related to significant genes.

In another possible implementation, the sequencing data module 10 of this embodiment is further used for performing preliminary screening on the transcriptome sequencing data according to gene expression levels to obtain gene difference data; for performing an association test on the gene difference data to obtain gene difference data after the association test; for performing feature screening on the gene difference data after the association test by a hierarchical clustering algorithm to obtain a hierarchical association result between genes and cancer diseases; and for determining biomarker expression data related to significant genes according to the hierarchical association result.

The genomics-based biomarker screening system provided by the present application adopts the genomics-based biomarker screening method of the above embodiments, and can solve the technical problem that, in the traditional biomarker screening process, low biomarker specificity and low screening sensitivity easily lead to missed detection and thus to missed or delayed diagnosis of patients. Compared with the prior art, the beneficial effects of the system are the same as those of the method provided by the above embodiments, and the other technical features of the system are the same as those disclosed in the method of the above embodiments, so they are not repeated here.

It should be noted that the above examples are provided only to aid understanding of the present application and do not limit the genomics-based biomarker screening method and system of the present application; further simple variations based on the same technical concept all fall within the scope of protection of the present application.

The foregoing describes only some embodiments of the present application and is not intended to limit its scope; all equivalent structural changes made using the description and accompanying drawings under the technical concept of the present application, or direct or indirect applications in other related technical fields, are included within the scope of protection of the present application.

Claims

1. A method of genomics-based biomarker screening, the method comprising:

2. The method of claim 1, wherein the step of determining biomarker expression data associated with a significant gene from transcriptome sequencing data of a biological sample of a cancer disease comprises:

Obtaining multiple types of cancer disease biological samples;

3. The method of claim 2, wherein the step of cluster screening the transcriptome sequencing data for biomarker expression data associated with a significant gene comprises:

4. The method of claim 1, wherein the step of estimating a posterior distribution of biomarkers associated with cancer disease by a logistic regression model based on biomarker expression data of the significant genes comprises:

5. The method of any one of claims 1 to 4, wherein the step of determining significant interacting genes in a biological sample for a cancer disease based on the posterior distribution and an inter-gene interaction network comprises:

6. The method of any one of claims 1 to 4, wherein the step of screening for a list of biomarkers associated with cancer disease based on the significant interacting genes comprises:

Sorting the importance scores to obtain a score list;

7. The method of claim 6, wherein the step of performing a feature importance assessment on the significant interaction genes to obtain an importance score comprises:

Initializing a gradient boosting decision tree model;

8. A genomics-based biomarker screening system, the system comprising:

9. The system of claim 8, wherein the sequencing data module is further configured to obtain multiple types of cancer disease biological samples;

10. The system of claim 9, wherein the sequencing data module is further configured to perform a preliminary screening of the transcriptome sequencing data based on gene expression levels to obtain gene difference data;