CN119360969A - Biomarker screening method and system based on genomics - Google Patents

Info

Publication number
CN119360969A
Authority
CN
China
Prior art keywords
genes
significant
biomarker
gene
cancer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411946317.4A
Other languages
Chinese (zh)
Inventor
侯艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202411946317.4A
Publication of CN119360969A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioethics (AREA)
  • Physiology (AREA)
  • Epidemiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Biochemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present application discloses a genomics-based biomarker screening method and system, relating to the technical field of biological detection. The method includes: determining biomarker expression data associated with significant genes from transcriptome sequencing data of cancer disease biological samples; estimating, with a logistic regression model, the posterior distribution of the association between biomarkers and the cancer disease; determining significant interacting genes from the posterior distribution and an inter-gene interaction network; and screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease. Because the posterior distribution and the significant interacting genes are both derived from the transcriptome sequencing data of cancer disease biological samples, the method avoids the missed detections that low biomarker specificity causes in conventional biomarker screening, yields a list of biomarkers highly correlated with the cancer disease, and improves the sensitivity of the biomarker screening process.

Description

Biomarker screening method and system based on genomics
Technical Field
The application relates to the technical field of biological detection, in particular to a biomarker screening method and a biomarker screening system based on genomics.
Background
Cancer (malignant tumor) is a major disease threatening human life and health. Its incidence and mortality are currently rising year by year worldwide, and the situation is very serious. Cancer biomarkers have strong disease specificity and are useful for early diagnosis and for evaluating treatment effect. Researchers generally use a variety of techniques to discover biomarkers for cancer diseases, such as immunological, molecular biological, and genomic techniques.
However, several factors lower the sensitivity of the biomarker screening process: biomarkers interact with one another, disease samples differ, biomarker levels differ between individuals, and the expression level of some biomarkers changes little in early or mild disease. At the same time, some biomarkers cross-react with other diseases, which reduces their specificity. As a result, missed detections occur easily, and the patient's diagnosis is missed or delayed.
Disclosure of Invention
The main purpose of the application is to provide a genomics-based biomarker screening method and system, aiming to solve the technical problem that low biomarker specificity and low screening sensitivity in conventional biomarker screening easily lead to missed detections, so that the patient's diagnosis is missed or delayed.
To achieve the above object, the present application provides a genomics-based biomarker screening method, the method comprising:
determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed based on functional genomics and clinical genomics;
estimating, by a logistic regression model and based on the biomarker expression data of the significant genes, a posterior distribution of the association between the biomarker and the cancer disease;
determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and an inter-gene interaction network;
screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.
In one embodiment, the step of determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample comprises:
obtaining multiple types of cancer disease biological samples;
performing transcription amplification on the cancer disease biological samples, and sequencing with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data;
cluster screening the transcriptome sequencing data to obtain the biomarker expression data associated with the significant genes.
In one embodiment, the step of cluster screening the transcriptome sequencing data to obtain biomarker expression data associated with significant genes comprises:
performing a preliminary screen of the transcriptome sequencing data by gene expression level to obtain gene difference data;
performing an association test on the gene difference data to obtain association-tested gene difference data;
performing feature screening on the association-tested gene difference data with a hierarchical clustering algorithm to obtain a hierarchical association result between genes and the cancer disease;
determining the biomarker expression data associated with the significant genes according to the hierarchical association result.
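As a rough illustration of the cluster screening steps just listed, the following Python sketch runs a preliminary expression-level screen, an association test, and a hierarchical clustering pass on toy data. The specification does not name the statistics or cutoffs, so everything specific here is an assumption: the gene names and expression values are hypothetical, the preliminary screen is taken to be a log2 fold change with cutoff 1, the association test is taken to be Welch's t statistic with cutoff 4, and the clustering is average linkage on correlation distance.

```python
import math
from statistics import mean, variance

# Hypothetical toy data: gene -> (tumour-sample expression, normal-sample expression).
EXPRESSION = {
    "GENE_A": ([9.1, 8.7, 9.4, 9.0], [4.0, 4.3, 3.9, 4.1]),
    "GENE_B": ([8.8, 9.2, 8.9, 9.3], [4.2, 4.0, 4.4, 3.8]),
    "GENE_C": ([5.0, 5.2, 4.9, 5.1], [5.1, 4.9, 5.0, 5.2]),
    "GENE_D": ([2.0, 2.3, 1.9, 2.1], [6.0, 6.2, 5.8, 6.1]),
}


def log2_fold_change(tumour, normal):
    """Preliminary screen statistic (assumed): log2 ratio of mean expression."""
    return math.log2(mean(tumour) / mean(normal))


def welch_t(tumour, normal):
    """Association test statistic (assumed): Welch's t for two groups."""
    se = math.sqrt(variance(tumour) / len(tumour) + variance(normal) / len(normal))
    return (mean(tumour) - mean(normal)) / se


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering on correlation distance (1 - Pearson r)."""
    clusters = [[gene] for gene in profiles]

    def distance(c1, c2):
        return mean(1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters


# 1. Preliminary screen by expression level (assumed cutoff: |log2FC| >= 1).
differential = {g: v for g, v in EXPRESSION.items() if abs(log2_fold_change(*v)) >= 1.0}
# 2. Association test on the surviving genes (assumed cutoff: |t| >= 4).
associated = {g: v for g, v in differential.items() if abs(welch_t(*v)) >= 4.0}
# 3. Hierarchical clustering of the surviving profiles (tumour + normal samples).
profiles = {g: t + n for g, (t, n) in associated.items()}
clusters = average_linkage(profiles, n_clusters=2)
# 4. Genes that passed both filters are treated as the significant genes.
significant_genes = sorted(associated)
```

In this toy run the unchanged gene is dropped by the preliminary screen, and the two up-regulated genes fall into one cluster while the down-regulated gene forms another. A production pipeline would more likely rely on an established statistics library and a multiple-testing correction, neither of which is shown here.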
In one embodiment, the step of estimating, by a logistic regression model and based on the biomarker expression data of the significant genes, a posterior distribution of the association between the biomarker and the cancer disease comprises:
describing the association between the biomarker and the cancer disease with a logistic regression model, based on the biomarker expression data of the significant genes;
applying a Bayesian method to the unnormalized association between the biomarker and the cancer disease to obtain the posterior distribution of that association.
In one embodiment, the step of determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and the inter-gene interaction network comprises:
determining the significant interactivity of genes in the cancer disease state and in the normal state based on the posterior distribution and the inter-gene interaction network;
determining the significant interacting genes in the cancer disease biological sample based on the significant interactivity.
In one embodiment, the step of screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease comprises:
performing a feature importance assessment on the significant interacting genes to obtain importance scores;
sorting the importance scores to obtain a score list;
screening the significant interacting genes based on the score list to obtain the biomarker list associated with the cancer disease.
In one embodiment, the step of performing a feature importance assessment on the significant interacting genes to obtain importance scores comprises:
initializing a gradient boosting decision tree model;
training the gradient boosting decision tree model on a cancer gene database;
performing the feature importance assessment on the significant interacting genes with the trained gradient boosting decision tree model to obtain the importance scores.
In addition, to achieve the above object, the present application also provides a genomics-based biomarker screening system, the system comprising:
a sequencing data module for determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed based on functional genomics and clinical genomics;
a posterior distribution module for estimating, by a logistic regression model and based on the biomarker expression data of the significant genes, a posterior distribution of the association between the biomarker and the cancer disease;
an interaction module for determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and an inter-gene interaction network;
a marker screening module for screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.
In one embodiment, the sequencing data module is further configured to obtain multiple types of cancer disease biological samples;
the sequencing data module is further configured to perform transcription amplification on the cancer disease biological samples, and to sequence with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data;
the sequencing data module is further configured to cluster screen the transcriptome sequencing data to obtain the biomarker expression data associated with the significant genes.
In one embodiment, the sequencing data module is further configured to perform a preliminary screen of the transcriptome sequencing data by gene expression level to obtain gene difference data;
the sequencing data module is further configured to perform an association test on the gene difference data to obtain association-tested gene difference data;
the sequencing data module is further configured to perform feature screening on the association-tested gene difference data with a hierarchical clustering algorithm to obtain a hierarchical association result between genes and the cancer disease;
the sequencing data module is further configured to determine the biomarker expression data associated with the significant genes according to the hierarchical association result.
The one or more technical solutions provided by the application have at least the following technical effects. First, biomarker expression data associated with significant genes are determined from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed based on functional genomics and clinical genomics. Next, the posterior distribution of the association between the biomarker and the cancer disease is estimated by a logistic regression model based on the biomarker expression data of the significant genes. Then, significant interacting genes in the cancer disease biological sample are determined from the posterior distribution and the inter-gene interaction network. Finally, a list of biomarkers associated with the cancer disease is obtained by screening based on the significant interacting genes. Because the posterior distribution of the biomarker-disease association and the significant interacting genes are both obtained from the transcriptome sequencing data of the cancer disease biological sample, the application avoids the missed detections that low biomarker specificity causes in conventional biomarker screening, yields a biomarker list highly correlated with the cancer disease, and improves the sensitivity of the biomarker screening process.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of an embodiment of the genomics-based biomarker screening method of the present application;
FIG. 2 is a schematic flow chart of a second embodiment of the genomics-based biomarker screening method of the present application;
FIG. 3 is a schematic flow chart of a third embodiment of the genomics-based biomarker screening method of the present application;
FIG. 4 is a schematic diagram of the functional modules of a genomics-based biomarker screening system according to an embodiment of the present application.
The realization of the objects, the functional features, and the advantages of the present application are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
The method of this embodiment may be executed by a computing service device with transcriptome sequencing, posterior distribution calculation, and marker screening functions, such as a personal computer or a server, or by any electronic device capable of implementing those functions, such as a genomics-based biomarker screening system (hereinafter, the screening system); this embodiment does not limit the execution subject. This embodiment and the following embodiments are described with the screening system as the execution subject.
On this basis, an embodiment of the application provides a genomics-based biomarker screening method. Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of the genomics-based biomarker screening method.
In this embodiment, the genomics-based biomarker screening method includes steps S10 to S40:
Step S10: determining biomarker expression data associated with significant genes from transcriptome sequencing data of the cancer disease biological sample, the transcriptome sequencing data being constructed based on functional genomics and clinical genomics.
A cancer disease biological sample is any of various biological materials related to cancer.
By way of example, a cancer disease biological sample may include tissue specimens (e.g., tumor tissue, paracancerous tissue), blood (e.g., whole blood, plasma, serum), body fluids (e.g., urine, cerebrospinal fluid, pleural effusion), cells (e.g., circulating tumor cells), and the like, obtained from a cancer patient.
Transcriptome sequencing data is data obtained by sequencing the RNA transcribed in a cancer disease biological sample under a specific condition.
For example, the sequenced RNA may include messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and other non-coding RNAs. Transcriptome sequencing allows the expression of genes in cancer disease biological samples to be analyzed comprehensively and quantitatively.
Biomarker expression data is the part of the transcriptome sequencing data covering genes whose expression significantly reflects a biomarker closely associated with the cancer disease.
Functional genomics is the study of gene function, including the regulation of gene expression, the function of gene products, gene-to-gene interactions, and the like.
Clinical genomics is the application of genomic techniques and research results to clinical practice, including the use of genomic information to diagnose disease, predict disease prognosis, and so on.
Constructing the transcriptome sequencing data according to functional genomics and clinical genomics standards gives a better understanding of the molecular mechanisms of cancer diseases and of individual differences, supports the medical application of cancer biomarker screening, and allows biomarkers to be screened more accurately and effectively.
In one embodiment, the cancer disease biological sample is first subjected to transcription amplification, and transcriptome sequencing data are then obtained according to functional genomics and clinical genomics standards. The screening system screens these data for features relevant to the cancer disease to obtain biomarker expression data of the significant genes closely associated with the cancer disease.
Step S20: estimating, by a logistic regression model and based on the biomarker expression data of the significant genes, the posterior distribution of the association between the biomarker and the cancer disease.
It should be noted that a logistic regression model is a statistical model for binary (or multi-class) classification problems. Describing the biomarker-disease relationship with a logistic regression model makes it possible to determine the association between the biomarker and the cancer disease.
Biomarkers are biochemical molecules that can mark changes in the structure of systems, organs, tissues, cells, subcellular components, and genes. Biomarkers help diagnose cancer diseases more accurately.
It should be noted that the posterior distribution is a probability distribution over the association between a biomarker and the cancer disease. The posterior distribution gives a deeper understanding of the role of a biomarker in the development of cancer and of the uncertainty in that role.
In one embodiment, the biomarker expression data are used as input variables and combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between the biomarker and the cancer disease, yielding the posterior distribution of that association.
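One way to picture a posterior distribution for a logistic-regression association is a grid approximation: evaluate the unnormalized posterior (prior times likelihood) on a grid of coefficient values, then normalize. The sketch below does this for a single biomarker. It is only a minimal illustration under stated assumptions, not the application's procedure: the single-coefficient model without intercept, the Gaussian prior with standard deviation 2, the grid range, and the toy data (standardized expression values and cancer/normal labels) are all hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def grid_posterior(x, y, grid, prior_sd=2.0):
    """Posterior over the slope beta of P(y=1 | x) = sigmoid(beta * x).

    Returns one probability per grid point; the values sum to 1.
    """
    log_post = []
    for beta in grid:
        # Unnormalized log posterior = log prior + log likelihood.
        log_prior = -0.5 * (beta / prior_sd) ** 2
        log_lik = sum(
            log_sigmoid(beta * xi if yi == 1 else -beta * xi)
            for xi, yi in zip(x, y)
        )
        log_post.append(log_prior + log_lik)
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data: standardized expression of one biomarker, and labels
# (1 = cancer sample, 0 = normal sample).
expression = [-1.5, -1.0, -0.5, -0.2, 0.3, 0.8, 1.2, 1.6]
has_cancer = [0, 0, 0, 1, 0, 1, 1, 1]

beta_grid = [i / 10 for i in range(-50, 51)]
posterior = grid_posterior(expression, has_cancer, beta_grid)

# Summaries of the association and its uncertainty.
posterior_mean = sum(b * p for b, p in zip(beta_grid, posterior))
prob_positive = sum(p for b, p in zip(beta_grid, posterior) if b > 0)
```

Here `prob_positive` is the posterior probability that higher expression goes with the cancer label, a single number that a later step could use as a per-gene association strength. With many genes, a sampling or variational method would replace the grid.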
Step S30: determining significant interacting genes in the cancer disease biological sample according to the posterior distribution and the inter-gene interaction network.
The inter-gene interaction network is a complex network structure formed by the various interaction relationships among multiple genes.
In organisms, genes do not function in isolation. The expression and function of one gene may be affected by other genes, and several interaction modes may exist, such as synergy and antagonism. These interactions interweave to form a vast network.
It should be noted that a significant interacting gene is a gene whose interaction relationships are particularly prominent in cancer disease biological samples and which may play a key role in the development and progression of cancer.
For example, in a cancer sample the interactions of certain genes may change significantly, such as synergy or antagonism between them strengthening or weakening in the cancer state; such genes can be considered to have significant interactions in the cancer. The inter-gene interaction network gives a better understanding of how genes interact, revealing the association and synergy patterns among genes that play a key role in the development and progression of cancer.
In one embodiment, genes whose interaction relationships are particularly prominent and that are critical to the development and progression of cancer are found from the posterior distribution of the biomarker-disease association in combination with the inter-gene interaction network. This provides an important basis for accurately screening biomarkers associated with the cancer disease and helps discover new biomarkers.
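The text does not give a formula for combining the posterior distribution with the interaction network, so the following Python sketch shows just one plausible scoring scheme, with every specific being an assumption. Each network edge carries a hypothetical co-expression correlation in the cancer state and in the normal state; the change between the two is weighted by the posterior association probabilities of both genes, and the weighted changes are summed per gene. The gene names, probabilities, correlations, and the 0.5 threshold are illustrative only.

```python
# Hypothetical posterior probability that each gene is associated with the disease.
POSTERIOR = {"GENE_A": 0.98, "GENE_B": 0.95, "GENE_D": 0.90, "GENE_E": 0.20}

# Hypothetical interaction network: edge -> (correlation in cancer, correlation in normal).
EDGES = {
    ("GENE_A", "GENE_B"): (0.9, 0.1),
    ("GENE_A", "GENE_D"): (-0.7, 0.0),
    ("GENE_B", "GENE_E"): (0.3, 0.25),
    ("GENE_D", "GENE_E"): (0.1, 0.1),
}


def interaction_scores(posterior, edges):
    """Sum, per gene, the posterior-weighted change in interaction strength."""
    scores = {gene: 0.0 for gene in posterior}
    for (g1, g2), (r_cancer, r_normal) in edges.items():
        change = abs(r_cancer - r_normal)  # how much the interaction differs in cancer
        weight = change * posterior[g1] * posterior[g2]
        scores[g1] += weight
        scores[g2] += weight
    return scores


def significant_interacting_genes(scores, threshold=0.5):
    """Keep genes whose score reaches the (assumed) threshold."""
    return sorted(gene for gene, score in scores.items() if score >= threshold)


scores = interaction_scores(POSTERIOR, EDGES)
significant = significant_interacting_genes(scores)
```

Under these toy numbers, a gene with a weak posterior association or with unchanged interactions scores near zero and is filtered out, while genes whose synergy or antagonism shifts strongly in the cancer state are retained.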
Step S40: screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.
It should be noted that the biomarker list is a list of biomarkers related to the cancer disease. The biomarker list gives a better understanding of the association between biomarkers and the cancer disease and facilitates diagnosis.
In one embodiment, the significant interacting genes obtained above are arranged by degree of interaction, or grouped by physiological function, and then screened to obtain the biomarker list associated with the cancer disease.
In this embodiment, the biological sample of the cancer disease may be subjected to transcription amplification first, then transcriptome sequencing data may be obtained according to the criteria of functional genomics and clinical genomics, and the screening system may perform screening for relevant characteristics of the cancer disease based on these data, so as to obtain biomarker expression data of significant genes closely associated with the cancer disease. The biomarker expression data can then be used as input variables to combine information related to the cancer disease, and the probability description of the degree of association of the biomarker with the cancer disease is performed through a logistic regression model to obtain posterior distribution of the association of the biomarker with the cancer disease. Genes with particularly prominent interaction relationships and critical effects on the development and progression of cancer can then be found based on the posterior distribution of biomarkers associated with cancer disease, in combination with the inter-gene interaction network. Thereby providing important basis for accurately screening and obtaining the biomarker related to the cancer disease and helping to discover a new biomarker. Finally, the obtained significant interaction genes can be arranged according to the sequence of the interaction degree or the classification of different physiological functions, and a biomarker list related to cancer diseases is obtained by screening.
In a possible implementation, step S40 of this embodiment may include: performing a feature importance assessment on the significant interacting genes to obtain importance scores; sorting the importance scores to obtain a score list; and screening the significant interacting genes based on the score list to obtain the biomarker list associated with the cancer disease.
An importance score is a numerical value expressing how important a significant interacting gene is, assessed by its degree of association with the cancer disease.
In this embodiment, the feature importance of the significant interacting genes is assessed by their degree of association with the cancer disease to obtain importance scores, and the importance scores are then sorted to obtain a score list. Finally, the biological molecules corresponding to the significant interacting genes at the top of the score list (for example, the top ten or the top few dozen) are taken as biomarkers, giving the biomarker list associated with the cancer disease. Taking the higher-scoring genes as biomarkers helps focus on key genes.
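The sort-and-select step described here is simple enough to sketch directly. In the Python snippet below, the gene names, the scores, and the cutoff `k` are hypothetical placeholders; the text itself only suggests keeping roughly the top ten or the top few dozen entries.

```python
def top_biomarkers(importance_scores, k):
    """Sort genes by importance score (highest first) and keep the top k."""
    ranked = sorted(importance_scores.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


# Hypothetical importance scores for four significant interacting genes.
importance_scores = {"GENE_D": 0.12, "GENE_A": 0.52, "GENE_E": 0.05, "GENE_B": 0.31}
biomarker_list = top_biomarkers(importance_scores, k=2)
```

A fixed `k` is the simplest choice; a score threshold would work the same way with a filter in place of the slice.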
In a possible implementation, the step of performing a feature importance assessment on the significant interacting genes to obtain importance scores includes: initializing a gradient boosting decision tree model; training the gradient boosting decision tree model on a cancer gene database; and performing the feature importance assessment on the significant interacting genes with the trained gradient boosting decision tree model to obtain the importance scores.
It should be noted that the gradient boosting decision tree model is a model that combines multiple decision trees through ensemble learning to predict and assess the feature importance of the significant interacting genes. During construction, different decision trees can be generated through random feature selection, sample sampling, and similar means; together these decision trees form a forest, which can assess and classify the significant interacting genes.
The cancer gene database is a database that collects, organizes, and stores gene information related to cancer, such as gene sequences, mutation status, expression levels, associations with the development and progression of cancer, and related clinical information.
In this embodiment, the gradient boosting decision tree model is initialized first. The model is then trained on the cancer gene database so that it learns the information of various cancer-related genes. Finally, the trained gradient boosting decision tree model is used to predict and assess the feature importance of the significant interacting genes, giving the importance scores. The cancer gene database thus provides an important basis for finding biomarkers associated with the cancer disease and improves the accuracy of the biomarker list.
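To make the initialize, train, and assess sequence concrete, here is a deliberately small gradient-boosted decision tree (GBDT) written from scratch in Python. It is a sketch under several assumptions rather than the model the application uses: the trees are depth-1 stumps, the loss is squared error on 0/1 labels, importance is the share of total error reduction credited to each feature, and the training matrix standing in for a cancer gene database is hypothetical toy data.

```python
from statistics import mean

# Hypothetical training data: one row per sample, one column per gene.
FEATURES = ["GENE_A", "GENE_B", "GENE_E"]
X = [
    [9.1, 8.8, 5.0], [8.7, 4.1, 5.2], [9.4, 9.2, 4.9], [9.0, 8.9, 5.3],  # cancer
    [4.0, 4.2, 5.1], [4.3, 9.0, 5.0], [3.9, 4.4, 5.2], [4.1, 3.8, 4.8],  # normal
]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def best_stump(rows, residuals):
    """Find the depth-1 split with the largest squared-error reduction.

    Returns (gain, feature_index, threshold, left_value, right_value).
    """
    overall = mean(residuals)
    base_sse = sum((r - overall) ** 2 for r in residuals)
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[f] <= thr]
            right = [r for row, r in zip(rows, residuals) if row[f] > thr]
            lv, rv = mean(left), mean(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            gain = base_sse - sse
            if best is None or gain > best[0]:
                best = (gain, f, thr, lv, rv)
    return best


def gbdt_importance(rows, labels, n_trees=20, learning_rate=0.3):
    """Fit boosted stumps and return normalized per-feature importance."""
    pred = [mean(labels)] * len(labels)  # initialize with the constant model
    importance = [0.0] * len(rows[0])
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(labels, pred)]
        gain, f, thr, lv, rv = best_stump(rows, residuals)
        importance[f] += gain
        pred = [p + learning_rate * (lv if row[f] <= thr else rv)
                for p, row in zip(pred, rows)]
    total = sum(importance)
    return [imp / total for imp in importance]


importance = gbdt_importance(X, y)
ranked = sorted(zip(FEATURES, importance), key=lambda pair: pair[1], reverse=True)
```

In this toy data one gene separates the two classes cleanly, so it collects essentially all of the importance. In practice an established gradient boosting library would normally be used in place of this hand-written loop, and the importance scores it reports would feed the sorting step.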
The embodiment provides a biomarker screening method based on genomics, which can be used for carrying out transcription amplification on a biological sample of cancer diseases, acquiring transcriptome sequencing data according to the standards of functional genomics and clinical genomics, and screening relevant characteristics of the cancer diseases according to the data by a screening system to obtain biomarker expression data of significant genes closely related to the cancer diseases. The biomarker expression data can then be used as input variables to combine information related to the cancer disease, and the probability description of the degree of association of the biomarker with the cancer disease is performed through a logistic regression model to obtain posterior distribution of the association of the biomarker with the cancer disease. Genes with particularly prominent interaction relationships and critical effects on the development and progression of cancer can then be found based on the posterior distribution of biomarkers associated with cancer disease, in combination with the inter-gene interaction network. Thereby providing important basis for accurately screening and obtaining the biomarker related to the cancer disease and helping to discover a new biomarker. Finally, the obtained significant interaction genes can be arranged according to the sequence of the interaction degree or the classification of different physiological functions, and a biomarker list related to cancer diseases is obtained by screening. 
Because the transcriptome sequencing data of the cancer disease biological sample is used for screening the posterior distribution and the significant interaction genes related to the biomarker and the cancer disease, the condition that detection is missed due to low biomarker specificity in the traditional biomarker screening process is avoided, so that a biomarker list with high correlation with the cancer disease can be obtained, and the sensitivity of the biomarker screening process is improved.
Second embodiment: the second embodiment of the present application is based on the first embodiment. Content that is the same as or similar to the first embodiment can be found in the description above and is not repeated here. On this basis, referring to fig. 1 and fig. 2, fig. 2 is a schematic flow chart of the second embodiment of the genomics-based biomarker screening method of the present application.
Step S10 of this embodiment includes steps S11 to S13:
Step S11: acquiring biological samples of multiple types of cancer disease.
Step S12: performing transcription amplification on the cancer disease biological samples, and sequencing them with functional genomics and clinical genomics as references to obtain transcriptome sequencing data.
It should be noted that transcription amplification increases the number of RNA transcripts in a sample by specific technical means. The amplified sample is then sequenced to determine the sequence information of its RNA molecules, and the sequence information is screened with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data.
Step S13: performing cluster screening on the transcriptome sequencing data to obtain biomarker expression data related to significant genes.
In this embodiment, a biological sample involved in a cancer disease (for example, cancer cells or cancer tissue) is subjected to transcription amplification. The amplified sample is then sequenced to determine the sequence information of the RNA molecules in it, and the result is screened with functional genomics and clinical genomics as references to obtain transcriptome sequencing data. Applying the standards of functional genomics and clinical genomics gives a better understanding of the molecular mechanisms of cancer diseases and of individual differences, and supports the medical application of cancer biomarker screening, so that biomarkers can be screened more accurately and effectively.
In a possible implementation, step S13 of this embodiment can include the following steps: performing preliminary screening on the transcriptome sequencing data according to gene expression level to obtain gene difference data; performing an association test on the gene difference data to obtain association-tested gene difference data; performing feature screening on the association-tested gene difference data by a hierarchical clustering algorithm to obtain a hierarchical association result between genes and cancer diseases; and determining biomarker expression data related to significant genes according to the hierarchical association result.
It should be noted that the gene difference data are the data obtained by preliminary differential screening based on gene expression levels (e.g., comparing gene expression levels in diseased and healthy samples).
Specifically, the expression levels of each gene under different conditions (such as normal tissue and cancer tissue) are compared, and the genes whose expression levels change significantly (increase or decrease) are selected; these genes constitute the gene difference data.
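As an illustration only, the preliminary differential screening can be sketched as a log2 fold-change filter. The description does not fix a statistic or a threshold, so the pseudocount, the cut-off `min_abs_lfc`, the gene names and the expression values below are all assumptions made for this sketch.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean expression (tumor vs. normal); the pseudocount avoids log(0)."""
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

def preliminary_screen(expression, min_abs_lfc=1.0):
    """Keep genes whose |log2 fold change| reaches the (assumed) threshold."""
    selected = {}
    for gene, (tumor, normal) in expression.items():
        lfc = log2_fold_change(tumor, normal)
        if abs(lfc) >= min_abs_lfc:
            selected[gene] = lfc
    return selected

# Hypothetical toy data: gene -> (expression in cancer tissue, expression in normal tissue)
expression = {
    "GENE_A": ([120.0, 135.0, 150.0], [30.0, 28.0, 35.0]),  # increased in cancer
    "GENE_B": ([50.0, 52.0, 49.0], [51.0, 50.0, 48.0]),     # essentially unchanged
    "GENE_C": ([5.0, 7.0, 6.0], [40.0, 45.0, 38.0]),        # decreased in cancer
}
gene_difference_data = preliminary_screen(expression)
```

In this toy input, the increased and the decreased gene are retained and the unchanged gene is dropped; a real pipeline would normally also account for sequencing depth and replicate variance.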
The association test is a test, applied after the preliminary differential screening, of whether the screened data meet the criterion of being significantly associated with the cancer disease.
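The description does not name a particular association test. One possible choice, shown here purely as an assumption, is a two-sided permutation test on the difference in mean expression between cancer and normal samples; the function name, the number of permutations and the sample values are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(tumor, normal, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean expression."""
    rng = random.Random(seed)
    observed = abs(mean(tumor) - mean(normal))
    pooled = list(tumor) + list(normal)
    n_tumor = len(tumor)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the cancer/normal labels
        diff = abs(mean(pooled[:n_tumor]) - mean(pooled[n_tumor:]))
        if diff >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression values for two genes
p_associated = permutation_p_value([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6])
p_unrelated = permutation_p_value([5, 6, 7, 8], [5, 6, 7, 8])
```

A gene would pass the association test when its p-value falls below a chosen significance level; with many genes, a multiple-testing correction would normally be applied as well.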
It should be noted that the hierarchical clustering algorithm is a cluster analysis method whose basic idea is to merge or split samples or data points step by step to form a hierarchical clustering result. The hierarchical clustering algorithm further aggregates, level by level, the association-tested gene difference data that are strongly associated with the cancer disease state and constructs a hierarchical clustering tree, which intuitively displays the hierarchical relationship between samples and genes in those data and yields the hierarchical association result.
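The bottom-up (agglomerative) variant of hierarchical clustering can be sketched as follows. The distance measure (Euclidean), the linkage rule (average linkage) and the expression profiles are assumptions chosen for illustration; the description does not specify them.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(profiles[g], profiles[h]) for g in c1 for h in c2]
    return sum(dists) / len(dists)

def agglomerative_clustering(profiles):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(gene,) for gene in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical log fold-change profiles of four genes across three sample groups
profiles = {
    "GENE_A": [2.1, 2.0, 1.9],
    "GENE_B": [2.0, 2.2, 1.8],
    "GENE_C": [-2.5, -2.4, -2.6],
    "GENE_D": [-2.3, -2.6, -2.5],
}
merges = agglomerative_clustering(profiles)
```

The merge history is the tree structure: early merges join the most similar genes, and the final merge joins the two main branches (here the up-regulated and the down-regulated genes).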
In this embodiment, the transcriptome sequencing data are first subjected to preliminary differential screening based on gene expression levels (e.g., comparing gene expression levels in diseased and healthy samples) to obtain the gene difference data. An association test is then applied to the screened data, using the criterion of significant association with the cancer disease, to obtain the association-tested gene difference data. Finally, feature screening is performed on the association-tested gene difference data by a hierarchical clustering algorithm: the data strongly associated with the cancer disease state are aggregated level by level, a hierarchical clustering tree is constructed that intuitively displays the hierarchical relationship between samples and genes, and the resulting hierarchical association result is used to determine the biomarker expression data related to significant genes. The hierarchical clustering algorithm thus gives a better understanding of the distribution and structure of the hierarchical relationship between cancer samples and genes, which improves the accuracy of the biomarker expression data obtained.
In another possible implementation, step S30 of this embodiment may include the following steps: determining, based on the posterior distribution and the inter-gene interaction network, the significant interactivity between genes in the cancer disease state and in the normal state; and determining the significant interacting genes in the cancer disease biological sample based on the significant interactivity.
It should be noted that significant interactivity is the significant correlation that genes show with one another in a disease state such as cancer and in the normal state.
Specifically, it is determined how the relationships between genes in cancer differ from those under normal conditions. For example, some genes may cooperate with or restrain one another moderately under normal conditions, whereas in a cancerous state their interactions may become abnormally active or suppressed; such a significant change in interactivity may play a critical role in the development of cancer. Determining this significant interactivity allows the molecular mechanisms of cancer to be understood in depth and provides an important basis for biomarker screening.
In this embodiment, the significant interactivity between genes in the cancer disease state and in the normal state is determined based on the posterior distribution of the biomarker-disease association, combined with the inter-gene interaction network. Based on this significant interactivity, genes whose interactions are particularly prominent and which play a key role in the occurrence and development of cancer are then identified.
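The description does not state how significant interactivity is computed. A minimal sketch, under the assumption that it is measured as the change in Pearson correlation of a gene pair between cancer and normal samples, and that only genes with a high posterior probability of disease association are considered, could look like this. The thresholds `min_prob` and `min_change`, the gene names and all values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def significant_interactions(edges, cancer, normal, posterior_prob,
                             min_prob=0.9, min_change=0.8):
    """Keep network edges whose correlation differs strongly between the two states."""
    selected = {}
    for g, h in edges:
        # Assumed filter: both genes must be supported by the posterior distribution.
        if posterior_prob[g] < min_prob or posterior_prob[h] < min_prob:
            continue
        change = pearson(cancer[g], cancer[h]) - pearson(normal[g], normal[h])
        if abs(change) >= min_change:
            selected[(g, h)] = change
    return selected

# Hypothetical expression profiles, network edges and posterior probabilities
cancer = {"G1": [1, 2, 3, 4, 5], "G2": [2, 4, 6, 8, 10],
          "G3": [5, 3, 4, 1, 2], "G4": [1, 1, 2, 2, 3]}
normal = {"G1": [1, 2, 3, 4, 5], "G2": [5, 1, 4, 2, 3],
          "G3": [5, 3, 4, 1, 2], "G4": [3, 2, 2, 1, 1]}
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G4")]
posterior_prob = {"G1": 0.99, "G2": 0.97, "G3": 0.95, "G4": 0.40}
interactions = significant_interactions(edges, cancer, normal, posterior_prob)
```

Here only the pair whose correlation changes markedly between the two states is retained; the pair with an unchanged correlation and the pair involving a weakly supported gene are discarded.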
In summary, this embodiment subjects a biological sample involved in a cancer disease (such as cancer cells or cancer tissue) to transcription amplification, sequences the amplified sample to determine the sequence information of its RNA molecules, and screens the result with functional genomics and clinical genomics as references to obtain transcriptome sequencing data. Applying the standards of functional genomics and clinical genomics gives a better understanding of the molecular mechanisms of cancer diseases and of individual differences, and supports the medical application of cancer biomarker screening, so that biomarkers can be screened more accurately and effectively. Further, the significant interactivity between genes in the cancer disease state and in the normal state is determined based on the posterior distribution of the biomarker-disease association, combined with the inter-gene interaction network, and genes whose interactions are particularly prominent and which play a key role in the occurrence and development of cancer are identified from it. Determining the significant interactivity thus allows the molecular mechanisms of cancer to be understood in depth and provides an important basis for biomarker screening.
Third embodiment: in the third embodiment of the present application, content that is the same as or similar to the first and second embodiments can be found in the description above and is not repeated here. On this basis, referring to fig. 3, fig. 3 is a schematic flow chart of the third embodiment of the genomics-based biomarker screening method of the present application.
Step S20 of this embodiment includes steps S21 to S22:
Step S21: based on the biomarker expression data of the significant genes, describing the association of the biomarkers with the cancer disease by a logistic regression model.
Step S22: constructing, by a Bayesian method, the non-normalized joint posterior of the biomarker-disease association, thereby obtaining the posterior distribution of the association between the biomarkers and the cancer disease.
It should be noted that the Bayesian method is a statistical inference method based on Bayes' theorem. Bayes' theorem describes how, given some prior information, the estimate of an event's probability is updated in light of newly observed data. That is, the posterior distribution of the biomarker-disease association is calculated from prior information about that association together with the observed data.
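For reference, Bayes' theorem for a generic parameter can be written as follows. The symbols are introduced here for illustration only: $\theta$ stands for the unknown quantities (for example, the strength of the biomarker-disease association), $D$ for the observed data, $\pi(\theta)$ for the prior and $f(D \mid \theta)$ for the likelihood.

$$\pi(\theta \mid D) = \frac{f(D \mid \theta)\,\pi(\theta)}{\int f(D \mid \theta')\,\pi(\theta')\,d\theta'} \;\propto\; f(D \mid \theta)\,\pi(\theta)$$

The denominator does not depend on $\theta$, which is why a "non-normalized" posterior, i.e. the product of likelihood and prior, is sufficient for sampling-based inference.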
Specifically, for ease of understanding, the processing procedure is as follows:
1. First, a model of the binary outcome is built.
Let $i$ index the studies contributing to the pooled analysis (i.e., the sources of biomarker expression data). For individual $j$ in study $i$, $Y_{ij}$ denotes the binary disease outcome, $X_{ij}$ denotes the biomarker exposure measurement from a reference laboratory (the reference measurement), $W_{ij}$ denotes the biomarker measurement from a study-specific local laboratory (the local measurement), and $Z_{ij}$ denotes the vector of other covariates. Assume that the local measurement $W_{ij}$ is available for all samples in a study, but that only a portion of the samples have a reference measurement. For individual $j$ in study $i$, $H_{ij} = 1$ indicates that the reference measurement is available; otherwise $H_{ij} = 0$.
Assume that the distribution of the reference measurement $X_{ij}$ is consistent across studies, while the distribution of the local measurement $W_{ij}$ is heterogeneous across studies. In addition, the outcome $Y_{ij}$ and the local measurement $W_{ij}$ are assumed to be conditionally independent given the reference measurement $X_{ij}$, which means:
$$P(Y_{ij} \mid X_{ij}, W_{ij}, Z_{ij}) = P(Y_{ij} \mid X_{ij}, Z_{ij});$$
For convenience, it is assumed that the covariates have no effect on the study-specific measurement bias, i.e., $f(W_{ij} \mid X_{ij}, Z_{ij}) = f(W_{ij} \mid X_{ij})$. The likelihood function containing all samples can then be expressed as follows:
$$L(\theta) = \prod_{i} \prod_{j=1}^{n_i} f(Y_{ij}, W_{ij}, X_{ij} \mid Z_{ij})^{H_{ij}} \, f(Y_{ij}, W_{ij} \mid Z_{ij})^{1 - H_{ij}};$$
wherein,
$$f(Y_{ij}, W_{ij} \mid Z_{ij}) = \int f(Y_{ij}, W_{ij}, x \mid Z_{ij}) \, dx;$$
Here $\theta$ denotes the set of all parameters, described in turn below, and $n_i$ is the total number of samples in study $i$. Using the conditional independence assumption, the following factorization can be derived:
$$f(Y_{ij}, W_{ij}, X_{ij} \mid Z_{ij}) = f(Y_{ij} \mid X_{ij}, Z_{ij}) \, f(W_{ij} \mid X_{ij}) \, f(X_{ij});$$
It can be observed that the likelihood function comprises three components, namely the biomarker-disease association $f(Y_{ij} \mid X_{ij}, Z_{ij})$, the reference-local measurement correlation $f(W_{ij} \mid X_{ij})$, and the reference prior $f(X_{ij})$.
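A logistic regression model with a random intercept term is used to describe the association of the biomarker with the disease, as follows:
$$\operatorname{logit} P(Y_{ij} = 1 \mid X_{ij}, Z_{ij}) = \beta_{0i} + \beta_x X_{ij} + \beta_z^{\top} Z_{ij};$$
wherein $Y_{ij}$, $X_{ij}$ and $Z_{ij}$ are the binary disease outcome, the reference measurement and the covariate vector of individual $j$ in study $i$, $\beta_{0i}$ is the study-specific intercept, and $\operatorname{logit}(\cdot)$ is the inverse of the logistic function. The primary purpose is to estimate $\beta_x$, the logarithm of the odds ratio (OR), which describes the relationship between the biomarker and the disease.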
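The role of the biomarker coefficient as a log odds ratio can be checked numerically with a short sketch. The function and parameter names (`intercept_i`, `beta_x`, `beta_z`) and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def expit(t):
    """Logistic function; logit is its inverse."""
    return 1.0 / (1.0 + math.exp(-t))

def disease_probability(x, z, intercept_i, beta_x, beta_z):
    """P(Y = 1) under a logistic model with a study-specific intercept."""
    eta = intercept_i + beta_x * x + sum(b * v for b, v in zip(beta_z, z))
    return expit(eta)

def odds(p):
    return p / (1.0 - p)

# Hypothetical values: one covariate, study-specific intercept -1.0, log OR 0.7
beta_x = 0.7
p_low = disease_probability(1.0, [0.5], -1.0, beta_x, [0.2])
p_high = disease_probability(2.0, [0.5], -1.0, beta_x, [0.2])
odds_ratio = odds(p_high) / odds(p_low)  # one-unit increase in the biomarker
```

Whatever the study-specific intercept and covariates are, a one-unit increase in the biomarker multiplies the odds of disease by `exp(beta_x)`, which is why the intercept can vary between studies without changing the quantity of interest.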
The model used to describe the reference-local measurement correlation is called the calibration model. Assume that there is a linear relationship between the reference measurement $X_{ij}$ and the local measurement $W_{ij}$, and that both $W_{ij}$ given $X_{ij}$ and $X_{ij}$ itself are normally distributed; then:
$$W_{ij} \mid X_{ij} \sim N(a_i + b_i X_{ij}, \sigma_i^2);$$
and
$$X_{ij} \sim N(\mu_x, \sigma_x^2);$$
The log odds ratio $\beta_x$, the covariate coefficients $\beta_z$, the study-specific intercepts $\beta_{0i}$, and $a_i$, $b_i$, $\sigma_i^2$, $\mu_x$ and $\sigma_x^2$ can be regarded as the elements of the parameter set $\theta$. The likelihood function can now be fully expressed.
Since the likelihood function involves the unobserved reference measurements $X_{ij}$, maximum likelihood estimation introduces challenging integral calculations. Furthermore, the hierarchical study-biological specimen structure would not be included in the parameter estimation. A Bayesian approach can therefore be chosen to obtain the posterior distribution of $\theta$. In the Bayesian method, each unobserved $X_{ij}$ is treated as a latent variable and becomes a quantity to be estimated once an appropriate prior distribution is constructed. Likewise, the hierarchy can be reflected in the priors of the study-specific parameters such as $\beta_{0i}$, which improves estimation efficiency.
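To make the data structure concrete, the following sketch simulates one study under a linear-normal calibration model in which only part of the samples have a reference measurement. All parameter values, the field names `W`, `X`, `H` and the fraction of samples with a reference measurement are assumptions for illustration; note that the code takes standard deviations rather than variances.

```python
import random

def simulate_study(n, a_i, b_i, sigma_i, mu_x=0.0, sigma_x=1.0,
                   frac_reference=0.3, seed=0):
    """Simulate local/reference measurements for one study (illustrative values only)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = rng.gauss(mu_x, sigma_x)            # reference measurement
        w = rng.gauss(a_i + b_i * x, sigma_i)   # local measurement given the reference
        h = 1 if rng.random() < frac_reference else 0  # 1: reference measurement available
        # When h == 0 the reference value is unobserved and acts as a latent variable.
        records.append({"W": w, "X": x if h else None, "H": h})
    return records

study = simulate_study(n=200, a_i=0.5, b_i=1.2, sigma_i=0.3)
n_with_reference = sum(r["H"] for r in study)
```

Records with `H == 0` carry only the local measurement; in the Bayesian treatment their missing reference values are sampled together with the model parameters instead of being integrated out analytically.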
2. Next, the prior distributions are specified.
For simplicity, assume that the joint prior distribution of the parameters and the reference measurements factorizes and is independent of the covariates, as follows:
;
Since the reference measurements are assumed to have the same distribution in every study, all reference measurements $X_{ij}$ should be identically distributed. Thus $X_{ij}$, its mean $\mu_x$ and its variance $\sigma_x^2$ are not independent. Under the normality assumption, the prior distribution of these three quantities can be expressed as:
$$\pi(X, \mu_x, \sigma_x^2) = \pi(\mu_x, \sigma_x^2) \prod_{i} \prod_{j} N(X_{ij} \mid \mu_x, \sigma_x^2);$$
The priors of $\mu_x$ and $\sigma_x^2$ may be set based on reference measurements in the actual scenario, or conjugate priors (normal and inverse gamma distributions) may be used. In this study, a non-informative prior is placed on them, specifically:
;
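For $\beta_x$ and $\beta_z$ (the coefficients of the biomarker and of the covariates), informative or weakly informative priors are suggested in order to remain conservative and avoid severe instability. For example, standard normal distributions can be used, as follows:
$$\beta_x \sim N(0, 1), \quad \beta_z \sim N_p(0, I_p);$$
wherein $I_p$ is the $p \times p$ identity matrix and $p$ is the number of covariates.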
For the study-specific parameters $\beta_{0i}$, $a_i$, $b_i$ and $\sigma_i^2$, it is assumed that the parameter vector of each study is sampled from a common distribution. This corresponds to a "random effects" model. In the Bayesian framework, however, it is not necessary to assume that the studies are sampled from a superpopulation. That assumption is replaced by the qualitative assumption of "exchangeability", which means that there is no reason to believe that the studies differ systematically. Here the simplest assumption is retained: each study-specific parameter is an independent draw from a population distribution governed by unknown hyperparameters. For the real-valued parameters $\beta_{0i}$, $a_i$ and $b_i$, the population distribution is a normal distribution, and for the variance $\sigma_i^2$ it is an inverse gamma distribution, as follows:
$$\sigma_i^2 \sim \mathrm{IG}(\kappa, \lambda);$$
wherein the shape $\kappa$ and the scale $\lambda$ are the corresponding hyperparameters, which are also considered part of the unknown parameters, and $x \sim \mathrm{IG}(\kappa, \lambda)$ indicates that $x$ follows the inverse gamma distribution with density:
$$f(x \mid \kappa, \lambda) = \frac{\lambda^{\kappa}}{\Gamma(\kappa)} \, x^{-\kappa - 1} \exp\!\left(-\frac{\lambda}{x}\right), \quad x > 0;$$
As with $\beta_x$ and $\beta_z$, suitable weakly informative priors are suggested for these hyperparameters: a standard normal distribution may be employed for hyperparameters defined on the real line, and a half-Cauchy distribution may be used for hyperparameters restricted to positive values.
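3. Finally, the parameters are estimated.
According to Bayes' formula, the non-normalized joint posterior distribution of the unknown parameters $\theta$ and the unobserved reference measurements can be obtained as follows:
$$\pi(\theta, X^{\mathrm{mis}} \mid Y, W, X^{\mathrm{obs}}, Z) \propto \pi(\theta) \prod_{i} \prod_{j} f(Y_{ij} \mid X_{ij}, Z_{ij}) \, f(W_{ij} \mid X_{ij}) \, f(X_{ij});$$
wherein $X^{\mathrm{obs}}$ and $X^{\mathrm{mis}}$ denote the observed and unobserved reference measurements, $\pi(\theta)$ is the prior distribution of the parameters, and $Y$, $W$ and $Z$ collect the disease outcomes, local measurements and covariates of all samples.
Samples can be drawn from this distribution using a Markov chain Monte Carlo (MCMC) method, such as NUTS (the No-U-Turn Sampler). From the MCMC draws, the sample mean can be calculated as the point estimate of each parameter and the 95% highest posterior density interval as the interval estimate, yielding the posterior distribution of the association between the biomarker and the cancer disease.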
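The estimation step can be illustrated with a deliberately simplified sketch. It is not the sampler named above: a plain random-walk Metropolis sampler is substituted for NUTS, and the model is reduced to a single log odds ratio with a standard normal prior, no intercept, no covariates and fully observed biomarker values. The data, step size and chain length are assumptions.

```python
import math
import random

def log_posterior(beta, xs, ys):
    """Unnormalized log posterior: Bernoulli-logit likelihood plus N(0, 1) prior."""
    log_lik = 0.0
    for x, y in zip(xs, ys):
        eta = beta * x
        log_lik += y * eta - math.log1p(math.exp(eta))
    return log_lik - 0.5 * beta ** 2

def metropolis(xs, ys, n_samples=4000, burn_in=1000, step=0.5, seed=1):
    """Random-walk Metropolis sampler (a simple stand-in for NUTS)."""
    rng = random.Random(seed)
    beta = 0.0
    current = log_posterior(beta, xs, ys)
    draws = []
    for it in range(n_samples + burn_in):
        proposal = beta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        if it >= burn_in:
            draws.append(beta)
    return draws

def hpd_interval(draws, mass=0.95):
    """Shortest interval containing the requested share of the draws."""
    s = sorted(draws)
    k = math.ceil(mass * len(s))
    start = min(range(len(s) - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

# Hypothetical biomarker values and binary disease outcomes
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
draws = metropolis(xs, ys)
posterior_mean = sum(draws) / len(draws)   # point estimate of the log odds ratio
lo, hi = hpd_interval(draws)               # 95% highest posterior density interval
```

The same two summaries, the sample mean and the 95% highest posterior density interval, would be computed for every parameter of the full hierarchical model; only the sampler and the target density would be more elaborate.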
This embodiment uses Bayesian techniques to pool biomarker data from multiple studies, treating the reference measurements of biological specimens that were not re-assayed as unobservable latent variables. A two-level study-biological specimen model is established to describe the relationship between the reference measurements, the local measurements and the outcomes. From this model, the posterior distribution of the biomarker-disease association is estimated, providing a powerful framework for integrating biomarker data for cancer diseases.
Referring to fig. 4, the genomics-based biomarker screening method of the present application is implemented by a genomics-based biomarker screening system; fig. 4 is a functional block diagram of the genomics-based biomarker screening system according to an embodiment of the present application. The system comprises:
a sequencing data module 10, configured to determine biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics.
It should be noted that a cancer disease biological sample is any of the various biological materials related to cancer.
By way of example, a biological sample of a cancer disease may include tissue specimens (e.g., tumor tissue, paracancerous tissue), blood (e.g., whole blood, plasma, serum), body fluids (e.g., urine, cerebrospinal fluid, hydrothorax, etc.), cells (e.g., circulating tumor cells, etc.), and the like, obtained from a cancer patient.
Transcriptome sequencing data is data obtained by sequencing RNA transcribed from a biological sample of a cancer disease under a specific condition.
For example, the transcribed RNA may include messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and other non-coding RNAs. Transcriptome sequencing allows the expression of genes in cancer disease biological samples to be analyzed comprehensively and quantitatively.
The biomarker expression data are the gene data, within the transcriptome sequencing data, of significantly expressed biomarkers that are closely related to the cancer disease.
Functional genomics is the study of gene functions, including gene expression control, gene product functions, gene-to-gene interactions, and the like.
Clinical genomics is the application of genomic techniques and research results to clinical practice involving the use of genomic information to diagnose disease, predict disease prognosis, etc.
Constructing the transcriptome sequencing data according to the standards of functional genomics and clinical genomics gives a better understanding of the molecular mechanisms of cancer diseases and of individual differences, and supports the medical application of cancer biomarker screening, so that biomarkers can be screened more accurately and effectively.
In one embodiment, the cancer disease biological sample is subjected to transcription amplification, transcriptome sequencing data are acquired according to the standards of functional genomics and clinical genomics, and the screening system screens the data for features relevant to the cancer disease to obtain biomarker expression data of significant genes closely related to the cancer disease.
A posterior distribution module 20 for estimating a posterior distribution of biomarkers associated with cancer disease by a logistic regression model based on biomarker expression data of the significant genes.
It should be noted that the logistic regression model is a statistical analysis model for solving the two-classification (or multi-classification) problem. The association of the biomarker with the cancer disease may be described by a logistic regression model to determine the association between the biomarker and the cancer disease.
Biomarkers are biochemical molecules that can mark changes, or potential changes, in systems, organs, tissues, cells, subcellular structures and genes. Biomarkers help to diagnose cancer diseases more accurately.
It should be noted that the posterior distribution is the probability distribution of the association between a biomarker and the cancer disease. The posterior distribution gives a deeper understanding of the role of a biomarker in the occurrence of cancer and of the uncertainty of that role.
In one embodiment, the biomarker expression data are used as input variables and combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between the biomarker and the cancer disease, from which the posterior distribution of the biomarker-disease association is obtained.
An interaction module 30 for determining significant interacting genes in a biological sample of a cancer disease based on the posterior distribution and the inter-gene interaction network.
The inter-gene interaction network is a complex network structure formed by various interaction relationships between a plurality of genes.
In organisms, genes do not function in isolation. The expression and function of one gene may be affected by other genes, and several modes of interaction, such as synergy and antagonism, may exist. These interactions are interwoven and form a vast network.
It should be noted that a significant interacting gene is a gene whose interactions are particularly prominent in cancer disease biological samples and which may play a key role in the occurrence and development of cancer.
For example, in a cancer sample the interactions of certain genes may be significantly altered, such as an increase or decrease in the synergy or antagonism between them in the cancer state; such genes can be considered to have significant interactions in the cancer. The inter-gene interaction network gives a better understanding of the interactions between genes and reveals the association and synergy patterns between the genes that play a key role in the occurrence and development of cancer.
In one embodiment, genes whose interactions are particularly prominent and which play a key role in the occurrence and development of cancer are identified based on the posterior distribution of the biomarker-disease association, combined with the inter-gene interaction network. This provides an important basis for accurately screening biomarkers related to the cancer disease and helps to discover new biomarkers.
A marker screening module 40 for screening for a list of biomarkers associated with a cancer disease based on the significant interacting genes.
It should be noted that the biomarker list is a list listing biomarkers related to cancer diseases. Through the biomarker list, the association between the biomarker and the cancer disease can be better understood, and the diagnosis of the disease is facilitated.
In one embodiment, the significant interacting genes obtained above may be arranged in order of degree of interaction or classification of different physiological functions, and screened to obtain a biomarker list associated with cancer disease.
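The two ways of arranging the list mentioned above can be sketched as follows. The scoring rule (sum of absolute interaction strengths per gene), the gene names, the edge weights and the functional annotations are assumptions for illustration; the description does not define how the degree of interaction is quantified.

```python
from collections import defaultdict

def rank_by_interaction_degree(edges):
    """Order genes by the total absolute strength of their significant interactions."""
    degree = defaultdict(float)
    for (g, h), weight in edges.items():
        degree[g] += abs(weight)
        degree[h] += abs(weight)
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))

def group_by_function(ranked, annotations):
    """Alternative arrangement: classify the ranked genes by physiological function."""
    groups = defaultdict(list)
    for gene, _ in ranked:
        groups[annotations.get(gene, "unannotated")].append(gene)
    return dict(groups)

# Hypothetical significant interactions (gene pair -> interaction strength)
edges = {("G1", "G2"): 1.3, ("G2", "G3"): -0.9, ("G3", "G4"): 0.2}
annotations = {"G1": "cell cycle", "G2": "cell cycle", "G3": "apoptosis"}

ranked = rank_by_interaction_degree(edges)
biomarker_list = [gene for gene, _ in ranked]
by_function = group_by_function(ranked, annotations)
```

Either view can serve as the final biomarker list: the ranked list puts the most strongly interacting genes first, while the grouped view keeps that order within each functional class.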
This embodiment provides a genomics-based biomarker screening system. The system first subjects a biological sample of a cancer disease to transcription amplification, acquires transcriptome sequencing data according to the standards of functional genomics and clinical genomics, and screens the data for features relevant to the cancer disease to obtain biomarker expression data of significant genes closely related to the cancer disease. The biomarker expression data are then used as input variables and combined with information related to the cancer disease, and a logistic regression model gives a probabilistic description of the degree of association between each biomarker and the cancer disease, from which the posterior distribution of the biomarker-disease association is obtained. Based on this posterior distribution, combined with the inter-gene interaction network, genes whose interactions are particularly prominent and which play a key role in the occurrence and development of cancer are identified. This provides an important basis for accurately screening biomarkers related to the cancer disease and helps to discover new biomarkers. Finally, the significant interacting genes are arranged in order of degree of interaction, or classified by physiological function, and a list of biomarkers associated with the cancer disease is obtained by screening.
Because the posterior distribution of the biomarker-disease association and the significant interacting genes are both derived from the transcriptome sequencing data of the cancer disease biological sample, the missed detections caused by low biomarker specificity in traditional biomarker screening are avoided. A biomarker list highly correlated with the cancer disease can thus be obtained, and the sensitivity of the biomarker screening process is improved.
Based on the first embodiment of the inventive system, a second embodiment of the inventive system is presented. In the second embodiment of the present application, the same or similar contents as those of the first embodiment can be referred to the description above, and the description is omitted.
The sequencing data module 10 described in this example is also used to obtain multiple types of cancer disease biological samples.
The sequencing data module 10 is further used for performing transcription amplification on the cancer disease biological sample, and sequencing to obtain transcriptome sequencing data based on functional genomics and clinical genomics.
It should be noted that transcription amplification increases the number of RNA transcripts in a sample by specific technical means. The amplified sample is then sequenced to determine the sequence information of its RNA molecules, and the sequence information is screened with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data.
The sequencing data module 10 is further configured to perform cluster screening on the transcriptome sequencing data to obtain biomarker expression data related to a significant gene.
In this embodiment, a biological sample involved in a cancer disease (for example, cancer cells or cancer tissue) is subjected to transcription amplification. The amplified sample is then sequenced to determine the sequence information of the RNA molecules in it, and the result is screened with functional genomics and clinical genomics as references to obtain transcriptome sequencing data. Applying the standards of functional genomics and clinical genomics gives a better understanding of the molecular mechanisms of cancer diseases and of individual differences, and supports the medical application of cancer biomarker screening, so that biomarkers can be screened more accurately and effectively.
In another possible implementation, the sequencing data module 10 of this embodiment is further configured to: perform preliminary screening on the transcriptome sequencing data according to gene expression level to obtain gene difference data; perform an association test on the gene difference data to obtain association-tested gene difference data; perform feature screening on the association-tested gene difference data by a hierarchical clustering algorithm to obtain a hierarchical association result between genes and cancer diseases; and determine biomarker expression data related to significant genes according to the hierarchical association result.
It should be noted that the gene difference data are the data obtained by preliminary differential screening based on gene expression levels (e.g., comparing gene expression levels in diseased and healthy samples).
Specifically, the expression levels of each gene under different conditions (such as normal tissue and cancer tissue) are compared, and the genes whose expression levels change significantly (increase or decrease) are selected; these genes constitute the gene difference data.
The association test is a test, applied after the preliminary differential screening, of whether the screened data meet the criterion of being significantly associated with the cancer disease.
It should be noted that the hierarchical clustering algorithm is a cluster analysis method whose basic idea is to merge or split samples or data points step by step to form a hierarchical clustering result. The hierarchical clustering algorithm further aggregates, level by level, the association-tested gene difference data that are strongly associated with the cancer disease state and constructs a hierarchical clustering tree, which intuitively displays the hierarchical relationship between samples and genes in those data and yields the hierarchical association result.
In this embodiment, the transcriptome sequencing data are first subjected to preliminary differential screening based on gene expression levels to obtain the gene difference data. An association test is then applied to the screened data, using the criterion of significant association with the cancer disease, to obtain the association-tested gene difference data. Finally, feature screening is performed on the association-tested gene difference data by a hierarchical clustering algorithm: the data strongly associated with the cancer disease state are aggregated level by level, a hierarchical clustering tree is constructed that intuitively displays the hierarchical relationship between samples and genes, and the resulting hierarchical association result is used to determine the biomarker expression data related to significant genes. The hierarchical clustering algorithm thus gives a better understanding of the distribution and structure of the hierarchical relationship between cancer samples and genes, which improves the accuracy of the biomarker expression data obtained.
The genomics-based biomarker screening system provided by the present application adopts the genomics-based biomarker screening method of the above embodiments, and can solve the technical problem that, in traditional biomarker screening, low biomarker specificity and low screening sensitivity easily lead to missed detections and hence to missed or delayed diagnosis of patients. Compared with the prior art, the beneficial effects of the system are the same as those of the method provided by the above embodiments, and the other technical features of the system are the same as those disclosed for the method, so they are not repeated here.
It should be noted that the above examples are provided only to aid understanding of the present application and do not limit the genomics-based biomarker screening method and system of the present application; further simple variations based on the same technical concept all fall within the scope of protection of the present application.
The foregoing describes only some embodiments of the present application and is not intended to limit its scope. All equivalent structural changes made under the technical concept of the present application using the description and the accompanying drawings, and all direct or indirect applications in other related technical fields, are included in the scope of protection of the present application.

Claims (10)

1. A genomics-based biomarker screening method, the method comprising:
determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics;
estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes;
determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and an inter-gene interaction network; and
screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.
2. The method of claim 1, wherein the step of determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample comprises:
obtaining multiple types of cancer disease biological samples;
performing transcription amplification on the cancer disease biological samples, and sequencing them with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data; and
performing cluster screening on the transcriptome sequencing data to obtain the biomarker expression data associated with the significant genes.
3. The method of claim 2, wherein the step of performing cluster screening on the transcriptome sequencing data to obtain the biomarker expression data associated with the significant genes comprises:
performing a preliminary screening of the transcriptome sequencing data according to gene expression levels to obtain gene difference data;
performing an association test on the gene difference data to obtain gene difference data after the association test;
performing feature screening on the gene difference data after the association test by a hierarchical clustering algorithm to obtain a hierarchical association result between genes and the cancer disease; and
determining the biomarker expression data associated with the significant genes according to the hierarchical association result.
4. The method of claim 1, wherein the step of estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes comprises:
describing the association between the biomarkers and the cancer disease with a logistic regression model based on the biomarker expression data of the significant genes; and
applying a Bayesian method to the non-normalized association between the biomarkers and the cancer disease to obtain the posterior distribution of the association between the biomarkers and the cancer disease.
5. The method of any one of claims 1 to 4, wherein the step of determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and an inter-gene interaction network comprises:
determining the significant interactivity of genes in the cancer disease state and in the normal state based on the posterior distribution and the inter-gene interaction network; and
determining the significant interacting genes in the cancer disease biological sample based on the significant interactivity.
6. The method of any one of claims 1 to 4, wherein the step of screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease comprises:
performing a feature importance assessment on the significant interacting genes to obtain importance scores;
sorting the importance scores to obtain a score list; and
screening the significant interacting genes based on the score list to obtain the list of biomarkers associated with the cancer disease.
7. The method of claim 6, wherein the step of performing a feature importance assessment on the significant interacting genes to obtain importance scores comprises:
initializing a gradient boosting decision tree model;
training the gradient boosting decision tree model on a cancer gene database; and
performing the feature importance assessment on the significant interacting genes with the trained gradient boosting decision tree model to obtain the importance scores.
8. A genomics-based biomarker screening system, the system comprising:
a sequencing data module for determining biomarker expression data associated with significant genes from transcriptome sequencing data of a cancer disease biological sample, the transcriptome sequencing data being constructed on the basis of functional genomics and clinical genomics;
a posterior distribution module for estimating, by a logistic regression model, a posterior distribution of the association between biomarkers and the cancer disease based on the biomarker expression data of the significant genes;
an interaction module for determining significant interacting genes in the cancer disease biological sample based on the posterior distribution and an inter-gene interaction network; and
a marker screening module for screening, based on the significant interacting genes, to obtain a list of biomarkers associated with the cancer disease.
9. The system of claim 8, wherein the sequencing data module is further configured to obtain multiple types of cancer disease biological samples;
the sequencing data module is further configured to perform transcription amplification on the cancer disease biological samples, and to sequence them with functional genomics and clinical genomics as references to obtain the transcriptome sequencing data; and
the sequencing data module is further configured to perform cluster screening on the transcriptome sequencing data to obtain the biomarker expression data associated with the significant genes.
10. The system of claim 9, wherein the sequencing data module is further configured to perform a preliminary screening of the transcriptome sequencing data according to gene expression levels to obtain gene difference data;
the sequencing data module is further configured to perform an association test on the gene difference data to obtain gene difference data after the association test;
the sequencing data module is further configured to perform feature screening on the gene difference data after the association test by a hierarchical clustering algorithm to obtain a hierarchical association result between genes and the cancer disease; and
the sequencing data module is further configured to determine the biomarker expression data associated with the significant genes according to the hierarchical association result.
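The posterior estimation recited in claim 4 can be illustrated with a small worked example. The sketch below is not the claimed implementation; it assumes a single standardized gene, a logistic regression without intercept, a zero-mean Gaussian prior on the coefficient, and a grid approximation that normalizes the unnormalized posterior (likelihood times prior). The function names, the prior width, the grid range and the toy data are all hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def log_unnormalized_posterior(beta, x, y, prior_sd):
    """Log of (Gaussian prior times logistic likelihood), up to a constant."""
    log_prior = -0.5 * (beta / prior_sd) ** 2
    log_lik = sum(log_sigmoid(beta * xi) if yi == 1 else log_sigmoid(-beta * xi)
                  for xi, yi in zip(x, y))
    return log_prior + log_lik

def grid_posterior(x, y, prior_sd=2.0, lo=-10.0, hi=10.0, steps=2001):
    """Evaluate the unnormalized posterior on a grid and normalize it to sum to 1."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    logp = [log_unnormalized_posterior(b, x, y, prior_sd) for b in grid]
    top = max(logp)  # subtract the maximum before exponentiating to avoid underflow
    weights = [math.exp(lp - top) for lp in logp]
    total = sum(weights)
    return grid, [w / total for w in weights]

# Hypothetical standardized expression of one candidate gene and disease labels
# (1 = cancer, 0 = normal).
x = [1.2, 0.8, 1.5, -1.0, -0.7, -1.8, 0.3, -0.3]
y = [1, 1, 1, 0, 0, 0, 0, 1]

grid, post = grid_posterior(x, y)
post_mean = sum(b * p for b, p in zip(grid, post))
prob_positive = sum(p for b, p in zip(grid, post) if b > 0)
print(f"posterior mean of coefficient: {post_mean:.2f}")
print(f"posterior probability that the coefficient is positive: {prob_positive:.3f}")
```

A posterior mass concentrated away from zero suggests that the gene's expression is associated with the disease label; with many genes, a sampler or a variational method would normally replace the one-dimensional grid used here.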
CN202411946317.4A 2024-12-27 2024-12-27 Biomarker screening method and system based on genomics Pending CN119360969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411946317.4A CN119360969A (en) 2024-12-27 2024-12-27 Biomarker screening method and system based on genomics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411946317.4A CN119360969A (en) 2024-12-27 2024-12-27 Biomarker screening method and system based on genomics

Publications (1)

Publication Number Publication Date
CN119360969A true CN119360969A (en) 2025-01-24

Family

ID=94318167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411946317.4A Pending CN119360969A (en) 2024-12-27 2024-12-27 Biomarker screening method and system based on genomics

Country Status (1)

Country Link
CN (1) CN119360969A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220275455A1 (en) * 2019-07-08 2022-09-01 Preferred Networks, Inc. Data processing and classification for determining a likelihood score for breast disease
CN116106401A (en) * 2023-01-05 2023-05-12 上海交通大学 Biomarker combination and screening method thereof
CN117625793A (en) * 2024-01-23 2024-03-01 普瑞基准科技(北京)有限公司 Screening method of ovarian cancer biomarker and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
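Han Jialing: "Computational screening of colorectal cancer biomarkers and construction of a diagnostic model", China Master's Theses Full-text Database, Basic Sciences, 15 February 2023 (2023-02-15), pages 2006 - 1002 *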

Similar Documents

Publication Publication Date Title
Sun et al. Identification of 12 cancer types through genome deep learning
US10810213B2 (en) Phenotype/disease specific gene ranking using curated, gene library and network based data structures
Archer et al. L1 penalized continuation ratio models for ordinal response prediction using high‐dimensional datasets
Allison et al. Microarray data analysis: from disarray to consolidation and consensus
EP3942556A1 (en) Systems and methods for deriving and optimizing classifiers from multiple datasets
Ghosh Mixture models for assessing differential expression in complex tissues using microarray data
Khene et al. Application of machine learning models to predict recurrence after surgical resection of nonmetastatic renal cell carcinoma
Breen et al. A holistic comparative analysis of diagnostic tests for urothelial carcinoma: a study of Cxbladder Detect, UroVysion® FISH, NMP22® and cytology based on imputation of multiple datasets
Mallick et al. An integrated Bayesian framework for multi‐omics prediction and classification
CA3222355A1 (en) Systems and methods for associating compounds with physiological conditions using fingerprint analysis
Simon Microarray-based expression profiling and informatics
Doherty et al. A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator
Liu et al. Improved ReliefF-based feature selection algorithm for cancer histology
CN117766024B (en) Ovarian cancer CD8+T cell related prognosis evaluation method, system and application thereof
Mitchell et al. A highly efficient design strategy for regression with outcome pooling
Hobbs et al. Biostatistics and bioinformatics in clinical trials
Li et al. Regression analysis of misclassified current status data
Morgan et al. Comparison of multiplex meta analysis techniques for understanding the acute rejection of solid organ transplants
Sharma et al. A comparative study of data mining, digital image processing and genetical approach for early detection of liver cancer
CN119360969A (en) Biomarker screening method and system based on genomics
Liu et al. Evaluation and amelioration of computer-aided diagnosis with artificial neural networks utilizing small-sized sample sets
Tadesse et al. Identification of differentially expressed genes in high-density oligonucleotide arrays accounting for the quantification limits of the technology
Irigoien et al. Identification of differentially expressed genes by means of outlier detection
Denault et al. Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis
Tallarita et al. Bayesian autoregressive frailty models for inference in recurrent events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination