CN114098638B

CN114098638B - Interpretable dynamic disease severity prediction method

Info

Publication number: CN114098638B
Application number: CN202111338917.9A
Authority: CN
Inventors: 马欣宇; 王萌; 刘星; 林思涵; 欧阳文; 唐永忠
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2023-09-08
Anticipated expiration: 2041-11-12
Also published as: CN114098638A

Abstract

The invention discloses an interpretable dynamic disease severity prediction method, comprising the following steps: extracting SOFA scores, patient states and drug use information; linking the drug use information to UMLS standard terms to construct a drug-related knowledge graph; embedding the drug-related knowledge graph into a low-dimensional space to obtain the embeddings of the drug entities; determining the category i to which the SOFA change value at the current moment belongs, and multiplying the patient state and the drug entity embeddings by the i-th row of the corresponding weight matrix; inputting time series data formed by concatenating the SOFA score, the weighted patient state and the weighted drug entity embeddings into a TCN prediction model, outputting a predicted SOFA score trend, and training and updating the weight matrices; and explaining the predicted SOFA score trend. The invention is more sensitive to the change trend of the SOFA score, predicts the SOFA score more accurately, and can explain the prediction result.

Description

Interpretable dynamic disease severity prediction method

Technical Field

The invention relates to the technical field of disease severity prediction, in particular to an interpretable dynamic disease severity prediction method.

Background

With the continuous development of database technology, hospitals have collected and stored large numbers of electronic medical records, and how to mine knowledge from such massive real-world data has increasingly attracted the attention of researchers. Knowledge discovery and machine learning methods can be used to discover new patterns in patient data, as well as for classification and prediction tasks such as outcome or risk assessment. Real-time disease severity is an important concern for caregivers in intensive care units (ICUs) and is critical to saving patients' lives. If the rich information in electronic medical records can be learned and used to support clinical decision making in the ICU, it will be a significant contribution to clinical practice. Since the early 1990s, the Sequential Organ Failure Assessment (SOFA) score has been incorporated into many aspects of critical care. It comprehensively reflects the function of six organ systems: respiratory, cardiovascular, renal, nervous, hepatic and hematological. The higher the score, the more severe the patient's disease. In the disease severity prediction task, if the SOFA score trend of ICU patients can be predicted dynamically, clinicians can respond better to the patient's condition and make more appropriate clinical decisions. Many disease severity prediction models based on electronic medical record data mining already exist; however, these methods are not sensitive enough to the change trend of the SOFA score, their prediction accuracy is insufficient, and their prediction results are not adequately explained.

Disclosure of Invention

(I) Technical problem to be solved

Based on the problems, the invention provides an interpretable dynamic disease severity prediction method which can be more sensitive to the change trend of the SOFA score and can be used for predicting the SOFA score more accurately.

(II) Technical scheme

Based on the technical problems, the invention provides an interpretable dynamic disease severity prediction method, which comprises the following steps:

S1, extracting SOFA scores, patient states and drug use information from the MIMIC-III database, processing them into a time series format, and preprocessing them;

S2, according to the drug use information, linking the names of the drugs used to drug entities in the UMLS term library, together with their relations and corresponding medical entities, to construct a drug-related knowledge graph;

S3, embedding the drug-related knowledge graph into a low-dimensional continuous vector space using a knowledge graph embedding model to obtain the embeddings of all drug entities;

S4, obtaining the SOFA change value at the current moment from the SOFA score, determining the category i to which the SOFA change value at the current moment belongs, and multiplying the patient state and the drug entity embeddings by the i-th row of the patient state weight matrix and the i-th row of the drug weight matrix respectively to perform weight processing, where i = 1, 2, ..., 7; the value $\theta_{ij}$ or $\omega_{ij}$ in the i-th row and j-th column of the respective weight matrix represents the influence weight of the j-th patient state or the j-th drug on the SOFA change value under the i-th class, $N_1$ represents the total number of patient states, and $N_2$ represents the total number of drugs;

S5, inputting time series data formed by concatenating the SOFA score, the weighted patient state and the weighted drug entity embeddings into a TCN prediction model, and outputting a predicted SOFA score trend; obtaining the predicted category from the predicted SOFA score trend, and training and updating the patient state weight matrix and the drug weight matrix respectively through an SGD learning model.

Further, the method further comprises the following steps:

S6, explaining the predicted SOFA score trend: taking the variables whose weight values under a given class of the patient state weight matrix or the drug weight matrix exceed a weight threshold as important patient states or important drugs respectively, indicating the change trend of the patient's disease severity according to the important patient states, and obtaining the direct causes of the rise in SOFA score by combining the important drugs with the drug-related knowledge graph.

Further, in step S1, the method for extracting the SOFA score is: looking up the names of the items related to the SOFA score in the D_ITEMS and D_LABITEMS tables of MIMIC-III, and using the corresponding item IDs to look up the values and times of those items in the CHARTEVENTS and LABEVENTS tables; mapping the item values onto the SOFA score according to the definition of the SOFA score; looking up the start time of the ICU stay in the ICUSTAYS table by the unique ID of the ICU stay, and calculating the time corresponding to the SOFA score by subtracting the start time from the time of the item.

Further, in step S1, the patient status includes the following characteristics: demographic data of the patient, physiological parameters, laboratory test results, complications.

Further, in step S1, the drug use information is extracted from the INPUTEVENTS_MV table of MIMIC-III, with 1 representing that the drug is used at the current time and 0 representing that it is not.

Further, in step S1, the preprocessing includes data cleaning and missing value filling.

Further, the drug-related knowledge graph contains 38,117 medical entities, 154 relations and 186,840 triples, wherein the medical entities include 157 drug entities corresponding to the drug use information.

Further, the SOFA change value is the SOFA score at the current time minus the SOFA score at the time of admission; it is an integer and is divided into 7 classes: the first class when the SOFA change value is less than or equal to zero, the second class when it equals 1, the third class when it equals 2, the fourth class when it equals 3, the fifth class when it equals 4, the sixth class when it equals 5, and the seventh class when it is greater than or equal to 6.

Furthermore, the knowledge graph embedding model adopts a TransE model.

Further, the SOFA score input to the TCN prediction model is embedded into a 1 × 80 vector at the embedding layer before being fed to the network; the depth of the TCN prediction model is 6, and the convolution kernel size is 2.

(III) Beneficial effects

The technical scheme of the invention has the following advantages:

(1) The invention uses a TCN time series prediction model as the base model for SOFA score trend prediction and effectively fuses the patient state, drug use and SOFA score, which strengthens the model's ability to predict the SOFA score trend. A drug-related knowledge graph is constructed from an existing knowledge base according to the drugs used by the patient, so that the medical background knowledge of drug use is incorporated and more medical entities related to drug use are obtained. The embedded representations of the drug entities, which carry this medical background knowledge, are input into the SOFA score trend prediction model, so that the model retains good predictive ability even when the SOFA score changes sharply; the amount of computation increases, but the SOFA score trend prediction result is more accurate;

(2) The invention processes the inputs of the prediction model through the patient state weight matrix and the drug weight matrix, so that the influence of each input on the SOFA trend under the different SOFA change categories is reflected, which further improves the accuracy of the SOFA trend prediction result;

(3) The invention can explain the prediction result. The predicted SOFA score trend is explained through the patient state weight matrix and the drug weight matrix, which shows how the patient state affects changes in the SOFA score. Drug use is related to changes in the SOFA score through the constructed drug-related knowledge graph and the medical background knowledge it incorporates, which explains how the drugs used affect fluctuations in the SOFA score to different degrees. This helps clinicians respond better to the patient's condition and make more appropriate clinical decisions.

Drawings

The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:

FIG. 1 is a flow chart illustrating an exemplary method for predicting severity of a dynamic disease in accordance with an embodiment of the present invention;

FIG. 2 is a schematic diagram of a TCN prediction model according to an embodiment of the present invention;

FIG. 3 is a graph illustrating the weight impact of patient status on SOFA score according to an embodiment of the present invention;

FIG. 4 is a graph comparing the trend of patient status and SOFA score according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of part of the drug knowledge graph containing important drugs according to an embodiment of the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

An interpretable dynamic disease severity prediction method, as shown in fig. 1, comprises the steps of:

MIMIC-III (Medical Information Mart for Intensive Care III) is a large electronic medical record database and is the source of the electronic medical record clinical data in this embodiment. MIMIC-III contains health-related data for more than forty thousand patients who stayed in the intensive care units of Beth Israel Deaconess Medical Center between 2001 and 2012, including demographics, clinical vital sign measurements (about 1 data point per hour), laboratory test results, procedures, medications, care records, imaging reports, and mortality (in hospital and after discharge). This information is spread across the 26 tables of MIMIC-III, and the tables are connected by defined keys. The information needed by the present invention therefore has to be integrated across multiple tables.

S1.1, extracting SOFA scores, patient states and drug use information from the MIMIC-III database, and processing them into a time series format;

The present invention installs the MIMIC-III database in PostgreSQL and extracts the data using SQL. According to the definition of the SOFA score, the names of the items related to the SOFA score are looked up in the D_ITEMS and D_LABITEMS tables, and the corresponding item IDs are used to look up the values and times of those items in the CHARTEVENTS and LABEVENTS tables. The item values are mapped onto SOFA scores according to the definition of the SOFA score. The times, which are recorded to the second, are mapped to the t-th hour of the ICU stay: the start time of the ICU stay is looked up in the ICUSTAYS table by the unique ID of the ICU stay, and the start time is subtracted from the time of the SOFA score to give the hour to which that score belongs.
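The two mappings described above (item value to SOFA score, and timestamp to ICU hour) can be sketched as follows. This is a minimal illustration, not the patent's extraction code: the function names are hypothetical, only the coagulation (platelet) component of SOFA is shown, and the thresholds are those of the standard SOFA definition.

```python
from datetime import datetime
from typing import Optional


def icu_hour(event_time: datetime, icu_intime: datetime) -> int:
    """Whole hours elapsed between the start of the ICU stay and the event."""
    seconds = (event_time - icu_intime).total_seconds()
    return int(seconds // 3600)


def coagulation_sofa(platelets_k_per_ul: Optional[float]) -> Optional[int]:
    """SOFA coagulation sub-score from a platelet count given in 10^3/uL."""
    if platelets_k_per_ul is None:
        return None  # not measured in this hour
    if platelets_k_per_ul < 20:
        return 4
    if platelets_k_per_ul < 50:
        return 3
    if platelets_k_per_ul < 100:
        return 2
    if platelets_k_per_ul < 150:
        return 1
    return 0


# Hypothetical example: ICU stay starts at 10:00, lab value charted at 14:30.
intime = datetime(2101, 1, 1, 10, 0)
charted = datetime(2101, 1, 1, 14, 30)
hour = icu_hour(charted, intime)   # hour index t of the ICU stay
score = coagulation_sofa(85.0)     # sub-score for that hour
```

The other five SOFA components would be mapped in the same way from their own items before the sub-scores are summed per hour.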

Based on factors reported in the literature and on clinical experience, the present invention selects other relevant covariates recorded during hospitalization as the patient state. The patient state consists of a series of features, including patient demographics, physiological parameters, laboratory test results, complications, and the like. The static features comprise demographic data such as age and sex, taken from the PATIENTS table of MIMIC-III, together with complications; the dynamic features comprise physiological parameters such as heart rate and respiratory rate and laboratory test results such as platelets and white blood cells, which are extracted in the same way as the items related to the SOFA score.

Drug use information is extracted from the INPUTEVENTS_MV table of MIMIC-III. The table contains the item name of each drug administered to the patient and the time interval of administration, i.e. its start time and end time. When drug use is processed into a time series, 1 represents that the drug is used at the current time and 0 represents that it is not. Specifically, the present invention compares the time interval represented by each hour after the patient enters the ICU with the time interval of each drug administration. If the two intervals overlap, the hour is marked 1; otherwise it is marked 0.
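The overlap rule above can be illustrated with a short sketch. The function name and the sample interval are hypothetical; each hour of the ICU stay is treated as a half-open interval and compared against every administration interval of one drug.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def hourly_drug_flags(
    icu_intime: datetime,
    drug_intervals: List[Tuple[datetime, datetime]],
    n_hours: int,
) -> List[int]:
    """Return one 0/1 flag per ICU hour: 1 if any administration overlaps it."""
    flags = [0] * n_hours
    for t in range(n_hours):
        hour_start = icu_intime + timedelta(hours=t)
        hour_end = hour_start + timedelta(hours=1)
        for start, end in drug_intervals:
            # Two intervals overlap when each starts before the other ends.
            if start < hour_end and end > hour_start:
                flags[t] = 1
                break
    return flags


# Hypothetical example: one infusion from 10:30 to 12:15, stay starts at 10:00.
intime = datetime(2101, 1, 1, 10, 0)
infusion = [(datetime(2101, 1, 1, 10, 30), datetime(2101, 1, 1, 12, 15))]
flags = hourly_drug_flags(intime, infusion, n_hours=4)
```

Running this once per drug yields one binary time series per drug, aligned hour by hour with the SOFA score series.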

At this point, all the information needed from the MIMIC-III database has been extracted and processed into a time series format.

S1.2, preprocessing the extracted data, including data cleaning and missing value filling;

MIMIC-III contains ICU records for patients of all ages. Because the present invention mainly studies the adult population, records of patients younger than 18 years are deleted. The present invention also deletes erroneous data in MIMIC-III, such as ICU records of patients aged 300 years. In addition, although dynamic information representing the patient state, such as vital signs, can in theory be measured once per hour, many values are missing in practice; the present invention removes ICU records that lack the six vital signs, namely heart rate, systolic pressure, diastolic pressure, mean arterial pressure, respiratory rate and body temperature. For the drug use sequences, the present invention merges drug names that represent the same drug under the guidance of a professional physician.

In the patient state time series, some feature values are missing. The present invention adopts the forward-fill imputation strategy and adapts it to its own data. The specific filling strategy is as follows:

a. within an ICU stay, if the value of a feature is missing at some time, the most recent non-missing value of that feature before that time is used as the fill value;

b. if all values of the feature before the missing time are also missing, the nearest non-missing value after that time is used;

c. if the feature is never measured during the ICU stay, the mean value of that feature over all data is used.
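Rules a to c can be sketched for a single feature of a single ICU stay as follows. This is an illustrative implementation, assuming missing values are represented as `None` and that the mean over all data (`global_mean`, a hypothetical parameter) has been computed beforehand.

```python
from typing import List, Optional


def fill_feature(values: List[Optional[float]], global_mean: float) -> List[float]:
    """Fill one feature's hourly series for one ICU stay using rules a, b, c."""
    filled: List[Optional[float]] = list(values)

    # Rule a: carry the latest earlier observation forward.
    last_seen: Optional[float] = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last_seen
        else:
            last_seen = v

    # Rule b: leading gaps take the nearest later observation.
    next_seen: Optional[float] = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]

    # Rule c: a feature never measured in this stay takes the global mean.
    return [global_mean if v is None else v for v in filled]


heart_rate = fill_feature([None, 80.0, None, 90.0], global_mean=85.0)
temperature = fill_feature([None, None, None, None], global_mean=36.8)
```

Applying rule a before rule b matters: after the forward pass only the leading hours can still be empty, so the backward pass never overwrites a forward-filled value.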

S2, according to the drug use information, linking the names of the drugs used to drug entities in the UMLS term library, together with their relations and corresponding medical entities, to construct a drug-related knowledge graph;

The present invention links drug names to UMLS standard terms using the REST (Representational State Transfer) API (Application Programming Interface) of UMLS (Unified Medical Language System). Specifically, the UMLS term library is searched by drug name, and when there are multiple search results, the first result is selected as the drug entity linked to that drug name. Starting from these drug entities, the API is then used to retrieve their atom information, and each atom is used to retrieve its relations and the corresponding medical entities. The resulting drug-related knowledge graph contains 38,117 medical entities, 154 relations and 186,840 triples, where each triple has the form (head entity, relation, tail entity), denoted (h, r, t).

S3, embedding the drug-related knowledge graph into a low-dimensional continuous vector space using a knowledge graph embedding model to obtain the embeddings of all drug entities;

The present invention uses the existing TransE graph embedding model, which represents a triple (h, r, t) with distributed vectors of its entities and relation, to obtain embeddings of the 38,117 medical entities in the drug-related knowledge graph. Each medical entity embedding is a 1 × 80 vector, which is equivalent to describing each medical entity with 80 features. Of these, the present invention uses only the embeddings of the 157 drug entities corresponding to the drugs used.
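For readers unfamiliar with TransE, its core idea is that a true triple should satisfy h + r ≈ t in the embedding space, so the distance between h + r and t serves as a plausibility score. The sketch below shows only this scoring function, using the L2 norm (TransE may also use L1) and toy 2-dimensional vectors instead of the 80 dimensions used in this embodiment; the vectors themselves are made up for illustration.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy embeddings (hypothetical values).
head = [1.0, 0.0]
relation = [0.0, 1.0]
true_tail = [1.0, 1.0]     # exactly head + relation
corrupt_tail = [4.0, 5.0]  # far from head + relation

good = transe_score(head, relation, true_tail)
bad = transe_score(head, relation, corrupt_tail)
```

Training pushes the score of observed triples below that of corrupted ones by a margin; the entity vectors learned this way are the embeddings fed to the prediction model.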

According to the drugs used by the patient, a drug-related knowledge graph is constructed from an existing knowledge base so as to incorporate the medical background knowledge of drug use and obtain more medical entities related to the drugs; these medical entities fall into 125 categories. The embedded representations of the 157 drug entities, which carry this medical background knowledge, are input into the SOFA trend prediction model. After knowledge graph embedding each drug entity represents more information, so the model retains good predictive ability even when the SOFA score changes sharply; the amount of computation increases, but the SOFA trend prediction result is more accurate.

Large changes in SOFA are relatively rare in the training data, so the raw data alone are not enough for the model to learn the pattern behind such changes. Drugs act as external events, and once the information in the knowledge graph is incorporated, the model can be guided to find the change trend more reliably. For example, drug A and drug B may interact and cause an adverse reaction that rapidly aggravates the patient's condition, but the training data may contain few or no cases in which only these two drugs were used and the condition suddenly worsened. Without the knowledge graph information, the model can hardly learn from so few examples that combining the two drugs causes a large change in condition. After drug embedding, certain features of the embedding can represent this knowledge, so the model can also better capture changes caused by external factors.

S4, obtaining the SOFA change value at the current moment from the SOFA score, determining the category i to which the SOFA change value at the current moment belongs, and multiplying the patient state and the drug entity embeddings by the i-th row of the patient state weight matrix and the i-th row of the drug weight matrix respectively to perform weight processing, where i = 1, 2, ..., 7; the patient state weight matrix has 7 rows and $N_1$ columns, and the drug weight matrix has 7 rows and $N_2$ columns, as follows:

$$
\begin{pmatrix}
\theta_{11} & \cdots & \theta_{1N_1} \\
\vdots & \ddots & \vdots \\
\theta_{71} & \cdots & \theta_{7N_1}
\end{pmatrix},
\qquad
\begin{pmatrix}
\omega_{11} & \cdots & \omega_{1N_2} \\
\vdots & \ddots & \vdots \\
\omega_{71} & \cdots & \omega_{7N_2}
\end{pmatrix}
$$

where the value $\theta_{ij}$ or $\omega_{ij}$ in the i-th row and j-th column of the respective weight matrix represents the influence weight of the j-th patient state or the j-th drug on the SOFA change value under the i-th class, each row corresponds to one category of the SOFA change value, $N_1$ represents the total number of patient states, and $N_2$ represents the total number of drugs. The SOFA change value is the SOFA score at the current moment minus the SOFA score at the time of admission; it is an integer and is divided into 7 classes: the first class when the SOFA change value is less than or equal to zero, the second class when it equals 1, the third class when it equals 2, the fourth class when it equals 3, the fifth class when it equals 4, the sixth class when it equals 5, and the seventh class when it is greater than or equal to 6. The smaller the SOFA change value, the lower the disease risk.
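The class assignment and the row-wise weighting can be sketched as follows. The function names and numeric values are hypothetical. The sketch also assumes that "multiplying by the i-th row" means an element-wise product for the patient state vector and a per-drug scalar scaling of each drug embedding before the embeddings are summed; the text does not spell this out.

```python
from typing import Dict, List


def sofa_change_class(current_sofa: int, admission_sofa: int) -> int:
    """Map the SOFA change value to its class 1..7 as defined in step S4."""
    delta = current_sofa - admission_sofa
    if delta <= 0:
        return 1
    if delta >= 6:
        return 7
    return delta + 1  # change of 1..5 maps to class 2..6


def weight_patient_state(state: List[float], theta: List[List[float]], cls: int) -> List[float]:
    """Element-wise product of the patient state with row `cls` of theta."""
    row = theta[cls - 1]
    return [w * x for w, x in zip(row, state)]


def weight_and_sum_drugs(
    used: List[int], emb: Dict[int, List[float]], omega: List[List[float]], cls: int
) -> List[float]:
    """Scale each used drug's embedding by its weight in row `cls`, then sum."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for j in used:
        w = omega[cls - 1][j]
        total = [acc + w * e for acc, e in zip(total, emb[j])]
    return total


# Toy sizes: 2 patient state features, 2 drugs, 2-dimensional embeddings.
theta = [[1.0, 1.0] for _ in range(7)]
theta[1] = [0.5, -1.0]
omega = [[1.0, 1.0] for _ in range(7)]
omega[1] = [2.0, 0.5]
cls = sofa_change_class(current_sofa=8, admission_sofa=7)
state_w = weight_patient_state([2.0, 3.0], theta, cls)
drugs_w = weight_and_sum_drugs([0, 1], {0: [1.0, 2.0], 1: [10.0, 20.0]}, omega, cls)
```

In this embodiment the real sizes are 79 patient state features, 157 drugs and 80-dimensional embeddings, so the two matrices would be 7 × 79 and 7 × 157.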

S5, inputting time series data formed by concatenating the SOFA score, the weighted patient state and the weighted drug entity embeddings into a TCN prediction model, and outputting a predicted SOFA score trend; obtaining the predicted category from the predicted SOFA score trend, and training and updating the patient state weight matrix and the drug weight matrix respectively through an SGD learning model;

The present invention uses the existing TCN model to predict the SOFA score trend, as shown in FIG. 2. The input of the model is time series data; the data at each time step is the concatenation of the SOFA score, the weighted patient state and the weighted drug entity embeddings, and the data of different patients are separated by a separator. The output of the model is the SOFA score trend. The input SOFA score is a single value, which is embedded into a 1 × 80 vector at the embedding layer of the prediction model; the input patient state is a 1 × 79 vector in which each dimension represents one feature, such as heart rate; the input drug entity embeddings are n vectors of dimension 1 × 80, where n is the number of drug types used by the patient in that hour, and the n vectors are summed into a single 1 × 80 vector at the embedding layer of the model. That is, the data fed to the network for each hour is the concatenation of a 1 × 80, a 1 × 79 and a 1 × 80 vector. In the figure, the subscripts of $v_0$ to $v_{T-1}$ denote time, and the data length at each time step is determined by the sizes of the three kinds of collected data. In this embodiment the depth of the TCN model is 6, the convolution kernel size is 2, and the dilation factors of the convolution kernels are $d = [2^0, 2^1, 2^2, 2^3, 2^4, 2^5]$.
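To show what depth 6, kernel size 2 and dilations 2^0 to 2^5 imply, the sketch below stacks six causal dilated convolutions on a single scalar channel and computes the resulting receptive field. It is a simplification rather than the patent's network: it assumes one convolution per level and omits the residual connections, activations and multi-channel inputs of a full TCN, and the kernel values are placeholders.

```python
from typing import List


def causal_dilated_conv(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # causal: only the present and the past are visible
                acc += w * x[idx]
        out.append(acc)
    return out


def tcn_stack(x: List[float], kernels: List[List[float]], dilations: List[int]) -> List[float]:
    """Apply one causal dilated convolution per level, in order."""
    for kernel, d in zip(kernels, dilations):
        x = causal_dilated_conv(x, kernel, d)
    return x


def receptive_field(kernel_size: int, dilations: List[int], convs_per_level: int = 1) -> int:
    """Number of past time steps (including the current one) seen by one output."""
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)


DILATIONS = [2 ** i for i in range(6)]  # [1, 2, 4, 8, 16, 32]
rf = receptive_field(kernel_size=2, dilations=DILATIONS)

# An impulse at hour 0 influences exactly the first `rf` output hours.
impulse = [1.0] + [0.0] * 69
response = tcn_stack(impulse, [[1.0, 1.0]] * 6, DILATIONS)
```

Under these assumptions each prediction can draw on 64 consecutive hours of history; with two convolutions per level, as in many TCN implementations, `receptive_field(2, DILATIONS, convs_per_level=2)` gives 127 instead.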

After the predicted SOFA score trend is obtained, the parameters of the TCN prediction model are trained and optimized on the one hand; on the other hand, the predicted category is obtained from the predicted SOFA score trend, and the patient state weight matrix and the drug weight matrix are trained and updated respectively through the existing SGD learning model, so that the prediction result is continuously optimized.

The first class of the SOFA change value corresponds to a decrease in the SOFA score, and the second to seventh classes correspond to an increase. The influence of the patient state on a decrease in the SOFA score is therefore analyzed using the weight values in the first row (the first class) of the patient state weight matrix and the corresponding patient states. As shown in FIG. 3, the weight of magnesium under the first class is -0.95, which is the largest influence on a decrease in the SOFA score. The condition is most serious when the SOFA change value is in the seventh class, so the influence of the patient state on a rise in the SOFA score is analyzed using the weight values in the seventh row (the seventh class) of the patient state weight matrix and the corresponding patient states. As shown in FIG. 3, hemoglobin has the largest weight under the seventh class and therefore the greatest influence on a rise in the SOFA score.

These important features can indicate to the physician which patient states deserve closer attention during real-time care. In the case shown in FIG. 4, for example, falling hemoglobin, falling oxygen saturation and rising urea nitrogen mean that the patient's SOFA score is about to rise, i.e., that the severity of the disease is increasing.
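Selecting important patient states from one row of the weight matrix can be sketched as follows. The function name, the threshold and all weights except the -0.95 reported for magnesium are hypothetical. The sketch also assumes that the threshold is compared against the absolute value of each weight, since the most influential weight cited above is negative; the text only says "higher than a weight threshold".

```python
from typing import List, Tuple


def important_variables(
    weight_row: List[float], names: List[str], threshold: float
) -> List[Tuple[str, float]]:
    """Return (name, weight) pairs whose |weight| exceeds the threshold,
    ordered from most to least influential."""
    picked = [(n, w) for n, w in zip(names, weight_row) if abs(w) > threshold]
    return sorted(picked, key=lambda pair: abs(pair[1]), reverse=True)


# Row for the first class; only the magnesium weight comes from the text.
names = ["magnesium", "heart_rate", "platelets"]
first_class_row = [-0.95, 0.10, 0.40]
important = important_variables(first_class_row, names, threshold=0.3)
```

The same selection applied to a row of the drug weight matrix would give the important drugs to look up in the drug-related knowledge graph.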

For drug use, once the important drugs have been found, the drug knowledge graph constructed by the present invention can be consulted to give physicians a basis for verification. FIG. 5 shows the mechanisms of action of some of the important drugs: important drugs are drawn in ellipses, direct causes of a rising SOFA score are drawn in rectangles, relations are labeled on the edges, and intermediate entities are not enclosed in any shape.

In summary, the interpretable dynamic disease severity prediction method described above has the beneficial effects (1) to (3) set out in the Disclosure of Invention.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. An interpretable dynamic disease severity prediction method, comprising the following steps:

S4, obtaining the SOFA change value at the current moment from the SOFA score, determining the category i to which the SOFA change value at the current moment belongs, and multiplying the patient state and the drug entity embeddings by the i-th row of the patient state weight matrix and the i-th row of the drug weight matrix respectively to perform weight processing, where i = 1, 2, ..., 7; the value $\theta_{ij}$ in the i-th row and j-th column of the patient state weight matrix represents the influence weight of the j-th patient state on the SOFA change value under the i-th class, the value $\omega_{ij}$ in the i-th row and j-th column of the drug weight matrix represents the influence weight of the j-th drug on the SOFA change value under the i-th class, $N_1$ represents the total number of patient states, and $N_2$ represents the total number of drugs;

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein in step S1, the method of extracting the SOFA score is: looking up the names of the items related to the SOFA score in the D_ITEMS and D_LABITEMS tables of MIMIC-III, and using the corresponding item IDs to look up the values and times of those items in the CHARTEVENTS and LABEVENTS tables; mapping the item values onto the SOFA score according to the definition of the SOFA score; looking up the start time of the ICU stay in the ICUSTAYS table by the unique ID of the ICU stay, and calculating the time corresponding to the SOFA score by subtracting the start time from the time of the item.

4. The method of claim 1, wherein in step S1, the patient status comprises the following characteristics: demographic data of the patient, physiological parameters, laboratory test results, complications.

5. The method according to claim 1, wherein in step S1, the drug use information is extracted from the INPUTEVENTS_MV table of MIMIC-III, with 1 representing that the drug is used at the current time and 0 representing that it is not.

6. The method of claim 1, wherein in step S1, the preprocessing includes data cleansing and missing value padding.

7. The method of claim 1, wherein the drug-related knowledge graph comprises 38,117 medical entities, 154 relations, and 186,840 triples, wherein the medical entities include 157 drug entities corresponding to the drug use information.

8. The method according to claim 1, wherein the SOFA change value is a SOFA score at a current time minus a SOFA score at a time of admission, and is divided into 7 classes, each being an integer, and the SOFA change value is divided into a first class when zero or less, a second class when 1, a third class when 2, a fourth class when 3, a fifth class when 4, a sixth class when 5, and a seventh class when 6 or more.

9. The method of claim 1, wherein the knowledge graph embedding model adopts the TransE model.

10. The method of claim 1, wherein the SOFA score input to the TCN prediction model is embedded into a 1 × 80 vector at the embedding layer before being fed to the network, the TCN prediction model has a depth of 6, and the convolution kernel size is 2.