
CN109360657B - Time period reasoning method for selecting samples of hospital infection data - Google Patents

Time period reasoning method for selecting samples of hospital infection data

Info

Publication number
CN109360657B
Authority
CN
China
Prior art keywords
sample
data
patient
infection
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811129775.3A
Other languages
Chinese (zh)
Other versions
CN109360657A (en
Inventor
李栋栋
胡必杰
高晓东
牛耀军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lilian Information Technology Co ltd
Zhongshan Hospital Fudan University
Original Assignee
Shanghai Lilian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lilian Information Technology Co ltd
Priority to CN201811129775.3A
Publication of CN109360657A
Application granted
Publication of CN109360657B
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a time period reasoning method for sample selection of hospital infection data. During data sampling, the recorded diagnosis date is taken as the base date, and samples are selected from the true infection date or from the few days nearest to it. Samples are drawn within the inferred number of time units before the base date (the "pre-period") and after it (the "post-period"), and the period lengths are estimated by averaging. The beneficial effects are that time period reasoning solves the sample selection problem and effectively captures the infected samples from the period when the patient was in an infected state. The method is also generalizable and can be tried on any time series data, provided that comparable basic domain experience is available.


Description

A time period reasoning method for sample selection of hospital infection data

Technical Field

The invention relates to hospital infection data mining, and in particular to a time period reasoning method for selecting hospital infection data samples during big data analysis and modeling of hospital infection.

Background

Hospital infection causes substantial economic loss and casualties every year. Analysis and modeling of hospital infection data is one of the harder problems in medical data analysis: data quality is poor, samples are difficult to construct, and there are few good precedents to guide analysis and modeling. As hospital infection incidents receive more attention, building a monitoring and early-warning model that tracks hospital infection cases in real time, so that clinicians can intervene and treat in time, has become a highly valuable problem. In recent years, major hospitals have begun building their own hospital infection monitoring information systems, but these systems vary in quality and their results are unsatisfactory. The main cause is that big data analysis and modeling of hospital infection is difficult: there are no very successful cases to serve as guidance and reference, and each published case addresses only a small part of the problem, so the difficulties of hospital infection modeling are hard to describe and analyze comprehensively. The existing literature also proposes data modeling solutions, but each has problems.

For example, Lin Yusong, Wang Peipei, Liu Wei, et al. (Research on preprocessing methods for medical examination data [J]. Application Research of Computers, 2017, 34(4): 1089-1092) propose a data cleaning method that removes problems such as duplicate data through linear-function data transformations and similar means. In hospital infection data, however, missing data is the more common problem, and the pattern of missingness differs between attributes: body temperature is rarely missing in bulk, whereas laboratory tests may be missing for a whole week. Handling all data with a single fixed, one-size-fits-all procedure is therefore unreasonable.

As another example, Kotsiantis S B, Kanellopoulos D, Pintelas P E (Data preprocessing for supervised leaning [J]. International Journal of Computer Science, 2006, 1(2): 111-117) survey common treatments for missing and erroneous data in machine learning; missing values can be replaced by the mean, by a special value, and so on. For the modeling purpose here, however, these methods are not well suited. The ultimate goal is early warning or real-time monitoring of patients with hospital infection, and above all the system must give the basis for each warning. That basis should generally show the patient's real measured values rather than processed values, so that the doctor can make a sound diagnosis. Directly modifying values or substituting special values is therefore not appropriate in this setting.

As a further example, Li Hong, Liang Peifeng, Pan Dongfeng, et al. (Application of the autoregressive moving average mixed model in predicting hospital infection incidence [J]. Chinese Journal of Nosocomiology, 2013, 23(11): 2693) propose a time series model that monitors the trend of hospital infection, aiming at early warning and reduced infection risk. This model has two clear shortcomings. First, it monitors the hospital infection incidence rate indirectly, which generally amounts to a retrospective study after the fact; it can hardly provide advance or real-time monitoring, so hospital infections cannot be intervened in and treated in time. Second, it is a formula-type computational model that lacks interpretability, which makes causes hard to analyze. Moreover, its data come from Ningxia People's Hospital and it has not been extensively tested at other hospitals, so whether it generalizes remains to be verified.

The main difficulties encountered in big data analysis and modeling of hospital infection include the following:

(1) Missing hospital infection data. Hospital infection data is time-sensitive, so any use of the data must take into account the time range over which each patient's test results remain valid. Missing data compounds this and makes big data analysis and modeling of hospital infection harder;

(2) Dividing hospital infection data into positive and negative samples. Samples fall into two classes, infected and non-infected, and how to divide them into positive and negative examples is an important question. Non-infected samples are easy to obtain: randomly draw a few days of data from patients who never had a hospital infection. Infected samples are harder. Most patients with a hospital infection have long stays, and may be in an infected state for only part of the stay and normal for the rest, so obtaining the data for that infected period is difficult. A patient who is confirmed or reported as having a hospital infection generally has a doctor-diagnosed "infection date", here called the "diagnosis date", which identifies the day the infection occurred. The simplest approach is to take that single day as the infected sample. Investigation shows, however, that this date is the doctor's inference and is often inaccurate: the true infection date may fall before or after it, and dates are not recorded strictly. Zhang Xiaowei, Meng Lihui, Zheng Jia, et al. (Discussion of different statistical methods for the underreporting rate of hospital infection [J]. Chinese Journal of Nosocomiology, 2006, 1) discuss a similar issue.

Therefore, the prior art needs to be improved to address the above defects in big data analysis and modeling of hospital infection.

Summary of the Invention

In view of the above defects of the prior art, the technical problem to be solved by the invention is to provide a time period reasoning method for sample selection of hospital infection data, so as to solve the sample selection problem in big data analysis and modeling of hospital infection.

Before the invention is set out, the terms used in this document are explained and defined.

Valid time range: a collective term for the period during which a patient's test data remains usable. For example, body temperature, stool frequency, heart rate, and respiratory rate are highly time-sensitive and differ almost every day, so a 24-hour range can be used and data older than 24 hours is not considered. Microbiological and laboratory tests are less time-sensitive; data from the past three to five days can be regarded as valid, so a range of 72 or 120 hours can be used. These ranges are collectively called the "valid time range". The valid time range is generally determined from experience or from the literature, or set according to the actual modeling purpose; as a standard, refer to the duration of effect of certain features in the Diagnostic Criteria for Nosocomial Infection (Trial) (《医院感染诊断标准(试行)》).

Diagnosis date: a patient who is confirmed or reported as having a hospital infection generally has a doctor-diagnosed "infection date"; this is here called the "diagnosis date".

Infection date: the date on which the patient actually became infected.

Pre-period: when infected samples are selected with the diagnosis date as the base date, the number of time units inferred backward from that date.

Post-period: when infected samples are selected with the diagnosis date as the base date, the number of time units inferred forward from that date.
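
The pre-period and post-period definitions can be illustrated with a minimal Python sketch. It assumes the time unit is one day and that the diagnosis date itself belongs to the window; the function name `infection_window` and the example dates are hypothetical and do not come from the patent.

```python
from datetime import date, timedelta

def infection_window(diagnosis_date, pre_days, post_days):
    """Return the days treated as 'infected' around one diagnosis date.

    Assumption: the window runs from pre_days before the diagnosis date
    to post_days after it, and includes the diagnosis date itself.
    """
    start = diagnosis_date - timedelta(days=pre_days)
    return [start + timedelta(days=i) for i in range(pre_days + post_days + 1)]

# Hypothetical patient diagnosed on 2018-09-10, pre-period 2 days, post-period 1 day.
window = infection_window(date(2018, 9, 10), pre_days=2, post_days=1)
```

Each day in `window` would then contribute one infected (positive) sample for that patient.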

To solve the above problems, the invention provides a time period reasoning method for selecting hospital infection data samples, comprising the following steps:

Step 1: determine the features of the hospital infection data and classify them by valid time range. The feature set is denoted F, and k denotes the k-th feature in F;

Step 2: the set of all patients is denoted S. Take patient m from S and generate the set N of positive and negative samples for patient m;

Step 3: after the set N has been generated in step 2, denote the set of patients with hospital infection as C and the set of their diagnosis dates as Cd;

Step 4: randomly draw n patients from C and obtain their diagnosis dates;

Step 5: examine (diagnose) the n patients from step 4 and obtain the arrays A_pre and A_end, which hold the pre-period and post-period data of the n patients;

Step 6: sum each of the two arrays from step 5 and divide by n to obtain two averages, avg_pre = sum(A_pre)/n and avg_end = sum(A_end)/n. These two averages serve as the two parameters of time period reasoning for all patients in C, approximating the pre-period and post-period of every patient in C;

Step 7: update the data to generate the sample set D, and build and test a model on D;

Step 8: fine-tune avg_pre and avg_end iteratively according to the test results until the final required values are obtained.
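
Steps 5 and 6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not say how A_pre and A_end are measured, so the sketch assumes a reviewer supplies, for each of the n sampled patients, the first and last day of the infected state, and that the unit is days. The function name `estimate_periods` and the example records are hypothetical.

```python
from datetime import date

def estimate_periods(reviewed):
    """Estimate avg_pre and avg_end from n manually reviewed patients.

    reviewed: list of (diagnosis_date, infection_start, infection_end) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    n = len(reviewed)
    # A_pre: days between the start of the infected state and the diagnosis date.
    a_pre = [(diag - start).days for diag, start, _ in reviewed]
    # A_end: days between the diagnosis date and the end of the infected state.
    a_end = [(end - diag).days for diag, _, end in reviewed]
    avg_pre = sum(a_pre) / n
    avg_end = sum(a_end) / n
    return avg_pre, avg_end

# Hypothetical review results for n = 3 patients.
reviewed = [
    (date(2018, 9, 10), date(2018, 9, 8), date(2018, 9, 12)),
    (date(2018, 9, 20), date(2018, 9, 19), date(2018, 9, 20)),
    (date(2018, 10, 1), date(2018, 9, 28), date(2018, 10, 2)),
]
avg_pre, avg_end = estimate_periods(reviewed)
```

The two averages are starting values only; step 8 adjusts them according to how a model trained on the resulting samples performs.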

Note that in step 2 a positive sample is a sample of a patient m who developed a hospital infection, and a negative sample is a sample of a patient m who did not.

Further, if m is a patient in the positive samples, m denotes the m-th patient in S; if m is a patient in the negative samples, m is a randomly drawn patient.

Further, the data in step 7 is updated incrementally, as follows:

Step 7a: sort the positive and negative sample set N from step 2 in ascending time order, so that during the incremental update samples are always processed from earliest to latest and a new value always overwrites an old one;

Step 7b: store the earliest sample i of N in the sample set D, and store its values in the set T according to the features determined in step 1, recording Tk_v and Tk_date: the value of the k-th feature in T for sample i of N, and the date of that value;

Step 7c: for the second and every later sample i in N, check each value for missingness; update missing values and retain non-missing ones. If the value Tk_v of feature Tk in sample i is missing, traverse D in reverse order to find Tk_v and Tk_date for that feature. If the value found in D is not empty and its Tk_date differs from the Tk_date of sample i by no more than the valid time range, copy it into Tk_v of sample i in place of the missing value. Reverse traversal ensures that the samples visited in D are always those closest in time to the current sample (the same applies below). If the value in D is not empty but lies outside the valid time range, exit the traversal and leave the k-th feature of sample i missing. If the value in D is empty, continue to the next value.

Step 7d: store the updated or retained sample in D, then read the following samples in the order of step 5 and save their data;

Step 7e: repeat steps 7c and 7d until i = N, at which point reading is complete and the sample set D has been constructed.
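
The incremental update of steps 7a to 7e can be sketched in Python for a single patient. This is one reading of the procedure and not the patent's implementation. Instead of traversing D in reverse, it remembers the last actually observed value and time of each feature, which is equivalent if Tk_date is taken to be the original measurement time (an assumption). The sample layout, the feature names, and the hour values in `VALID_HOURS` are hypothetical.

```python
from datetime import datetime

# Hypothetical valid time ranges in hours, keyed by feature name.
VALID_HOURS = {"temperature": 24, "crp": 72}

def incremental_update(samples, valid_hours):
    """Fill missing feature values from the most recent observation.

    samples: list of {"time": datetime, "values": {feature: value or None}}.
    A missing value is filled only if the last observed value is no older
    than that feature's valid time range; otherwise it stays missing.
    """
    ordered = sorted(samples, key=lambda s: s["time"])  # step 7a
    last_seen = {}  # feature -> (value, time), the role of Tk_v and Tk_date
    result = []     # the sample set D
    for sample in ordered:
        filled = dict(sample["values"])  # copy so the input is not modified
        for feature, value in sample["values"].items():
            if value is not None:
                last_seen[feature] = (value, sample["time"])
            elif feature in last_seen:
                prev_value, prev_time = last_seen[feature]
                age_hours = (sample["time"] - prev_time).total_seconds() / 3600
                if age_hours <= valid_hours[feature]:
                    filled[feature] = prev_value
        result.append({"time": sample["time"], "values": filled})  # step 7d
    return result

samples = [
    {"time": datetime(2018, 9, 1, 8), "values": {"temperature": 38.5, "crp": 40.0}},
    {"time": datetime(2018, 9, 2, 8), "values": {"temperature": None, "crp": None}},
    {"time": datetime(2018, 9, 3, 8), "values": {"temperature": None, "crp": None}},
]
filled = incremental_update(samples, VALID_HOURS)
```

In this example the second sample inherits both values (24 hours old), while the third inherits only the C-reactive protein value, because the temperature reading is by then 48 hours old and outside its 24-hour range.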

The invention also provides an analysis and modeling method that solves hospital infection data sampling by time period reasoning, comprising the following steps:

Step A1: determine the features of the hospital infection data and classify them by valid time range;

Step A2: determine the patients from whom positive and negative samples are generated, where positive samples are samples of patients with hospital infection and negative samples are samples of patients without hospital infection;

Step A3: divide positive and negative samples by time period reasoning, implemented as in steps 1 to 8 above;

Step A4: generate the sample set by incremental update, implemented as in steps 7a to 7e above;

Step A5: analyze and model the final sample set.

The invention also provides an analysis and modeling system that solves hospital infection data sampling by time period reasoning, comprising at least: a database storing the set S of all patients and the case data of the patients in S; a sample generation module that generates sample sets according to sample generation conditions, for example a set of infected patients and a set of non-infected patients according to infection status; a sample division module that divides the sample sets produced by the sample generation module into the sample sets required for analysis and modeling; and a data update module that updates missing data values through steps 1 to 7 above.

The invention also provides a method for implementing the analysis and modeling system that solves hospital infection data sampling by time period reasoning, comprising the following steps:

Step B1: based on the information in the database, sort out and identify the patient data items required in the hospital infection data and design a corresponding XML storage structure;

Step B2: the sample generation module arranges each patient's data into the required sample format, with one sample per set sampling period and one feature per data item, and generates the required sample set;

In step B2, the hospital infection data is arranged into samples, each of which is one patient's data for one set sampling period. The features in each sample are updated by the incremental update method described above, finally producing a sample set made up of several patients' samples over the set sampling periods.

Step B3: the sample division module divides the sample set by the final classification labels, producing a sample set in which infected and non-infected samples are distinguished;

Step B4: the divided sample set is incrementally updated by the data update module;

Step B5: after the sample set has been updated, build the model by general modeling methods.

Further, in step B1 the file is stored as XML. It contains the patient's basic information, such as case number, sex, age, and infection date; the patient's basic admission information, such as admission diagnosis, admitting department, and admission date; and the information for each set sampling period during the stay, such as body temperature, doctor's orders, laboratory tests, microbiological tests, imaging examinations, and progress notes. Besides storing patient information, the scheme mainly makes the data easy to organize and use: each item in the XML can be extracted on its own and combined with other items, and each item carries an exact time, so the data can also be organized as a time series. How it is used depends on the developer's needs.
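
A per-patient XML file of this kind could be read with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The patent does not publish its schema, so every element and attribute name here (`patient`, `admission`, `records`, `item`, `case_no`, and so on) and every value is a hypothetical placeholder chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-patient file; the real schema is not given in the patent.
PATIENT_XML = """
<patient case_no="P0001" sex="F" age="67" diagnosis_date="2018-09-10">
  <admission department="ICU" date="2018-09-01" diagnosis="pneumonia"/>
  <records>
    <item name="temperature" time="2018-09-10T08:00:00" value="39.1"/>
    <item name="crp" time="2018-09-09T10:30:00" value="52.0"/>
    <item name="temperature" time="2018-09-09T08:00:00" value="38.6"/>
  </records>
</patient>
"""

def load_items(xml_text):
    """Return the case number and all timed items sorted chronologically."""
    root = ET.fromstring(xml_text.strip())
    items = [
        (item.get("time"), item.get("name"), float(item.get("value")))
        for item in root.iter("item")
    ]
    # ISO 8601 timestamps of equal precision sort correctly as strings.
    return root.get("case_no"), sorted(items)

case_no, items = load_items(PATIENT_XML)
```

Because every item carries its own timestamp, the same file can be sliced per feature or flattened into a time series, which is the flexibility the paragraph above describes.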

The invention also provides a computer-readable medium for solving hospital infection data sampling and hospital infection data analysis and modeling over a computer network. It comprises a set of instructions that, when executed, cause at least one computer to perform the hospital infection data sampling in the analysis and modeling process and the data analysis and modeling after sampling.

Implementing the method provided by the invention for solving hospital infection data sampling by time period reasoning has the following technical effects:

(1) Time period reasoning solves the difficulty of dividing samples when infection dates are inaccurate. Previously, hospital infection samples were mostly divided per patient, and collecting infection data required extensive manual review. This method divides infected and non-infected samples per day and separates the two classes by "time period", which solves the difficulty of selecting and dividing samples and effectively captures the infected samples from the period when the patient was in an infected state.

(2) Incremental update solves the problem of using missing or real-time data. Previous treatments of missing and real-time hospital infection data mostly assessed how much of a sample was missing and simply deleted samples with many missing values. That is not reasonable: even when many values are missing, the few that remain are valuable if they are real-time data. Incremental update can resolve most missing-data problems.

(3) Classifying features by valid time range addresses the fact that different features remain valid for different lengths of time.

(4) XML storage solves the problem that hospital data is complex and hard to use. Most previous methods processed and analyzed data exported from databases and related programs, and did not design a general-purpose data structure for data storage and processing. Besides being convenient to store and process, this method manages data per patient: all of a patient's information is gathered in one file, which helps data management and makes retrospective studies easier for developers, greatly facilitating use of the data.

(5) The basic workflow of "big data analysis and modeling of hospital infection" and several of its difficulties are described fairly clearly, clarifying the basic approach to analyzing and modeling hospital infection data.

Description of Drawings

The concept, specific structure, and technical effects of the invention are further described below with reference to the drawings, so that its purpose, features, and effects can be fully understood.

Figure 1 is a flowchart of hospital infection data analysis and modeling in an embodiment of the invention;

Figure 2 is a flowchart of time period reasoning in an embodiment of the invention;

Figure 3 is a flowchart of incremental update in an embodiment of the invention;

Figure 4 is a flowchart of the method for implementing the analysis and modeling system in an embodiment of the invention;

Figure 5 is a classification table of certain features from the Diagnostic Criteria for Nosocomial Infection (Trial) (《医院感染诊断标准(试行)》) in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

在以下实施方式中提到的“有效时间范围”指对病人/患者检测数据的时效时间范围的统称,例如,对于像体温、大便次数、心率和呼吸频率等数据具备较高的时效性,基本上每天都会有差异,那么数据使用就可以24小时作为范围,超过24小时的数据就不再考虑使用,而像微生物检查和实验室检查具备较低时效性,三天到五天内的数据都可以认为有效,那么就可以使用72或者120小时为范围,这种范围为统称“有效时间范围”。“有效时间范围”一般根据经验或者是参考文献中的资料来确定,也可以根据实际的建模目的来确立,标准参考《医院感染诊断标准(试行)》中部分特征的作用时间。The "valid time range" mentioned in the following embodiments refers to the general term for the time range of the time limit for patient/patient detection data. There will be differences every day, so the data use can be used as a range of 24 hours, and the data that exceeds 24 hours will not be considered for use. For example, microbiological examinations and laboratory examinations have low timeliness, and data within three to five days can be used. If it is considered to be valid, then 72 or 120 hours can be used as the range, which is collectively referred to as the "valid time range". The "effective time range" is generally determined based on experience or data in the reference literature, and can also be established according to the actual modeling purpose. The standard refers to the action time of some features in "Nosocomial Infection Diagnostic Standards (Trial)".

"Diagnosis date": a patient who has been diagnosed with, or reported as having, a hospital infection generally has an "infection date" determined by a physician; that date is called the "diagnosis date" here.

"Infection date" is the date on which the patient actually became infected.

"Previous time period": when infection samples are selected with the diagnosis date as the reference date, the length of time, in time units, reasoned backward from that date is the previous time period.

"Subsequent time period": when infection samples are selected with the diagnosis date as the reference date, the length of time, in time units, reasoned forward from that date is the subsequent time period.

FIG. 1 shows the analysis and modeling process for hospital infection data, which includes the following steps:

Step A1: determine the features of the hospital infection data, such as body temperature, pulse and C-reactive protein, to form the hospital infection feature set, denoted F, where k denotes the k-th feature in F; classify the feature set F by "valid time range" to generate the set T, where Tk denotes the category to which the k-th feature belongs;

The "valid time range" reflects the fact that different features affect the human body for different lengths of time. It is generally determined from experience or from the literature, or set according to the actual modeling purpose; as a standard, it is recommended to refer to the action times of some features given in the "Nosocomial Infection Diagnostic Standards (Trial)". FIG. 5 gives a partial classification used in this embodiment, which can serve as a reference.

The feature set is determined mainly from features summarized from the "Nosocomial Infection Diagnostic Standards (Trial)" and from features obtained from papers or from physicians. This work is mainly done in the requirements survey and analysis stage.
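
As an illustration of step A1, the following minimal sketch groups a feature set F into categories T by valid time range. The feature names are hypothetical, and assigning 72 hours to microbiological tests and 120 hours to laboratory tests is an assumption made only for this example; the text itself only states that 24 hours and 72 or 120 hours are typical ranges.

```python
# Hypothetical feature names; the hour values follow the examples in the text,
# but which test gets 72 h and which gets 120 h is an assumption.
VALID_HOURS = {
    "body_temperature": 24,
    "stool_count": 24,
    "heart_rate": 24,
    "respiratory_rate": 24,
    "microbiology_test": 72,
    "lab_test": 120,
}


def group_by_valid_range(valid_hours):
    """Build the category set T: valid range in hours -> sorted feature names."""
    categories = {}
    for feature, hours in valid_hours.items():
        categories.setdefault(hours, []).append(feature)
    return {hours: sorted(names) for hours, names in categories.items()}


T = group_by_valid_range(VALID_HOURS)
```

Keeping the range per feature in one table makes it straightforward to adjust the ranges later according to the modeling purpose.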

Step A2: determine the patients from which positive and negative samples are generated. This step divides the data by patient; positive samples come from patients who developed a hospital infection and negative samples from patients who did not. First, the patients with hospital infection are obtained. These are relatively easy to obtain, because such patients have either been diagnosed by the hospital or already been reported, so it suffices to take these patients together with their "diagnosis dates". Then, the non-infected patients are taken from those in the hospital who were not diagnosed with a hospital infection. Because there are many such patients, stratified sampling is combined with random sampling: the hospital's patients are stratified by department, and some patients are then drawn at random from each stratum. The number of patients finally drawn generally does not exceed 10 times the number of patients with hospital infection;

Note that this step only determines which patients have a hospital infection and which do not. These patients are not themselves the samples used for modeling, because a whole patient is not suitable as a single sample: during a hospital stay each patient is in an infected state for some period and normal otherwise, and only the infected period can serve as an infection sample. In other words, the samples have a time-series nature.
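
The selection of non-infected patients in step A2 can be sketched as follows. This is only an illustrative sketch: the function name, the per-department sampling fraction and the fixed random seed are assumptions; only the stratification by department and the cap of 10 times the number of infected patients come from the text.

```python
import random
from collections import defaultdict


def sample_negative_patients(non_infected, n_infected, per_dept_fraction=0.2,
                             max_ratio=10, seed=0):
    """Stratify by department, then draw patients at random from each stratum.

    non_infected: list of (patient_id, department) tuples.
    per_dept_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_dept = defaultdict(list)
    for patient_id, dept in non_infected:
        by_dept[dept].append(patient_id)

    chosen = []
    for dept in sorted(by_dept):
        members = by_dept[dept]
        # Draw at least one patient from every department.
        k = max(1, round(len(members) * per_dept_fraction))
        chosen.extend(rng.sample(members, k))

    # Keep the total at or below max_ratio times the infected count.
    cap = max_ratio * n_infected
    if len(chosen) > cap:
        chosen = rng.sample(chosen, cap)
    return chosen
```

Applying the cap with a second random draw is the simplest option, but it no longer guarantees that every department is represented; a proportional reduction per stratum would be an alternative.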

Step A3: divide positive and negative samples by time period reasoning. Once infected and non-infected patients are determined, positive and negative samples can be generated along the time series. This example generates samples by day, so each day of a patient's stay can serve as one sample. However, not every day of every stay has to produce a sample. For non-infected patients, the data of several days of the stay can be drawn by random sampling. For infected patients, the "time period reasoning" method is applied to extract the data of the corresponding period. The "previous time period" and "subsequent time period" used in time period reasoning require several attempts to find reasonable values when dividing positive and negative samples, and it is generally recommended that neither exceed 5 days. The time period reasoning process is shown in FIG. 2 and includes:

Step A3a: denote the set of hospital-infected patients as C and the set of their diagnosis dates as Cd;

Step A3b: randomly draw n patients from C and obtain the diagnosis dates of these n patients;

Step A3c: further diagnose these n patients according to the "Nosocomial Infection Diagnostic Standards (Trial)" to obtain the arrays A_pre and A_end holding their "previous time periods" and "subsequent time periods"; sum each array and divide by n to obtain the averages avg_pre = sum(A_pre)/n and avg_end = sum(A_end)/n. These two averages serve as the two time period reasoning parameters for all patients in C;

Step A3d: generate the sample set by incremental updating and carry out a modeling test;

Step A3e: according to the test results, repeatedly fine-tune avg_pre and avg_end, for example by adding 1 to or subtracting 1 from both, until values that perform well are obtained.
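
Steps A3c and A3d can be illustrated with the short sketch below, which computes avg_pre and avg_end and then lists the days around a diagnosis date that would be taken as positive samples. The function names, rounding the averages to whole days, and clipping the window to the hospital stay are assumptions of this example, not details stated in the text.

```python
from datetime import date, timedelta


def average_periods(a_pre, a_end):
    """Return (avg_pre, avg_end) for the n reviewed patients."""
    n = len(a_pre)
    return sum(a_pre) / n, sum(a_end) / n


def positive_days(diagnosis_date, avg_pre, avg_end, admit, discharge):
    """List the days treated as infection samples for one patient.

    The window runs from diagnosis_date - avg_pre to diagnosis_date + avg_end,
    rounded to whole days and clipped to the stay (both are assumptions).
    """
    start = max(admit, diagnosis_date - timedelta(days=round(avg_pre)))
    end = min(discharge, diagnosis_date + timedelta(days=round(avg_end)))
    days = []
    current = start
    while current <= end:
        days.append(current)
        current += timedelta(days=1)
    return days


avg_pre, avg_end = average_periods([2, 3, 4], [1, 1, 4])
window = positive_days(date(2018, 9, 10), avg_pre, avg_end,
                       date(2018, 9, 1), date(2018, 9, 30))
```

With these made-up inputs the window covers the three days before and the two days after the diagnosis date, in addition to the diagnosis date itself.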

Step A4: after the positive and negative samples are divided, generate the sample set by the "incremental update" method. This step is the same as step A3d. The features are updated incrementally according to the "valid time range" category assigned to each feature in step A1. Note that the positive samples obtained by time period reasoning for infected patients are continuous in time, so the method resolves most missing-data problems for them. Non-infected patients, however, are sampled at random, so continuity in time is hard to guarantee and incremental updating may not resolve the missing data. Such cases must be handled according to the actual situation; if there are too many missing values, consecutive days can be drawn at random when selecting samples from non-infected patients. The handling of missing values by "incremental update" is shown in FIG. 3, and the specific steps include:

Step A4a: denote the set of all patients from step A3 as S, where m denotes the m-th patient in S;

Step A4a: traverse S to obtain a hospital-infected patient m, apply "time period reasoning" to m to generate the positive and negative sample set N, and sort N in ascending order of date. Sorting ensures that time runs from earliest to latest during incremental updating, so that a newer value always overwrites an older one. If patient m is not infected, generate N by random sampling;

Step A4b: begin traversing N. The first sample i is the earliest one; store it directly in the sample set D, assign its features to the categories in T, and record Tk_v and Tk_date, the value of the k-th feature of sample i and the date of that value;

Step A4c: traverse the second and all later samples i. For each feature Tk of i, check its value Tk_v: if it is missing, proceed to the next step (step A4d); otherwise keep the value unchanged;

Step A4d: if the value Tk_v of feature Tk of sample i is missing, search the sample set D in reverse order for the value Tk_v and date Tk_date of that feature. If the value found in D is not empty and its Tk_date differs from the Tk_date of i by no more than the "valid time range", copy it into Tk_v of sample i in place of the missing value. If the value in D is not empty but lies outside the "valid time range", exit the traversal and leave the k-th feature of sample i missing. If the value in D is also empty, continue to the next value. Reverse-order traversal ensures that the samples visited in D are always those closest in time to the current sample;

Step A4e: after updating or keeping the values, store the sample in D and read the next sample, i.e. i = i + 1;

Step A4f: check whether i has reached the last sample of N (i = N). If so, the traversal is finished and the sample set D is complete; otherwise continue the traversal.
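
The incremental update of steps A4b to A4f can be sketched as below for the samples of one patient. The data layout (dictionaries with a `date` and a `features` mapping) and the function name are hypothetical. The sketch also assumes that a carried-forward value keeps the date of its original measurement, which is one reading of "Tk_date, the date of that value"; under this reading a value cannot be propagated beyond its valid time range.

```python
from datetime import date


def incremental_update(samples, valid_hours):
    """Fill missing feature values from the nearest earlier sample.

    samples: list of {"date": date, "features": {name: value or None}}
    valid_hours: {name: valid time range in hours}
    Returns the sample set D sorted by date; each feature is stored as
    (value, value_date), i.e. Tk_v and Tk_date.
    """
    D = []
    for sample in sorted(samples, key=lambda s: s["date"]):
        filled = {}
        for name, value in sample["features"].items():
            if value is not None:
                filled[name] = (value, sample["date"])  # keep as-is
                continue
            filled[name] = (None, None)
            for earlier in reversed(D):  # nearest earlier sample first
                prev_value, prev_date = earlier["features"].get(name, (None, None))
                if prev_value is None:
                    continue  # also empty in D: look further back
                # Samples are daily, so the age is counted in whole days.
                age_hours = (sample["date"] - prev_date).days * 24
                if age_hours <= valid_hours[name]:
                    filled[name] = (prev_value, prev_date)
                break  # nearest value was either used or is too old
        D.append({"date": sample["date"], "features": filled})
    return D


# Made-up example data for one patient.
samples = [
    {"date": date(2018, 9, 1), "features": {"temp": 38.5, "crp": 40}},
    {"date": date(2018, 9, 2), "features": {"temp": None, "crp": None}},
    {"date": date(2018, 9, 3), "features": {"temp": None, "crp": None}},
    {"date": date(2018, 9, 5), "features": {"temp": 37.0, "crp": None}},
]
D = incremental_update(samples, {"temp": 24, "crp": 72})
```

In this example the temperature from 1 September fills the gap on 2 September but not on 3 September, while the C-reactive protein value remains usable until it exceeds 72 hours.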

Step A5: analyze, model, test and optimize on the final sample set; this step is the same as step A3e. Once the sample set is generated, the main difficulties of hospital infection data are essentially resolved, and the remaining work follows the usual data analysis and machine learning process. Note, however, that the machine learning algorithm cannot be chosen arbitrarily: the warnings from a hospital infection monitoring and early-warning model generally need to be interpretable, that is, well-founded. The algorithm must therefore be interpretable, such as a decision tree, random forest or logistic regression, while deep learning, support vector machines and similar algorithms are not recommended. The modeling and testing process is shown in FIG. 3; this part still uses conventional algorithms and steps, as follows:

Step A5a: model the sample set D, preferably with an interpretable algorithm such as a decision tree, random forest or logistic regression, and record the algorithm's sensitivity and specificity on the test set;

Step A5b: after recording the sensitivity and specificity, fine-tune avg_pre and avg_end, model and test again, and record the two metrics;

Step A5c: repeat modeling and testing and find the best values of the two metrics; the corresponding avg_pre and avg_end are then essentially the optimal values;
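
The tuning loop of steps A5a to A5c can be sketched as follows. The `evaluate` callback is hypothetical: it stands for rebuilding the sample set with the given periods, training an interpretable model and returning its sensitivity and specificity on the test set. Combining the two metrics as sensitivity + specificity - 1 (Youden's J) is an assumption of this sketch; the text only says to find the best-performing pair.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity and specificity for binary labels (1 = infected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def tune_periods(avg_pre, avg_end, evaluate, deltas=(-1, 0, 1)):
    """Shift both periods by the same delta and keep the best-scoring pair.

    evaluate(pre, end) -> (sensitivity, specificity) is a hypothetical
    callback supplied by the caller.
    """
    best = None
    for delta in deltas:
        pre, end = max(0, avg_pre + delta), max(0, avg_end + delta)
        sensitivity, specificity = evaluate(pre, end)
        score = sensitivity + specificity - 1  # Youden's J (assumed criterion)
        if best is None or score > best[0]:
            best = (score, pre, end)
    return best[1], best[2]
```

In practice the loop would be repeated around the new best pair until the metrics stop improving.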

Once the final model is built, it can be deployed and integrated. This part differs considerably between systems, but the model itself is largely general-purpose.

The present invention also provides an analysis and modeling system that addresses missing hospital infection data by the incremental update method, comprising at least: a database storing the set S of all patients and the case data of the patients in S; a sample generation module, which generates sample sets according to sample generation conditions, for example a set of infected patients and a set of non-infected patients according to infection status; a sample division module, which divides the sample sets generated by the sample generation module into the sample sets required for analysis and modeling; and a data update module, which updates missing data values by steps A4a to A4f described above.

An implementation method for the analysis and modeling system that addresses missing hospital infection data by the incremental update method, as shown in FIG. 4, includes the following steps:

Step B1: according to the information in the database, sort out and specify the patient data items required in the hospital infection data, and design a corresponding XML storage structure;

Step B2: the sample generation module arranges the patient data into the required sample format, with one sample per set sampling period and one feature per data item, and generates the required sample set;

In step B2, the hospital infection data are arranged into samples, each being one patient's data for one set sampling period. The features in the samples are updated incrementally by the incremental update method described above, finally producing a sample set made up of the samples of a number of patients over the set sampling periods.

Step B3: the sample division module divides the sample set according to the final classification labels, producing a sample set in which infection samples and non-infection samples are distinguished;

Step B4: the divided sample set is incrementally updated by the data update module;

Step B5: after the sample set has been updated, build the model by the usual modeling method.

Further, in step B1 the file is stored as XML. It contains the patient's basic information, such as case number, sex, age and infection date; the patient's basic admission information, such as admission diagnosis, admitting department and admission date; and the information recorded for each set sampling period during the stay, such as body temperature, medical orders, laboratory tests, microbiological tests, imaging examinations and progress notes. Besides storing patient information, the main benefit of this scheme is that it eases organizing and using the data: each item in the XML can be extracted individually and combined with other items, and every item carries an exact time, so items can also be organized as a time series. How they are used depends on the developer's needs.
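
To make the storage scheme of step B1 more concrete, the sketch below parses one possible XML record and groups its timed items by day. The element and attribute names, as well as all values in the record, are invented for illustration; the patent does not specify the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout and made-up values, for illustration only.
RECORD = """
<patient case_no="P001" sex="F" age="67" infection_date="2018-09-10">
  <admission diagnosis="pneumonia" department="ICU" date="2018-09-01"/>
  <observations>
    <item name="body_temperature" time="2018-09-09T08:00" value="38.6"/>
    <item name="crp" time="2018-09-09T10:30" value="42"/>
    <item name="body_temperature" time="2018-09-10T08:00" value="39.1"/>
  </observations>
</patient>
"""


def items_by_day(xml_text):
    """Group timed items into one feature dictionary per calendar day."""
    root = ET.fromstring(xml_text.strip())
    days = {}
    for item in root.iter("item"):
        day = item.get("time")[:10]  # keep the YYYY-MM-DD part
        days.setdefault(day, {})[item.get("name")] = float(item.get("value"))
    return days


daily = items_by_day(RECORD)
```

Because every item carries its own timestamp, the same record can be regrouped by a different sampling period without changing the stored structure.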

A computer-readable medium for selecting sample sets and for analyzing and modeling hospital infection data over a computer network, comprising a set of instructions that, when executed, cause at least one computer to perform the selection of sample sets in the hospital infection data analysis and modeling process, and the data analysis and modeling after the sample sets are selected.

Preferred specific embodiments of the present invention are described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and changes according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation according to the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. A time period reasoning method for selecting samples of hospital infection data, characterized by comprising the following steps:
step 1, determining the features of the hospital infection data and classifying the features according to a valid time range, wherein the feature set is denoted F, k denotes the k-th feature in the set F, and the valid time range is a general term for the period during which patient test data remain usable;
step 2, denoting the set of all patients as S, obtaining a patient m in the set S, and generating a positive and negative sample set N for the patient m;
step 3, after the positive and negative sample set N is generated in step 2, denoting the set of hospital-infected patients as C and the set of the diagnosis dates of the infected patients as Cd;
step 4, randomly drawing n patients from the set C and obtaining the diagnosis dates corresponding to the n patients;
step 5, diagnosing the n patients of step 4 and obtaining arrays A_pre and A_end formed from the "previous time period" and "subsequent time period" data of the n patients, wherein the "previous time period" is the length, in time units, reasoned backward when infection samples are selected with the diagnosis date as the reference date, and the "subsequent time period" is the length, in time units, reasoned forward when infection samples are selected with the diagnosis date as the reference date;
step 6, summing each of the two arrays of step 5 and averaging to obtain two averages avg_pre = sum(A_pre)/n and avg_end = sum(A_end)/n; these two averages serve as the two time period reasoning parameters for all patients in the set C, approximating the "previous time period" and "subsequent time period" of all patients in the set C;
step 7, updating the data to generate a sample set D and carrying out a modeling test on the sample set D;
step 8, repeatedly fine-tuning avg_pre and avg_end according to the test results to obtain the final required values;
wherein a positive sample in step 2 is a sample of a patient m with a hospital infection, and a negative sample is a sample of a patient m without a hospital infection.
2. The time period reasoning method according to claim 1, wherein if m is a patient of the positive samples, m denotes the m-th patient in S; if m is a patient of the negative samples, m is a randomly drawn patient.
3. The time period reasoning method according to claim 1, wherein the method of updating data in step 7 is an incremental update method comprising the following steps:
step 7a, sorting the positive and negative sample set N of step 2 in ascending order of time, so that time runs from earliest to latest during incremental updating and a newer value always overwrites an older one;
step 7b, storing the earliest sample i of the sample set N in the sample set D, assigning it to the set T according to the features of the hospital infection data determined in step 1, and recording Tk_v and Tk_date, which denote the value of the k-th feature in the set T for the sample i of the set N and the date of that value;
step 7c, checking the second and all subsequent samples i of the sample set N for missing values, updating the missing values and keeping the non-missing values;
step 7d, storing the updated or kept samples in the sample set D, reading subsequent samples according to the sequence of step 5, and storing the sample data;
step 7e, when repeating step 7c and step 7d yields i = N, the reading is complete and the construction of the sample set D is complete.
4. The time period reasoning method according to claim 3, wherein in step 7c, if the value Tk_v of the feature Tk of the sample i is missing, the value Tk_v and date Tk_date of the feature Tk are searched for in reverse order in the sample set D, and if the value in the sample set D is not empty and the difference between its Tk_date and the Tk_date of i does not exceed the "valid time range", the value is written into Tk_v of the sample i in place of the missing value.
5. The time period reasoning method according to claim 3, wherein in step 7c, if the value Tk_v of the feature Tk of the sample i is missing, the value Tk_v and date Tk_date of the feature Tk are searched for in reverse order in the sample set D, and if the value in the sample set D is not empty but exceeds the "valid time range", the traversal is exited and the k-th feature of the sample i remains missing.
6. The time period reasoning method according to claim 3, wherein in step 7c, if the value Tk_v of the feature Tk of the sample i is missing, the value Tk_v and date Tk_date of the feature Tk are searched for in reverse order in the sample set D, and if the value in the sample set D is empty, the traversal continues to the next value.
7. An analysis and modeling method that addresses hospital infection data sample selection by the time period reasoning method of any one of claims 1-6, comprising the following steps:
step A1, determining the features of the hospital infection data and classifying the features according to a valid time range;
step A2, determining the patients from which positive and negative samples are generated, wherein a positive sample is a sample of a patient with a hospital infection and a negative sample is a sample of a patient without a hospital infection;
step A3, dividing positive and negative samples by time period reasoning, implemented as described in step 1 to step 8;
step A4, generating a sample set by an incremental update method, implemented as described in steps 7a-7e;
step A5, analyzing and modeling the final sample set.
8. An analysis and modeling system that addresses hospital infection data sample selection by the time period reasoning method of any one of claims 1-6, comprising at least: a database storing the set S of all patients and the case data of the patients in the set S; a sample generation module, which generates a sample set according to sample generation conditions; a sample division module, which divides the sample set generated by the sample generation module into the sample set required for analysis and modeling; and a data update module, which updates missing data values through step 1 to step 8.
9. An implementation method of the analysis and modeling system that addresses hospital infection data sample selection by the time period reasoning method according to claim 8, comprising the following steps:
step B1, according to the information in the database, sorting out and specifying the patient data items required in the hospital infection data and designing a corresponding XML storage structure;
step B2, the sample generation module arranging the patient data into the required sample format, with one sample per set sampling period and one feature per data item, and generating the required sample set;
in step B2, the hospital infection data are arranged into samples, each being one patient's data for one set sampling period, and the features in the samples are updated incrementally by the incremental update method according to claim 3, finally producing a sample set made up of the samples of a number of patients over the set sampling periods;
step B3, the sample division module dividing the sample set according to the final classification labels to produce a sample set in which infection samples and non-infection samples are distinguished;
step B4, the divided sample set being incrementally updated by the data update module;
step B5, after the sample set has been updated, building a model by the usual modeling method.
10. A computer readable medium for selecting a sample set and hospital infection data analysis and modeling over a computer network, comprising a set of instructions which, when executed, cause at least one computer to perform the steps of solving the problem of sample selection during the hospital infection data analysis and modeling process and analyzing and modeling the data after sample selection according to any one of claims 1-6.
CN201811129775.3A 2018-09-27 2018-09-27 Time period reasoning method for selecting samples of hospital infection data Active CN109360657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811129775.3A CN109360657B (en) 2018-09-27 2018-09-27 Time period reasoning method for selecting samples of hospital infection data

Publications (2)

Publication Number Publication Date
CN109360657A CN109360657A (en) 2019-02-19
CN109360657B true CN109360657B (en) 2022-06-03

Also Published As

Publication number Publication date
CN109360657A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109360657B (en) Time period reasoning method for selecting samples of hospital infection data
US20200337580A1 (en) Time series data learning and analysis method using artificial intelligence
US20100217144A1 (en) Diagnostic and predictive system and methodology using multiple parameter electrocardiography superscores
US6099469A (en) Reflex algorithm for early and cost effective diagnosis of myocardial infractions suitable for automated diagnostic platforms
CN110974214A (en) A deep learning-based automatic electrocardiogram classification method, system and device
CN109659033A (en) A kind of chronic disease change of illness state event prediction device based on Recognition with Recurrent Neural Network
Tadesse et al. DeepMI: Deep multi-lead ECG fusion for identifying myocardial infarction and its occurrence-time
WO2019161611A1 (en) Ecg information processing method and ecg workstation
JP7404581B1 (en) Chronic nephropathy subtype mining system based on self-supervised graph clustering
CN112786203A (en) Machine learning diabetic retinopathy morbidity risk prediction method and application
Pandiaraj et al. Effective heart disease prediction using hybrid machine learning
CN107348964B (en) Psychological load measurement method of drivers in extra-long tunnel environment based on factor analysis
CN116740426A (en) A classification and prediction system for functional magnetic resonance images
Nath et al. Quantum annealing for automated feature selection in stress detection
US11961204B2 (en) State visualization device, state visualization method, and state visualization program
CN114191665A (en) Classification method and classification device for human-machine asynchrony during mechanical ventilation
CN109461480B (en) Incremental updating method for hospital infection data loss
CN115607166B (en) A method and system for intelligent analysis of ECG signals, and an intelligent ECG auxiliary system
Meng et al. Prediction of coronary heart disease using routine blood tests
CN114038554A (en) An auxiliary diagnosis system for tuberculous pleural effusion based on machine learning algorithm
CN115512845A (en) ACS risk prediction method, device and storage medium
Wang et al. Automated analysis of fetal heart rate baseline/acceleration/deceleration using MTU-Net3+ model
AU2401499A (en) Automated diagnostic system implementing immunoassays and clinical chemistry assays according to a reflex algorithm
CN113421643B (en) AI model reliability judging method, device, equipment and storage medium
Firoz et al. Detection of myocardial infarction using hybrid CNN-LSTM model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230824

Address after: 200032 No. 136, Xuhui District Medical College, Shanghai

Patentee after: Zhongshan Hospital, Fudan University

Patentee after: Shanghai Lilian Information Technology Co., Ltd.

Address before: 200444 Room 1536, Building 1, No. 668, Shangda Road, Baoshan District, Shanghai

Patentee before: Shanghai Lilian Information Technology Co., Ltd.