KR102693667B1

KR102693667B1 - Apparatus and method for predicting discharge of inpatients

Info

Publication number: KR102693667B1
Application number: KR1020210154810A
Authority: KR
Inventors: 김영학; 전태준; 안임진; 강희준; 권한슬; 김윤하; 서혜람; 조하나; 최희정; 김민경; 한지예
Original assignee: 재단법인 아산사회복지재단; 울산대학교 산학협력단
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2024-08-09
Anticipated expiration: 2041-11-11
Also published as: KR20230068717A; WO2023085674A1

Abstract

A device for predicting patient discharge according to one embodiment may include a processor that, for a patient who is hospitalized at a first time point, applies a machine learning model to input data including medical data collected from the admission date to the first time point during the patient's hospitalization period, thereby obtaining a first likelihood score that the patient will be discharged within a target period from the first time point, and that predicts, based on the obtained first likelihood score, whether the patient will be discharged within the target period from the first time point. In response to the patient still being hospitalized at a second time point after the first time point, the processor may update the medical data collected during the hospitalization period based on medical data collected from the first time point to the second time point, obtain a second likelihood score that the patient will be discharged within the target period from the second time point by applying the machine learning model to input data including the updated medical data, and predict, based on the obtained second likelihood score, whether the patient will be discharged within the target period from the second time point.

Description

APPARATUS AND METHOD FOR PREDICTING DISCHARGE OF INPATIENTS

Hereinafter, technology related to an apparatus and method for predicting the discharge of inpatients is disclosed.

Effective resource management in hospitals can improve the quality of healthcare services by reducing the labor-intensive burden on staff, shortening waiting times for inpatients, and securing optimal treatment time. Making good use of hospital processes requires effective bed management, and a stay longer than the patient's optimal treatment time can hinder it. Predicting a patient's length of stay can therefore support sound bed-management decisions.

Making good use of expensive and scarce human and material resources may be essential for the efficient operation of hospital processes. Hospitals may be required to improve overall management efficiency while handling a variety of resources, such as medical staff and their schedules, beds, and clinical pathways. Effective resource management in hospitals can improve the quality of care by reducing the labor-intensive burden on staff, shortening waiting times for inpatients, and securing optimal treatment time. Hospital resource management may include bed management. In most hospitals today, clinicians manually check a patient's condition to decide whether the patient should remain hospitalized or be discharged. Based on that decision, medical and other staff can determine bed capacity for the near future and schedule patients. The number of patients hospitalized for various chronic and acute diseases, such as cardiovascular disease (CVD), is steadily increasing, and inadequate treatment may lead to readmission or complications. If patients stay longer than their optimal treatment time, effective bed management may become difficult. It may therefore be important to predict a patient's length of stay accurately and to decide on discharge carefully.

Many studies have focused on the efficiency of hospital resources, and most of them propose algorithms or models for improving bed management. Integer linear programming has been proposed to examine bed planning and solve the associated optimization problem. Simulated bed occupancy schedules have been described. Bed simulation for surgical patients has also been studied using Monte Carlo simulation to determine ICU capacity. In particular, the expected length of stay (LOS) is one piece of information needed for bed management, and LOS can be predicted from electronic health records (EHR). Machine learning (ML)-based models can be used to predict LOS, prolonged hospitalization, and unplanned readmission, and to find biomarkers for critical illness. Recently, interpretable or explainable artificial intelligence (XAI) has been studied extensively. In one XAI study, a model was developed that predicts acute disease and provides both the result and its interpretation. Compared with EHR-based work, research using computer vision on medical images has been pursued more actively because the important parts of an image can be visualized directly. To support bed management, an ML-based prediction model may be developed that provides daily discharge probabilities together with an "individual explainer" that visualizes the influential features.

Korean Patent Publication No. 10-2021-0113042
International Publication WO2021/028961
United States Patent Application Publication No. US2021/0005321

A machine learning (ML)-based prediction model may be developed to predict the discharge probability of inpatients with cardiovascular diseases (CVDs). The results of the prediction model may be evaluated, and the major risk factors of each inpatient may be explained to support patient-specific treatment. Bed schedules may be managed efficiently, and long-stay patients may be detected in advance. The utilization of hospital processes and the quality of medical services may thereby be improved.

Patients with chronic and acute diseases, including cardiovascular disease, may have high rates of hospitalization, readmission, and complications. Transferring a patient to another hospital may be an alternative for resolving delays in treatment or admission that cause serious problems. Hospitals should continuously seek fundamental measures to reduce waiting times, and efficient bed management is one of them. Because diseases are diverse, it may be more advantageous to find common risk factors, implement bed management for specific departments or diseases (e.g., clustered specific wards), and then expand to the hospital level. By developing an ML-based model and predicting the discharge of patients hospitalized with CVD, available bed capacity in the near future can be determined and risk factors can be discovered. By providing persuasive discharge information, such as each individual's expected discharge date and cardiovascular risk factors, the model can assist in practice with the accurate bed management that medical staff currently perform manually.

An individual explainer is proposed to evaluate the prediction results and to explain the major risk factors of each inpatient for patient-specific care. Even when patients have the same disease and share common variables indicating that disease, their characteristics, medical histories, circumstances, and treatments may differ. It may therefore be necessary to identify and monitor the individual variables unique to each patient. The output of the ML-based prediction model may include not only information on a patient's daily discharge likelihood but also the contribution of each feature, such as feature importance. Each patient's daily discharge probability and the features that affect that patient during hospitalization can be visualized. The individual explainer can help medical teams and patients obtain a reasonable basis for the results of the ML-based model, understand the patient's condition in detail, and prepare for treatment in advance. Individual analysis can focus on each patient, and the meaningful features it identifies can be used in other studies as a basis for identifying in advance the variables that affect hospitalization.

This can help manage bed reservations efficiently and detect long-stay patients in advance. Bed management may refer to the process of identifying the patients most likely to be discharged, securing the number of available beds, and assigning beds to patients who are waiting after reserving admission. Because this process is complex and usually performed manually, the intent may be to support it by providing the expected LOS and discharge probability returned by the model and by indicating bed capacity in the near future. Not only patients with a high discharge probability but also patients whose discharge probability remains consistently low can be detected. In other words, the causes of prolonged hospitalization in high-risk patients can be identified, analyzed, and provided to the management team.
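As a rough illustration of how per-patient scores could feed bed management, the sketch below sums today's scores to estimate how many beds may free up within the target period, and flags patients whose scores stay low. It assumes the likelihood scores can be treated as probabilities; the function names, the 0.2 cutoff, and the 3-day window are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List


def expected_discharges(scores: Dict[str, float]) -> float:
    """Expected number of discharges within the target period.

    Assumes each likelihood score behaves like a probability, so the
    expectation is simply the sum of the scores.
    """
    return sum(scores.values())


def persistently_low(history: Dict[str, List[float]],
                     cutoff: float = 0.2, days: int = 3) -> List[str]:
    """Patients whose last `days` daily scores are all below `cutoff`.

    Both `cutoff` and `days` are illustrative values, not from the source.
    """
    return sorted(
        pid for pid, daily in history.items()
        if len(daily) >= days and all(s < cutoff for s in daily[-days:])
    )


# Toy data: today's scores and each patient's recent score history.
scores_today = {"P1": 0.9, "P2": 0.6, "P3": 0.1}
history = {
    "P1": [0.4, 0.7, 0.9],
    "P2": [0.5, 0.5, 0.6],
    "P3": [0.15, 0.1, 0.1],
}

print(round(expected_discharges(scores_today), 2))
print(persistently_low(history))
```

The flagged list is only a starting point for review by the management team; it says nothing about why a stay is prolonged.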

In summary, an ML-based model may be developed to predict whether a patient hospitalized with CVD will be discharged within 3 days. Based on the model, an individual explainer may be proposed, and a bed management simulation may be presented that includes the discharge probability and the influential features, such as demographics, prescribed medications, and treatments. This can help improve the efficient use of hospital resources and the quality of healthcare services.

However, the technical problems are not limited to those described above, and other technical problems may exist.

Cohort criteria may be set, and data may be extracted from CardioNet, a manually curated database specializing in CVD. The data may be processed by re-indexing the date index, merging current features with past features from the preceding 3 years, and imputing missing values to generate a suitable dataset. ML-based prediction models may be trained and evaluated to arrive at a refined model. The probability of discharge within 3 days may be predicted, and the results may be explained by identifying, quantifying, and visualizing the features of the model.
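The re-indexing and imputation steps can be pictured with a small sketch. It assumes that "re-indexing the date index" means mapping calendar dates to day-of-stay positions, and it uses forward fill for imputation; the text names neither choice, so both are assumptions. The function names and the creatinine values are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional


def reindex_by_stay_day(admission: date,
                        records: Dict[date, Optional[float]],
                        n_days: int) -> List[Optional[float]]:
    """Map calendar-dated measurements onto day-of-stay indices 0..n_days-1."""
    series: List[Optional[float]] = [None] * n_days
    for day, value in records.items():
        idx = (day - admission).days
        if 0 <= idx < n_days:
            series[idx] = value
    return series


def forward_fill(series: List[Optional[float]], default: float) -> List[float]:
    """Fill gaps with the last observed value (assumed imputation rule).

    `default` is used for days before the first observation.
    """
    filled: List[float] = []
    last = default
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


admission = date(2021, 3, 1)
creatinine = {date(2021, 3, 1): 1.1, date(2021, 3, 3): 0.9}

daily = reindex_by_stay_day(admission, creatinine, n_days=4)
print(daily)
print(forward_fill(daily, default=1.0))
```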

An ML-based prediction model may be developed to predict, each day, the probability of discharge within 3 days for each patient with cardiovascular disease and to obtain an individual LOS.
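One way to picture the daily prediction target is to derive a binary label for every inpatient day from the admission and discharge dates. The sketch below is an assumption about how such labels could be built, not the labeling procedure of the disclosure; treating a discharge exactly 3 days ahead as positive is a guess about the boundary, and `daily_labels` is a hypothetical name.

```python
from datetime import date, timedelta
from typing import List, Tuple


def daily_labels(admission: date, discharge: date,
                 target_days: int = 3) -> List[Tuple[date, int]]:
    """Return one (day, label) pair per inpatient day.

    The label is 1 when the discharge falls within `target_days` of that
    day. The inclusive boundary is an assumption.
    """
    labels = []
    day = admission
    while day < discharge:
        remaining = (discharge - day).days
        labels.append((day, 1 if remaining <= target_days else 0))
        day += timedelta(days=1)
    return labels


labels = daily_labels(date(2021, 3, 1), date(2021, 3, 6))
length_of_stay = len(labels)  # number of inpatient days in this toy stay

for day, label in labels:
    print(day.isoformat(), label)
print("LOS (days):", length_of_stay)
```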

A device for predicting patient discharge according to one embodiment may include a processor that, for a patient who is hospitalized at a first time point, applies a machine learning model to input data including medical data collected from the admission date to the first time point during the patient's hospitalization period, thereby obtaining a first likelihood score that the patient will be discharged within a target period from the first time point, and that predicts, based on the obtained first likelihood score, whether the patient will be discharged within the target period from the first time point.

The processor may obtain the first likelihood score by applying the machine learning model to input data including medical data, collected for the patient from the admission date to the first time point, regarding one or a combination of two or more of operation, procedure, Picture Archiving and Communication System (PACS), diagnosis, medication, laboratory, and physical data.

In response to the patient still being hospitalized at a second time point after the first time point, the processor may update the medical data collected during the hospitalization period based on medical data collected from the first time point to the second time point, obtain a second likelihood score that the patient will be discharged within the target period from the second time point by applying the machine learning model to input data including the updated medical data, and predict, based on the obtained second likelihood score, whether the patient will be discharged within the target period from the second time point.
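The first and second time point logic amounts to a loop: fold in the newly collected data, rescore, and threshold. The following minimal sketch assumes the model is any callable that returns a score in [0, 1]. `toy_model`, the feature names, and the 0.5 threshold are hypothetical stand-ins, not the trained model or decision rule of the disclosure.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


def predict_discharge(model: Callable[[Features], float],
                      data: Features,
                      threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (likelihood score, predicted discharge within target period)."""
    score = model(data)
    return score, score >= threshold


def rolling_predictions(model: Callable[[Features], float],
                        updates: List[Features],
                        threshold: float = 0.5) -> List[Tuple[float, bool]]:
    """Re-score the patient at each time point while still hospitalized."""
    accumulated: Features = {}
    results = []
    for update in updates:
        # Fold in the data collected since the previous time point.
        accumulated.update(update)
        results.append(predict_discharge(model, dict(accumulated), threshold))
    return results


def toy_model(x: Features) -> float:
    """Hypothetical scorer: the score rises with days since surgery."""
    return min(1.0, 0.2 * x.get("days_since_surgery", 0.0))


updates = [
    {"days_since_surgery": 1, "fever": 1},
    {"days_since_surgery": 2},
    {"days_since_surgery": 4, "fever": 0},
]
results = rolling_predictions(toy_model, updates)
for step, (score, will_discharge) in enumerate(results, start=1):
    print(step, round(score, 2), will_discharge)
```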

The processor may obtain the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data collected during a predefined period before the hospitalization period.

The processor may obtain the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data, collected for the patient during a predefined period before the hospitalization period, regarding one or a combination of two or more of diagnosis, medication, laboratory, physical data, and length of stay (LOS) in an intensive care unit (ICU).

The processor may select one or more of the features of the collected data as features of the input data of the machine learning model, based on the feature importance of each feature for a temporary machine learning model trained on all features of the collected data, train the machine learning model based on the selected features, and obtain the first likelihood score by applying the machine learning model to the input data including the selected features.
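The two-stage idea (train a temporary model on everything, then keep only the features it found important) can be sketched as a simple filter over an importance table. The selection rule here, a minimum-importance cutoff, is an assumption because the text does not state one; the feature names and importance values are made up for illustration.

```python
from typing import Dict, List


def select_by_importance(importances: Dict[str, float],
                         min_importance: float) -> List[str]:
    """Keep features whose importance in the temporary model meets the cutoff.

    Returns the kept features ordered from most to least important.
    The cutoff rule is an illustrative assumption.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, value in ranked if value >= min_importance]


# Hypothetical importances reported by a temporary model trained on all features.
temporary_model_importances = {
    "troponin": 0.31,
    "age": 0.22,
    "icu_los_past": 0.18,
    "bed_number": 0.01,
    "weekday": 0.004,
}

selected = select_by_importance(temporary_model_importances, min_importance=0.05)
print(selected)
```

The final model would then be retrained using only the columns in `selected`.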

The processor may select one or more features as inputs of the machine learning model by applying recursive feature elimination with cross-validation (RFECV) to the features of the input data.
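The general shape of recursive feature elimination with cross-validation is to drop the least important feature, re-evaluate with cross-validation, and keep the subset with the best score. The library-free sketch below shows that loop. The `cv_score` and `importance` callbacks are hypothetical placeholders for training and evaluating a real model, and the toy scoring functions exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rfe_with_cv(features: Sequence[str],
                cv_score: Callable[[List[str]], float],
                importance: Callable[[List[str]], Dict[str, float]],
                min_features: int = 1) -> Tuple[List[str], float]:
    """Recursively drop the weakest feature; return the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), cv_score(current)
    while len(current) > min_features:
        ranks = importance(current)
        weakest = min(current, key=lambda f: ranks[f])
        current.remove(weakest)
        score = cv_score(current)
        # Prefer the smaller subset when scores tie.
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins: two informative features, two noise features.
USEFUL = {"age": 0.3, "troponin": 0.5}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {f: USEFUL.get(f, 0.0) for f in subset}


def toy_cv_score(subset: List[str]) -> float:
    useful = sum(1 for f in subset if f in USEFUL)
    noise = len(subset) - useful
    return (60 + 10 * useful - noise) / 100


best, score = rfe_with_cv(["age", "troponin", "bed_id", "weekday"],
                          toy_cv_score, toy_importance)
print(best, score)
```

In practice a library implementation such as scikit-learn's `RFECV` would normally replace this hand-written loop.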

The processor may obtain the first likelihood score by applying an extreme gradient boosting (XGB) model to the input data.

The processor may select one or more of the features based on a feature influence, which corresponds to the portion of the obtained first likelihood score attributable to each feature of the input data.

The device for predicting patient discharge according to one embodiment may further include a display that displays the feature influence of the selected one or more features with respect to the obtained first likelihood score.
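A minimal sketch of how feature influences could be selected and laid out for display follows. It assumes the influences are additive contributions on top of a base score, which is what a waterfall-style chart implies but is not stated as the computation here. The base score, feature names, and values are hypothetical.

```python
from typing import Dict, List, Tuple


def top_influences(influences: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Pick the k features with the largest absolute influence on the score."""
    ranked = sorted(influences.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


def waterfall_steps(base: float,
                    ordered: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Turn additive influences into (feature, start, end) bars for a chart."""
    steps = []
    running = base
    for name, delta in ordered:
        start = running
        running += delta
        steps.append((name, start, running))
    return steps


# Hypothetical base score and per-feature influences for one patient-day.
base_score = 0.30
influences = {"echo_done": 0.25, "iv_antibiotics": -0.10, "age": 0.02}

selected = top_influences(influences, k=2)
steps = waterfall_steps(base_score, selected)
for name, start, end in steps:
    print(f"{name}: {start:.2f} -> {end:.2f}")
```

Each tuple gives the start and end of one bar, so a display only needs to draw the bars in order.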

A method for predicting patient discharge according to one embodiment may include: for a patient who is hospitalized at a first time point, obtaining a first likelihood score that the patient will be discharged within a target period from the first time point by applying a machine learning model to input data including medical data collected from the admission date to the first time point during the patient's hospitalization period; and predicting, based on the obtained first likelihood score, whether the patient will be discharged within the target period from the first time point.

Obtaining the first likelihood score may include obtaining the first likelihood score by applying the machine learning model to input data including medical data, collected for the patient from the admission date to the first time point, regarding one or a combination of two or more of operation, procedure, Picture Archiving and Communication System (PACS), diagnosis, medication, laboratory, and physical data.

The method for predicting patient discharge according to one embodiment may further include: in response to the patient still being hospitalized at a second time point after the first time point, updating the medical data collected during the hospitalization period based on medical data collected from the first time point to the second time point; obtaining a second likelihood score that the patient will be discharged within the target period from the second time point by applying the machine learning model to input data including the updated medical data; and predicting, based on the obtained second likelihood score, whether the patient will be discharged within the target period from the second time point.

Obtaining the first likelihood score may include obtaining the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data collected during a predefined period before the hospitalization period.

Obtaining the first likelihood score may include obtaining the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data, collected for the patient during a predefined period before the hospitalization period, regarding one or a combination of two or more of diagnosis, medication, laboratory, physical data, and length of stay (LOS) in an intensive care unit (ICU).

The method for predicting patient discharge according to one embodiment may further include: selecting one or more of the features of the collected data as features of the input data of the machine learning model, based on the feature importance of each feature for a temporary machine learning model trained on all features of the collected data; and training the machine learning model based on the selected features. Obtaining the first likelihood score may include obtaining the first likelihood score by applying the machine learning model to the input data including the selected features.

Selecting the one or more features as features of the input of the machine learning model may include selecting one or more features as inputs of the machine learning model by applying recursive feature elimination with cross-validation (RFECV) to the features of the input data.

Obtaining the first likelihood score may include obtaining the first likelihood score by applying an extreme gradient boosting (XGBoost) model to the input data.

The method for predicting patient discharge according to one embodiment may further include selecting one or more of the features based on a feature influence, which corresponds to the portion of the obtained first likelihood score attributable to each feature of the input data.

The method for predicting patient discharge according to one embodiment may further include displaying the feature influence of the selected one or more features with respect to the obtained first likelihood score.

The method for predicting patient discharge according to one embodiment may further include displaying likelihood scores for a plurality of time points during the hospitalization period.

A computer program according to one embodiment may be combined with hardware and stored in a computer-readable recording medium in order to execute any one of the methods described above.

Five ML-based models may be tested using five-fold cross-validation. Extreme Gradient Boosting (XGB), selected as the final model, may achieve an average AUROC (area under the receiver operating characteristic curve) of 0.865, higher than the other models (e.g., logistic regression, random forest, support vector machine, and multilayer perceptron). Feature reduction (e.g., feature selection) may be performed, feature importance may be presented, and the prediction results may be evaluated. One of the outputs, the individual explainer, can provide medical staff and patients with discharge scores during hospitalization and daily feature influence scores. To make use of the results, simulated bed management may be visualized.
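To make the evaluation protocol concrete, the sketch below builds five folds and averages AUROC over them. AUROC is computed from its pairwise definition: the fraction of positive-negative pairs in which the positive is scored higher, with ties counted as one half. The labels and scores are made-up numbers standing in for each fold's model output; no model is trained here, and the interleaved fold assignment is an illustrative choice.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k (train, test) pairs using interleaved folds."""
    splits = []
    for i in range(k):
        test_idx = list(range(i, n_samples, k))
        held_out = set(test_idx)
        train_idx = [j for j in range(n_samples) if j not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and held-out scores for ten patient-days.
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
scores = [0.2, 0.9, 0.4, 0.3, 0.1, 0.8, 0.6, 0.7, 0.3, 0.9]

fold_aucs = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # A real run would fit a model on train_idx before scoring test_idx.
    fold_aucs.append(auroc([labels[i] for i in test_idx],
                           [scores[i] for i in test_idx]))

mean_auc = sum(fold_aucs) / len(fold_aucs)
print(fold_aucs, round(mean_auc, 3))
```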

The present invention may propose an individual explainer based on the developed ML-based prediction model, which provides the discharge probability and the relative contribution of each feature. The device and method according to the present invention can help medical teams and patients identify individual and common risk factors for CVD, and can help hospital administrators improve the management of beds and other resources.

Figure 1 illustrates the operation of a device for predicting patient discharge according to one embodiment.
Figure 2 illustrates the overall flow of a method for predicting patient discharge according to one embodiment.
Figure 3 illustrates medical data according to one embodiment.
Figure 4 illustrates a preprocessing process for obtaining medical data from raw medical data according to one embodiment.
Figure 5 illustrates labeling of medical data according to one embodiment.
Figure 6 illustrates a processor obtaining a likelihood score and predicting whether a patient will be discharged according to one embodiment.
Figure 7 illustrates cross-validation performed in training a machine learning model according to one embodiment.
Figure 8 shows ROC curves for comparing the performance of a plurality of machine learning models according to one embodiment.
Figure 9 illustrates selecting features to be applied to a machine learning model based on feature importance according to one embodiment.
Figure 10 shows the performance of machine learning models on input data including the selected features according to one embodiment.
Figure 11 shows a waterfall chart representing feature influence according to one embodiment.
Figure 12 shows likelihood scores predicted at a plurality of time points during a patient's hospitalization period according to one embodiment.
Figure 13 shows the simulated impact on bed management of applying the prediction model and the individual explainer according to one embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, the actual implementation is not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and alternatives that fall within the technical idea described by the embodiments.

Although terms such as first or second may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is described as being "connected" to another component, it should be understood that the component may be directly connected or coupled to the other component, or that intervening components may be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" should be understood to specify the presence of a described feature, number, step, operation, component, part, or combination thereof, and not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined herein.

Hereinafter, embodiments are described in detail with reference to the attached drawings. In the description with reference to the attached drawings, identical components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted.

FIG. 1 illustrates the operation of a device for predicting patient discharge according to one embodiment. The device (100) for predicting patient discharge may include a processor (110) and a display (120).

The processor (110) may obtain a likelihood score (also referred to in this specification as a discharge probability or likelihood) that a patient will be discharged, using a machine learning model. The likelihood score may represent the likelihood that the patient will be discharged within a target period (e.g., 3 days) from the prediction time point. Based on the obtained likelihood score, the processor (110) may predict whether the patient will be discharged within the target period from the prediction time point. The operation of the processor in obtaining the likelihood score and predicting discharge using the machine learning model is described in detail below with reference to FIG. 6.

The display (120) may display whether the patient is predicted to be discharged. The display (120) may display the likelihood scores predicted at each of a plurality of time points, over time, in a graph (121). In addition, the display (120) may display feature influence, which indicates the degree to which each feature affects the likelihood score, as a waterfall chart (122). The operation of the display (120) is described in detail below with reference to FIGS. 11 to 13.

In this specification, a feature (e.g., a medical feature) may represent an individual item and/or category into which information related to a patient's medical condition is classified. The medical features may include one, or a combination of two or more, of an operation feature, a procedure feature, a picture archiving and communication system (PACS) feature, a diagnosis feature, a medication feature, a laboratory feature, a physical information feature, and an intensive care unit length-of-stay feature. Each of these medical features may include a plurality of sub-features. The individual medical features are described below with reference to FIG. 3.

FIG. 2 illustrates the overall flow of a method for predicting patient discharge according to one embodiment. In data processing, cohort criteria may be set and an appropriate data set may be generated by processing the data. In AI model evaluation, a refined model may be identified by training and evaluating machine learning-based prediction models (e.g., machine learning models). In prediction and explanation, the probability of discharge within the target period (e.g., 3 days) may be predicted, and the results of the model may be explained by identifying, quantifying, and visualizing features.

FIG. 3 illustrates medical data according to one embodiment.

The medical data (310) according to one embodiment is data in a format that can be input to the machine learning model, and may be obtained by preprocessing raw medical data. For reference, the medical data (310) may itself be the input format of the machine learning model, or, as described below with reference to FIG. 9, data including only some of the features of the medical data may be the input format of the machine learning model.

The medical data (310) may include first partial medical data including past medical features (311) and second partial medical data including current medical features (312).

The past medical features (e.g., the past features (311) shown in FIG. 3) may include features representing information about the patient's medical care from before the patient's hospitalization. For example, the past medical features may be features obtained from raw medical data collected before the patient's admission date.

The first partial medical data (e.g., past partial medical data), collected during a predefined period before the hospitalization period, may include medical information corresponding to the past medical features described above. For example, the first partial medical data may include the patient's past medical features from a past period (e.g., the three years before the admission date). By way of example, the first partial medical data may include data corresponding to past medical features for one, or a combination of two or more, of diagnosis, medication, laboratory, physical information, and length of stay in the intensive care unit (LOS of ICU).

The current medical features (e.g., the present features (312) shown in FIG. 3) may include features representing information from the admission date onward. For example, the current medical features may be features obtained from raw medical data collected during the hospitalization period.

The second partial medical data (e.g., current partial medical data), collected during the hospitalization period, is data including medical information from the patient's admission date onward, and may include medical information corresponding to the current medical features. By way of example, the second partial medical data may include data corresponding to current medical features for one, or a combination of two or more, of operation, procedure, PACS, diagnosis, medication, laboratory, and physical information.

The second partial medical data may be updated at a regular interval (e.g., daily) until the patient is discharged. For example, the device for predicting patient discharge may collect additional medical information at each predetermined interval. The additional medical information is information collected after the patient's admission and may include, for example, a diagnosis the patient receives during the hospitalization period. Based on this additional medical information, the device for predicting patient discharge may update the current medical features of the second partial medical data within the medical data.

The device for predicting patient discharge may generate the medical data described above by combining the current partial medical data and the past partial medical data.

For reference, the hospitalization period is the continuous period of hospitalization, up to the present, of a currently hospitalized patient; for a patient who has been hospitalized multiple times, it may represent the period starting from the most recent admission date. For example, if a patient is admitted on a first date, discharged on a second date, and readmitted on a third date, the hospitalization period between the first and second dates and the hospitalization period from the third date onward are two non-continuous (e.g., separate) hospitalization periods. When discharge is predicted at a time point after the third date, the patient's hospitalization period represents the period from the third date, which is the most recent admission date. The period between the first and second dates is then included not in the hospitalization period but in the period before the hospitalization period.

FIG. 4 illustrates a preprocessing process for obtaining medical data from raw medical data according to one embodiment.

Raw medical data is data containing medical information about patients collected by one or more entities, and may represent the collected medical information before any preprocessing is applied.

For example, the raw medical data may be extracted from data of CardioNet, Inc., a manually curated electronic health record (EHR) database specialized in cardiovascular disease (CVD). By way of example, CardioNet may consist of 572,811 patients who visited Asan Medical Center (AMC) in Seoul for CVD between January 1, 2000 and December 31, 2016. The collection of CardioNet may be approved by the AMC institutional review board where informed consent is waived. CardioNet may include 27 tables, such as visit, demographic, diagnosis, medication, and laboratory tables. Most tables in CardioNet may share common variables such as the patient's identification (PAID), the patient's encounter number (INNO), the visit or admission date (INDT), and the discharge date (OUDT). A KEY column, formed by concatenating PAID and INNO, may link the visit table to the other tables. Through KEY, the variables to be analyzed may be extracted from each table.
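The KEY-based linking described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the embodiment: the column names PAID, INNO, INDT, OUDT, and KEY come from the text, while the sample values, the `DICD` diagnosis-code column, and the separator-free concatenation in `make_key` are assumptions.

```python
# Minimal sketch: link diagnosis rows to visit rows through KEY = PAID + INNO.
# All values below are made up; "DICD" is an assumed column name.

def make_key(paid: str, inno: str) -> str:
    """Concatenate patient ID and encounter number into a KEY (assumed: no separator)."""
    return f"{paid}{inno}"

visits = [
    {"PAID": "P001", "INNO": "0001", "INDT": "2016-03-01", "OUDT": "2016-03-08"},
    {"PAID": "P001", "INNO": "0002", "INDT": "2016-09-10", "OUDT": "2016-09-15"},
]
diagnoses = [
    {"PAID": "P001", "INNO": "0001", "DICD": "I210"},
    {"PAID": "P001", "INNO": "0002", "DICD": "I500"},
    {"PAID": "P001", "INNO": "0002", "DICD": "E119"},
]

# Index the visit table by KEY, then attach each diagnosis row to its visit.
visit_by_key = {make_key(v["PAID"], v["INNO"]): v for v in visits}
codes_by_key = {key: [] for key in visit_by_key}
for row in diagnoses:
    codes_by_key[make_key(row["PAID"], row["INNO"])].append(row["DICD"])
```

Because KEY identifies one admission of one patient, rows from two admissions of the same patient stay separate even though they share a PAID.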

From the 572,811 patients in CardioNet, 84,251 records of 63,261 anonymized patients admitted to Cardiology or Thoracic Surgery may be obtained. Furthermore, to develop a practical and usable model, the focus may be placed on predicting discharge within the target period (e.g., 3 days) and on detecting long-stay patients. Long-stay patients hospitalized for 30 days or more may be managed separately by Asan Medical Center (AMC). Accordingly, the length of stay may be restricted to between 3 and 30 days.

The data extracted from CardioNet may include the following variables across a plurality of tables:

- Visit table: PAID, INNO, KEY, INDT, OUDT, visit type, department, and length of stay (LOS) in the intensive care unit (ICU)

- Diagnosis table: International Classification of Diseases, 10th revision (ICD-10) diagnosis codes

- Laboratory table: date and code of the pathology examination, and examination results

- Physical information table: patient's age, height, weight, systolic and diastolic blood pressure, respiratory rate, pulse rate, body mass index, body surface area, and measurement date

- Medication table: date and code of the prescription

- Procedure table: date and code of the order

- Operation table: date and code of the surgery or treatment

- Picture Archiving and Communication System (PACS) table: date and code of the order

- Transfusion table: date and code of the order

For reference, the ICUs are as follows: ACU (Acute Care Unit), CCU (Coronary Care Unit), CSICU (Cardiac Surgery ICU), MICU (Medical ICU), NICU (Neonatal ICU), NRICU (Neurological ICU), NSICU (Neurosurgical ICU), PICU (Pediatric ICU), and SICU (Surgical ICU).

The visit table and the other tables of the raw medical data according to one embodiment may contain only one item of information per row, which may make it difficult for an ML model to learn from all the data at once. Accordingly, the device may obtain the features of a new data set (e.g., the medical data) by performing preprocessing that includes one-hot encoding (OHE) of clinically important orders and codes. Through this preprocessing, the device may access records aggregated by date for each patient.

For reference, as described above with reference to FIG. 3, the diagnosis, medication, laboratory, and physical information tables may be used for both past features and current features. For example, in the case of the diagnosis table, the information in the table from the admission date onward (e.g., the hospitalization period) and the information from a predefined period before the hospitalization period (e.g., the three years before the admission date) may be used to generate the current features (or current partial medical data) and the past features (or past partial medical data), respectively. The operation, procedure, and PACS tables may be used for the current features. The LOS of ICU may be used for the past features.

In step (410), the device for predicting patient discharge may select the most frequent codes in the raw data (select top frequent codes). For example, if the prescription code variable in the medication table can take too many distinct values, training and/or inference with a machine learning model that distinguishes every one of those values may be inefficient. To limit the number of values a code can take to a predetermined number or fewer, the device may select the most frequent codes and change all remaining codes into a single code (e.g., a code indicating "other").

For example, for the diagnosis and operation tables, every ICD-10 code and operation code may be truncated after the third character to convert it into a three-character code, because the characters from the fourth position onward represent a sub-level of the three-character code. The frequencies of all resulting values may be sorted in descending order, and the first 99 codes may be selected. The remaining (unselected) codes may be transformed into an "other" feature.
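The truncate-and-select step can be illustrated with a short sketch. It assumes codes are stored as plain strings without a dot separator; the function name `top_codes`, the placeholder label `Z_OTHER`, and the sample codes are hypothetical. The text keeps 99 codes; the usage line keeps 2 only so the effect is visible on a small input.

```python
from collections import Counter

def top_codes(raw_codes, keep=99, width=3, other="Z_OTHER"):
    """Truncate each code to `width` characters, keep the `keep` most
    frequent truncated codes, and map every other code to `other`."""
    sliced = [code[:width] for code in raw_codes]
    kept = {code for code, _ in Counter(sliced).most_common(keep)}
    mapped = [code if code in kept else other for code in sliced]
    return mapped, kept

# Toy input (made-up codes): two codes under I21, two under I50, two rare ones.
raw = ["I2101", "I2109", "I500", "I509", "E119", "J189"]
mapped, kept = top_codes(raw, keep=2)
```

Grouping the rare codes into one bucket caps the number of one-hot columns at `keep + 1` per table, which is the 100 columns referred to in the following steps.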

In step (420), the device may perform one-hot encoding (OHE). For example, for the diagnosis and operation tables, OHE may be performed on all 100 codes (the 99 selected codes and "other"). Features of the form "Z_code", such as "Z_DICD" and "Z_OPCD", may refer to the "other" category of each original table.

In step (430), the device may fill values. For example, for the diagnosis and operation tables, a total of 100 codes may be obtained for each table, and the value at each date index may be filled with 1 if valid prescribed or ordered data exist for that date, and with 0 otherwise.
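Steps (420) and (430) together produce, for each admission, one row per day and one 0/1 column per code. A minimal sketch under stated assumptions: the function name `fill_indicator_matrix`, the three-column layout, and the sample orders are hypothetical, and "Z_DICD" is used as the "other" column following the naming in the text.

```python
from datetime import date, timedelta

def fill_indicator_matrix(admit, discharge, orders, columns, other="Z_DICD"):
    """Return {day: {column: 0 or 1}} for every day from admit to discharge.

    `orders` is an iterable of (day, code); codes that are not among
    `columns` are counted under the `other` column.
    """
    n_days = (discharge - admit).days + 1
    days = [admit + timedelta(days=i) for i in range(n_days)]
    matrix = {day: {col: 0 for col in columns} for day in days}
    for day, code in orders:
        if day not in matrix:
            continue  # order falls outside this admission
        col = code if code in matrix[day] else other
        matrix[day][col] = 1
    return matrix

columns = ["I21", "I50", "Z_DICD"]  # in the text: 99 codes plus "other"
orders = [(date(2021, 2, 1), "I21"), (date(2021, 2, 3), "E11")]
m = fill_indicator_matrix(date(2021, 2, 1), date(2021, 2, 3), orders, columns)
```

Days with no order keep an all-zero row, so every day of the stay is represented even when nothing was recorded.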

In step (440), the device may perform imputation for missing values.

For example, in tables other than the laboratory, physical information, and date-related features, null values may be replaced with 0. The values of most of these other features are computed as frequencies, so they are either null or integers.

As another example, to handle missing values in the continuous features of the laboratory and physical information tables, the data set may first be split by KEY so that individual admissions are not mixed. One KEY refers to one admission of one patient. Null values may first be filled in chronological order (e.g., from past to present). Then, to handle cases where a result was not measured at the beginning of the admission, the remaining null values may be filled in reverse chronological order (e.g., from present to past). In this way, null values may be imputed separately for each admission of each patient. Finally, to fill values for features that were not ordered or measured at all, the remaining null values may be filled with the most frequent value of each feature.
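The three imputation passes (forward fill, backward fill, most frequent value) can be sketched for a single continuous feature as follows. The helper names and sample values are hypothetical, and computing the most frequent value over the originally observed values of all admissions is an assumption; the text does not say which values the mode is taken over.

```python
from collections import Counter

def ffill(values):
    """Carry the last observed value forward; leading Nones stay None."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def bfill(values):
    """Carry the next observed value backward (forward fill on the reversed list)."""
    return ffill(values[::-1])[::-1]

def impute(series_by_key):
    """series_by_key: {KEY: [value or None per day, in chronological order]}."""
    # Passes 1 and 2 run per KEY, so separate admissions never mix.
    filled = {k: bfill(ffill(v)) for k, v in series_by_key.items()}
    # Pass 3: admissions with no measurement at all get the most frequent value.
    observed = [x for v in series_by_key.values() for x in v if x is not None]
    if observed:
        mode = Counter(observed).most_common(1)[0][0]
        filled = {k: [mode if x is None else x for x in v] for k, v in filled.items()}
    return filled

result = impute({
    "A": [None, 7.1, None, 6.8],  # not measured on the first day
    "B": [None, None, None],      # never measured during this admission
    "C": [7.1, None],
})
```

Running the forward pass first means a gap in the middle of a stay takes the most recent earlier measurement, and only the days before the first measurement borrow from a later one.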

Although the above description mainly uses the diagnosis and operation tables as examples, similar preprocessing may be applied to the other tables. For example, as with the diagnosis and operation tables, the values of the PACS table may be converted into 100 features. As another example, as with the diagnosis table, the 99 most frequent codes and "other" may be obtained from the medication and procedure tables through OHE, and the corresponding data may be filled in. As yet another example, from the laboratory table, the 60 most frequent examination codes, each performed on at least 50% of all patients, may be selected. OHE of these codes may be performed, and the values may be filled with the result of each examination. If a patient underwent the same examination several times in one day, the data set may be filled with the average of the results.
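The averaging of repeated same-day examinations can be sketched as below. The tuple layout, the function name `daily_mean`, and the lab code "CREA" with its values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def daily_mean(results):
    """results: iterable of (KEY, day, lab_code, value).
    Returns {(KEY, day, lab_code): mean of that day's values}."""
    grouped = defaultdict(list)
    for key, day, code, value in results:
        grouped[(key, day, code)].append(value)
    return {group: mean(values) for group, values in grouped.items()}

results = [
    ("K1", "2021-02-01", "CREA", 1.0),  # two measurements on the same day
    ("K1", "2021-02-01", "CREA", 2.0),
    ("K1", "2021-02-02", "CREA", 1.5),
]
daily = daily_mean(results)
```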

However, the preprocessing is not limited to the above, and some steps may be omitted or added for individual tables.

The device for predicting patient discharge according to one embodiment may omit, for some tables, the step (410) of selecting frequent codes from the preprocessing process. For example, in the case of the transfusion table, all 27 available codes may be used. Considering the severity of each patient's disease, the values may be filled with the number prescribed per day or at one time. As another example, the physical information table has 10 codes, all of which may be used.

A device for predicting patient discharge according to another embodiment may perform additional steps in the preprocessing process together with steps (410) to (440).

Although omitted from FIG. 4, the device for predicting patient discharge may merge and join a plurality of tables. The primary table (e.g., the visit table) of the raw medical data (e.g., CardioNet) according to one embodiment may contain a plurality of main columns (e.g., PAID, INNO, INDT, OUDT) and variables related to the visit. Each row may represent a single admission of a patient. The index may be reset to create a new data set format in which each day between admission and discharge is a date index. For example, a row whose INDT is 2021.02.01 and whose OUDT is 2021.02.10 has an LOS of 10 days, so that one row of the visit table may be converted into 10 rows with 10 date indices. After all values corresponding to the PAID, INNO, and date indices of the other tables have been preprocessed, the tables may be merged and joined to create a new data set for model training.
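The conversion of one visit row into one row per day can be sketched as follows, using the dates from the example in the text. The function name `expand_visit`, the `date_index` field name, and the PAID and INNO values are hypothetical.

```python
from datetime import date, timedelta

def expand_visit(paid, inno, indt, oudt):
    """Turn one admission into one row per day from INDT through OUDT, inclusive."""
    n_days = (oudt - indt).days + 1
    return [
        {"PAID": paid, "INNO": inno, "date_index": indt + timedelta(days=i)}
        for i in range(n_days)
    ]

# INDT 2021.02.01 and OUDT 2021.02.10 give an LOS of 10 days, hence 10 rows.
rows = expand_visit("P001", "0001", date(2021, 2, 1), date(2021, 2, 10))
```

Each resulting row can then be matched against the other tables on PAID, INNO, and the date index.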

In addition, the device for predicting patient discharge may remove features or generate additional features. For example, after creating the new data set, the device may remove OUDT, which contains future information. As another example, a total of 10 date-related features may be generated so that the temporal information of a date can be distinguished and recognized by type. INDT and the date index may each be split into integer features such as year, month, day, and day of the week. A further feature may indicate whether the date index is a public holiday, and another may represent the LOS at the date index, obtained by subtracting INDT from the date index.
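One way to arrive at 10 date-related features is four integer parts for INDT, four for the date index, a holiday flag, and the LOS so far. The sketch below follows that reading; the feature names, the Monday = 0 weekday convention, and the holiday set passed in are assumptions rather than details given in the text.

```python
from datetime import date

def date_features(indt, day, holidays=frozenset()):
    """Build 10 date-related features for one date index of one admission."""
    feats = {}
    for prefix, d in (("INDT", indt), ("DATE", day)):
        feats[f"{prefix}_year"] = d.year
        feats[f"{prefix}_month"] = d.month
        feats[f"{prefix}_day"] = d.day
        feats[f"{prefix}_weekday"] = d.weekday()  # Monday = 0 ... Sunday = 6
    feats["is_holiday"] = int(day in holidays)    # caller supplies the holiday calendar
    feats["los_so_far"] = (day - indt).days       # date index minus INDT
    return feats

f = date_features(date(2021, 2, 1), date(2021, 2, 11), holidays={date(2021, 2, 11)})
```

OUDT is deliberately not an input here, in line with its removal as future information.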

Although the above description mainly uses the preprocessing for obtaining the current features of the medical data as an example, the preprocessing for obtaining past features from the raw medical data may be performed in a similar manner. So that the ML model can learn the data in depth, the features of the medical data may include, along with the day-by-day features (e.g., the current medical features), past medical features concerning the patient's medical history. With the date index of each admission starting at the admission date (INDT), some past medical features may be obtained from key information in the hospital visit records of the three years preceding the admission date.

For example, the device for predicting patient discharge may obtain the past medical features from raw medical data collected before the hospitalization period. As with the current medical features, OHE may be performed on the past medical features and their values may be filled in. The past medical features of the medical data may be filled with either a summed value or the most recent value for each feature. For example, the lengths of stay in each intensive care unit in the visit table may be summed. For the 100 diagnosis codes, if past diagnosis records exist, the values may be summed. For the 100 medication codes, if records exist, the numbers prescribed per day or at one time may be summed. As another example, the physical information and the most recent laboratory results within the three years may be used for a total of 70 codes.
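The sum-or-latest aggregation of past features can be sketched as below. The function name, the `past_` prefix, the sample history, and the approximation of three years as 3 × 365 days are all assumptions; "I21" stands in for a count-type diagnosis feature and "CREA" for a most-recent-value laboratory feature.

```python
from datetime import date, timedelta

def past_features(indt, history, sum_features, latest_features, window_days=3 * 365):
    """Aggregate pre-admission records from the window [indt - window_days, indt).

    `history` is an iterable of (day, feature_name, value). Features in
    `sum_features` are summed; features in `latest_features` keep the value
    of the most recent record in the window.
    """
    start = indt - timedelta(days=window_days)
    out = {f"past_{name}": 0 for name in sum_features}
    latest_day = {}
    for day, name, value in history:
        if not (start <= day < indt):
            continue  # outside the look-back window
        if name in sum_features:
            out[f"past_{name}"] += value
        elif name in latest_features:
            if name not in latest_day or day > latest_day[name]:
                latest_day[name] = day
                out[f"past_{name}"] = value
    return out

history = [
    (date(2017, 1, 1), "I21", 1),     # older than the window, ignored
    (date(2019, 1, 1), "I21", 1),
    (date(2020, 5, 1), "I21", 1),
    (date(2020, 6, 1), "CREA", 0.9),  # most recent laboratory result
    (date(2020, 1, 1), "CREA", 1.2),
]
past = past_features(date(2021, 2, 1), history, {"I21"}, {"CREA"})
```

A most-recent-value feature with no record in the window is simply absent from the result, leaving it to the imputation step.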

FIG. 5 illustrates labeling of medical data according to one embodiment.

A supervised learning algorithm for classification may require labels of True or False to indicate the correct answer. The target criteria for labeling a sample as True are shown in FIG. 5.

In FIG. 5, Day 1 may be the admission date (INDT), Day N may be the discharge date (OUDT), and each circle may represent one day of the hospitalization period. Because information such as the discharge procedure could give hints to the ML model, Day N (e.g., the discharge date) may be excluded from the data set. Although discharge prediction may be highly accurate from two days before discharge through the discharge date, in actual use of the model it may be useful to predict discharge ahead of time by the target period (e.g., 3 days). Accordingly, the dates from 3 days before to 1 day before the discharge date (OUDT) may be labeled 1 (e.g., true or positive), and the dates from the admission date (INDT) to 4 days before the discharge date (OUDT) may be labeled 0 (e.g., false or negative).
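The labeling rule can be written as a short function. This is a sketch only: the function name `label_days` and the 1-based day numbering (Day 1 = INDT, Day N = OUDT) follow FIG. 5 as described in the text, and a 3-day target period is assumed by default.

```python
def label_days(n_days, target=3):
    """Labels for Day 1 .. Day N-1 of a stay of `n_days` days (Day N is excluded).

    A day is labeled 1 when discharge is at most `target` days away, else 0.
    """
    labels = {}
    for day in range(1, n_days):  # Day N itself is left out of the data set
        days_to_discharge = n_days - day
        labels[day] = 1 if days_to_discharge <= target else 0
    return labels

labels = label_days(10)
```

For a 10-day stay this yields six negative days (Days 1 to 6) and three positive days (Days 7 to 9).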

In the medical data according to one embodiment, the various variables of the original tables (e.g., the raw medical data) may be converted into 10 date-related features, 597 current features, and 279 past features, for a total of 886 features. From 84,251 records of 63,261 inpatients with CVD, medical data of 669,667 rows with 886 features may be generated, the features covering diagnosis codes, laboratory test results, physical information, medications, procedures, operations, PACS, and transfusions. The patients may have been admitted to Cardiology or Thoracic Surgery, with an LOS between 3 and 30 days. The mean age of the patients may be 61.03 years, with a standard deviation of 13.42 years. The medical data may consist of 38% female (e.g., 254,254 rows) and 62% male (e.g., 415,413 rows) records.

FIG. 6 illustrates obtaining a likelihood score and predicting whether a patient will be discharged, as performed by a processor according to an embodiment.

In step 610, the processor may obtain medical data by preprocessing raw medical data.

As described above with reference to FIG. 3, the medical data may include, depending on when it was collected, partial medical data collected during the hospitalization period (a1) (e.g., second partial medical data or current partial medical data) and partial medical data collected during a predefined period (a2) before the hospitalization period (e.g., first partial medical data or past partial medical data).

In step 620, the processor may obtain a first likelihood score by applying a machine learning model to the input data.

The input data may include at least a portion of the medical data. The medical data may include a plurality of features. The processor may select one or more of the features of the medical data as features of the input data. The selection of features of the input data is described in detail with reference to FIG. 9 below.

A likelihood score may represent the likelihood that the patient will be discharged within a predefined target period from the prediction time point. For example, the first likelihood score is a score representing the patient's likelihood of discharge as predicted at the first time point (d1), and may represent the likelihood that the patient will be discharged within the target period (p1) from the first time point.

For reference, the prediction of a patient's discharge by the processor may be performed repeatedly according to a predefined cycle. The discharge prediction cycle according to an embodiment may be the same as the medical data update cycle described below. In this specification, both the discharge prediction cycle and the medical data update cycle are mainly described as being one day, but they are not limited thereto. The discharge prediction cycle according to another embodiment may be longer than the medical data update cycle or may be a multiple of the medical data update cycle.

In step 630, the processor may predict whether the patient will be discharged based on the patient's first likelihood score. Specifically, the prediction based on the first likelihood score may indicate whether the patient will be discharged within the target period from the first time point. For example, the processor may predict whether the patient will be discharged by comparing the first likelihood score with a threshold score.

In step 640, in response to the patient still being hospitalized at a second time point, the processor may update the medical data collected during the hospitalization period. The second time point may be the time point at which one medical data update cycle has elapsed since the first time point. However, it is not limited thereto, and the second time point may be a time point at which one or more cycles have elapsed since the first time point. Because of additional medical information generated between the first time point (d1) and the second time point (d2), a portion of the medical data (e.g., the current partial medical data or the current features of the medical data) may be changed (e.g., updated).

In step 650, the processor may obtain a second likelihood score by applying the machine learning model to the updated input data. The second likelihood score may represent the likelihood that the patient will be discharged within the target period from the second time point. For the machine learning model to output the second likelihood score, the input data may include at least a portion of the medical data acquired from the admission date to the second time point during the hospitalization period. The second likelihood score at the second time point may thus be output based on input data that includes the medical data collected up to the second time point during the hospitalization period.

In step 660, the processor may predict whether the patient will be discharged based on the patient's second likelihood score. Specifically, the prediction based on the second likelihood score may indicate whether the patient will be discharged within the target period (p2) from the second time point (d2).
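The repeated cycle of steps 620 to 660 can be sketched as a simple loop. This is an illustration under stated assumptions: `predict_daily`, `score_fn`, and the toy scoring rule are hypothetical placeholders for the trained machine learning model, and treating a score equal to the threshold as a positive prediction is an assumption, since the specification only states that the score is compared with a threshold score.

```python
def predict_daily(daily_records, score_fn, threshold):
    """Re-score the accumulated medical data once per update cycle."""
    collected = []   # medical data collected since the admission date
    results = []
    for day, new_records in enumerate(daily_records, start=1):
        collected.extend(new_records)            # update the medical data (step 640)
        score = score_fn(collected)              # likelihood score (steps 620 / 650)
        predicted_discharge = score >= threshold  # prediction (steps 630 / 660)
        results.append((day, score, predicted_discharge))
    return results


# Hypothetical stand-in for a trained model: the score grows with the
# number of collected records. A real model would use selected features.
def toy_score(records):
    return min(1.0, len(records) / 10)


results = predict_daily([["a"], ["b", "c"], ["d", "e", "f"]], toy_score, threshold=0.39)
for day, score, predicted in results:
    print(day, score, predicted)
```

Each iteration uses all data from the admission date up to that day, so a later prediction always sees a superset of the data used by an earlier one.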

FIG. 7 illustrates cross-validation performed in training a machine learning model according to an embodiment.

The training data may be labeled as either discharge or hospitalization. Discharge may be assigned the positive label (e.g., 1) and hospitalization the negative label (e.g., 0). To evaluate and compare the performance of the models, metrics including accuracy, sensitivity (or recall for the positive class), specificity, precision, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), and true positive rate (TPR) may be used. When monitoring model training and validation, the F1-score may be used to account for the imbalanced targets, the receiver operating characteristic (ROC) curve may be used to find the optimal threshold, and the area under the ROC curve (AUROC) score may be used for comparison.
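For reference, the listed metrics can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions (precision and PPV coincide, as do sensitivity and TPR); the function name and the toy labels are illustrative only, and the code assumes every denominator is non-zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn)   # recall, identical to TPR
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # identical to PPV
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),     # equals 1 - specificity
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Toy example: 1 = discharge within the target period, 0 = hospitalization.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```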

To prevent overfitting of the ML-based models and reduce biased results, stratified 5-fold cross-validation may be performed as shown in FIG. 7. The 63,261 PAIDs may be randomly shuffled and split into 5 groups of about 12,000 people each. Splitting by PAID may be done so that the records of a single patient are not divided between the training set (e.g., the dotted boxes in FIG. 7) and the test set (e.g., the diagonally hatched boxes in FIG. 7). The first group may become the test set and the remaining groups the training set of fold 1. Folds 1 to 5 may be generated in a similar manner so as to ensure the same split of the imbalanced targets (e.g., the true labels of the data set consist of 62.4% label 0 and 37.6% label 1 in every fold). 25% of the training set may be split off as a validation set for tuning hyperparameters. As a result, in each fold the data set may be split into about 133,000 rows for the test set and 535,000 rows for the training set (e.g., including the validation set). The ML-based models may be trained and tested on all five folds.
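The patient-level split described above can be sketched as follows. This is a simplified illustration: it shuffles unique patient IDs and deals them into five groups so that no patient appears in both the training and the test set, but it does not enforce the label stratification mentioned in the text. The function names and the toy rows are hypothetical.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Shuffle the unique patient IDs and deal them into n_folds groups."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    return [set(unique_ids[i::n_folds]) for i in range(n_folds)]


def split_rows(rows, test_ids):
    """Split rows (patient_id, day) so each patient lands on one side only."""
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test


# Toy data: 10 patients with 3 daily rows each.
rows = [(pid, day) for pid in range(1, 11) for day in range(1, 4)]
folds = patient_level_folds([row[0] for row in rows])

# Fold 1: the first group is the test set, the rest is the training set.
train, test = split_rows(rows, folds[0])
print(len(train), len(test))
```

Splitting on patient IDs rather than on rows matters here because one patient contributes many daily rows, and rows from the same stay are highly correlated.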

An apparatus for predicting patient discharge according to an embodiment may experiment with five machine learning models to find the most suitable one. For example, a logistic regression (LR) model may be set as the baseline for performance estimation. A support vector machine (SVM), a random forest (RF), a multi-layer perceptron (MLP), and XGBoost (Extreme Gradient Boosting; XGB) may be selected as the machine learning models for comparison. The apparatus according to an embodiment may perform hyperparameter tuning for each model through random search.

The apparatus for predicting discharge according to an embodiment may select XGB, which is one of the GBM (Gradient-Boosting Algorithm) models, as the final model. GBM may include an ensemble method that combines several weak classifiers (e.g., trees). The main idea of GBM may be to focus on, and give weight to, incorrectly predicted results. While XGB is being trained, one tree may learn the data set and assign weights to the incorrectly predicted records, and the next tree of the same model may learn the weighted data set and assign weights in turn, with this process being repeated.

For reference, GBM is an explainable machine learning model that can quantify the contribution of features to the prediction results, for example as feature importance. In particular, XGB may have advantages in regularization and performance. XGB may perform parallel processing, may be regularized to prevent overfitting, may be widely used for learning from structured data, and may have excellent predictive performance.

Feature importance may list the features that a tree-based model considers important in the course of training on the data, together with the contribution scores of those features. XGB may be considered as the final model not only because of its high performance but also because the internals of the model, including its decision-making process, can be accessed. By accessing the trees, the specific features that contributed to each patient's daily discharge prediction, and the influence of those features, may be explained.

FIG. 8 shows ROC curves for comparing the performance of a plurality of machine learning models according to an embodiment.

[Table 1] Evaluation of each model by AUROC score in 5-fold cross-validation

The five ML-based models may be tested using 5-fold cross-validation, and the AUROC score for each fold may be shown in Table 1. The highest AUROC score for each fold is shown in bold, and the "Support" column of Table 1 represents the number of rows with each true label. FIG. 8 may show the ROC curve plot. The area under each curve may represent the AUROC, which takes a value between 0 and 1. The closer the AUROC score is to 1, the better the performance of the model may be. XGB may obtain the highest and relatively stable scores across all folds.

[Table 2] Comparison of the five ML-based models by metric results

Table 2 compares the evaluation results of the five ML-based models. All scores in Table 2 are the mean and standard deviation of the results over the five folds, and the highest score for each metric may be shown in bold. For specificity, LR and SVM were highest at 0.828, but XGB may be highest on the remaining metrics. In particular, even though the labels of the data set are imbalanced, XGB may score 0.7 or higher in predicting label 1. Therefore, XGB may be selected as the final model for predicting the probability of discharge.

FIG. 9 illustrates selecting features to be applied to a machine learning model based on feature importance according to an embodiment.

Feature importance may indicate how important a given feature of the input data is to the machine learning model. Feature importance may be calculated according to how much the prediction error increases, relative to the original data, when the value of the feature is replaced with a random value.
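One common way to realize the idea above is permutation importance, in which the values of one feature are shuffled across rows and the resulting increase in error is measured. The sketch below is an assumption about how such a calculation could look, not the specification's own procedure; `predict`, the toy data, and the use of error rate as the error measure are all hypothetical.

```python
import random


def error_rate(predict, X, y):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict, X, y, feature_index, seed=0):
    """Increase in error after shuffling one feature column across rows."""
    baseline = error_rate(predict, X, y)
    column = [row[feature_index] for row in X]
    random.Random(seed).shuffle(column)   # randomize this feature only
    X_perm = [
        row[:feature_index] + [value] + row[feature_index + 1:]
        for row, value in zip(X, column)
    ]
    return error_rate(predict, X_perm, y) - baseline


# Hypothetical model that looks only at feature 0 and ignores feature 1.
def predict(row):
    return 1 if row[0] > 0.5 else 0


X = [[0.9, 5], [0.8, 1], [0.2, 7], [0.1, 3], [0.7, 2], [0.3, 9]]
y = [1, 1, 0, 0, 1, 0]

importance_used = permutation_importance(predict, X, y, feature_index=0)
importance_ignored = permutation_importance(predict, X, y, feature_index=1)
print(importance_used, importance_ignored)
```

A feature the model never consults yields an importance of exactly zero, because shuffling it cannot change any prediction.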

Graph 900 may show the relative feature importance sorted by the gain score of XGB. The gain score may represent the average gain over all splits in which the feature is used. All features used in the machine learning model according to an embodiment may be replaced by the names used at Asan Medical Center (AMC). Apart from the date-related features, most of the features that affect the model may be found across all of the tables. The features of the procedure table may be substantially related to clinically important situations. For example, a term marked (D) may be more likely than others to indicate a more severe condition. The remaining features may also be associated with CVD or may include primary examinations and prescriptions during hospitalization.

Feature importance only indicates how important a feature is to the machine learning model, and is distinct from the feature influence described below with reference to FIG. 11. As described below, feature influence is a value indicating how much the value of a feature affected a single output (e.g., a likelihood score). Because feature importance explains the model and can hardly explain an individual patient, it may be insufficient as an individual explainer for a prediction (e.g., a likelihood score). Depending on the patient's condition, different features may affect the daily discharge probability each time. An individual explainer that provides, for each patient, the features that affected the daily discharge probability during the hospitalization period may therefore be proposed. The individual explainer is described in detail with reference to FIGS. 11 to 13.

A processor according to an embodiment may select, based on feature importance, which features of the medical data are used as features of the input data.

In step 910, the processor may calculate feature importance for a temporary machine learning model. The temporary machine learning model may be a machine learning model trained on data that includes all features of the preprocessed medical data. The feature importance may indicate how important a given feature is to the machine learning model. For example, the feature importance may be calculated according to how much the prediction error of the temporary machine learning model increases, relative to the original data, when the value of the feature in the data is replaced with a random value.

In step 920, the processor may select the features of the input data of the machine learning model based on the calculated feature importance. For example, the processor may sort the features in descending order of calculated feature importance and select a predefined number of top-ranked features. As another example, the processor may select the features whose feature importance is equal to or greater than a predefined threshold feature importance.
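Both selection rules in step 920 can be sketched in a few lines. The function names, feature names, and importance values below are hypothetical and serve only to illustrate the two rules.

```python
def select_top_k(importance, k):
    """Sort features by importance (descending) and keep the first k."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


def select_above(importance, min_importance):
    """Keep features whose importance is at least the threshold."""
    return [name for name, value in importance.items() if value >= min_importance]


# Hypothetical importance values for four features.
importance = {"feature_a": 0.4, "feature_b": 0.1, "feature_c": 0.3, "feature_d": 0.2}

top_two = select_top_k(importance, k=2)
above = select_above(importance, min_importance=0.2)
print(top_two)
print(above)
```

The top-k rule fixes the input size in advance, whereas the threshold rule lets the number of selected features vary with the data.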

Through the selection of features of the input data, the input format of the machine learning model may be determined as a data format that includes the selected features.

In step 930, the processor may train the machine learning model based on the selected features. The processor may extract the selected features from the medical data and obtain training data composed of the extracted features. The processor may obtain the machine learning model by training it on the selected features rather than on all features of the medical data.

In step 940, the processor may obtain a likelihood score by applying the machine learning model to the selected features. The machine learning model may be applied to input data composed of the selected features. According to one embodiment, the input data may be obtained by extracting the selected features from medical data that includes all features for the patient. According to another embodiment, the input data may be obtained by collecting from the patient only the features selected as inputs of the machine learning model.

FIG. 10 shows the performance of machine learning models for input data including selected features according to an embodiment. Too many features may negatively affect model performance. Therefore, selecting an appropriate number of features may be required.

According to an embodiment, recursive feature elimination with cross-validation (RFECV) may be performed. The goal of RFECV may be to remove features with low feature importance one at a time while identifying the optimal number of features by comparing model performance. RFECV may return the rank and name of every feature. By applying RFECV to the final model, XGB, about 150 features with rank 1 may be identified. For the performance comparison, 5-fold cross-validation may be performed using the same data set with the same parameters.

The embodiments compared in FIG. 10 may include an embodiment based on all 886 features (denoted XGB 886), an embodiment based on the 150 features selected by RFECV (denoted XGB RFE 150), and an embodiment based on the 50 features with the highest feature importance in the model trained on the 150 RFECV-selected features (denoted XGB RFE & FI 50). The numbers in parentheses in the legend of FIG. 10 may represent the AUROC score of each embodiment.

[Table 3] Evaluation by AUROC score in 5-fold cross-validation for feature selection

Together with FIG. 10, Table 3 may show that the performance difference among the model using all features, the model using 150 features, and the model using 50 features is only about 1 to 2.5% in terms of AUROC score. That is, even when a feature reduction of 83.1% to 94.4% is applied according to an embodiment, the maximum performance difference may be only 2.5%. The number of features may be adjusted appropriately in consideration of each hospital's situation or the characteristics of the data.
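The reduction percentages quoted above follow directly from the feature counts (886 in total, 150 or 50 retained). A short check, with a hypothetical helper name:

```python
def reduction_percent(total, kept):
    """Percentage of features removed when `kept` of `total` are retained."""
    return (total - kept) / total * 100


TOTAL_FEATURES = 886
reductions = {kept: reduction_percent(TOTAL_FEATURES, kept) for kept in (150, 50)}
for kept, reduction in reductions.items():
    print(f"{kept} features kept: {reduction:.1f}% reduction")
```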

The prediction model may classify data as 0 or 1 according to a threshold score. The optimal threshold score may be the score at which the sum of sensitivity and precision is maximized. On the ROC curve, TPR and FPR may increase together, whereas sensitivity and precision may involve a trade-off. Reducing false negatives (FN) increases sensitivity, and reducing false positives (FP) increases precision. The threshold score may need to be adjusted appropriately at the point where hospital operation decisions are made.
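A threshold that maximizes the sum of sensitivity and precision can be found by scanning candidate values. The sketch below is illustrative only: the candidate grid, the toy scores, and the choice to skip thresholds that produce no true positives are assumptions, not details from the specification.

```python
def best_threshold(scores, labels, candidates):
    """Return (threshold, sensitivity + precision) with the largest sum."""
    best = None
    for threshold in candidates:
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
        fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
        if tp == 0:
            continue  # sensitivity and precision are both 0 or undefined
        total = tp / (tp + fn) + tp / (tp + fp)
        if best is None or total > best[1]:
            best = (threshold, total)
    return best


# Toy likelihood scores and true labels (1 = discharged within the target period).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1]
best = best_threshold(scores, labels, candidates=[0.3, 0.4, 0.5, 0.7])
print(best)
```

In this toy data, raising the threshold from 0.3 to 0.4 removes a false positive without losing any true positive, while raising it further starts to cost sensitivity.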

For reference, although the optimal threshold score may be adjusted according to the hospital's situation, likelihood scores close to the threshold score may make decisions ambiguous. Additional techniques may be used to reduce this ambiguity. For example, a technique using a weighted average may be used to make the results more conservative but more reliable. Rather than directly using the likelihood score (e.g., probability) returned by the model, it may be more useful to give weight to the results from before the prediction time point, so that past results are at least partially reflected at the prediction time point. Producing reliable results may be as important as explaining the model and its internal features.
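The weighted-average idea can be sketched as follows. The specification does not give the weights or the window length, so the values below (0.5 for the current day, 0.3 and 0.2 for the two preceding days) and the function name are purely hypothetical.

```python
def smooth_scores(scores, weights=(0.5, 0.3, 0.2)):
    """Blend each day's score with earlier days' scores.

    weights[0] applies to the current day, weights[1] to the previous
    day, and so on. Weights are renormalized when fewer past days exist.
    """
    smoothed = []
    for i in range(len(scores)):
        weighted_sum = 0.0
        weight_total = 0.0
        for lag, weight in enumerate(weights):
            if i - lag < 0:
                break                      # no earlier prediction available
            weighted_sum += weight * scores[i - lag]
            weight_total += weight
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# A sudden jump on the third day is damped by the two earlier low scores.
daily_scores = [0.2, 0.2, 0.8]
smoothed = smooth_scores(daily_scores)
print(smoothed)
```

Damping a single-day jump in this way makes the prediction more conservative: a score has to stay high for several days before the smoothed value crosses the threshold.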

The daily discharge probabilities during the hospitalization period, and the feature influences for each day, may be presented through an individual explainer for the prediction.

FIG. 11 shows a waterfall chart representing feature influence according to an embodiment.

An individual explainer that uses a waterfall chart to help interpret the prediction results of XGB may be presented. A waterfall chart, also called a bridge chart or cascade chart, is a type of bar chart that may depict relative values computed as the differences between adjacent values. It may show the step-by-step direction toward the final discharge probability and the positive or negative influence at each step.

The individual explainer may show the feature influence for an obtained likelihood score. The feature influence may correspond to the portion of the likelihood score caused by each feature. The feature influence may be expressed as a contribution that quantifies how much the feature contributed to the likelihood score. By displaying the feature influence for a likelihood score, the display may let the user identify the features that most strongly affected that likelihood score and may visually present the magnitude of each influence.

According to an embodiment, to estimate the values for the individual explainer, the desired records may be predicted with the trained XGB and the contributions of all features may be obtained. For example, a contribution may represent the feature influence obtained by aggregating the scores that each feature contributed across all trees. Subsequently, the logistic value of the feature influence and the relative values required for the explainer may be calculated.
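One way to read the calculation above is sketched below. It assumes that the per-feature contributions and an intercept are expressed in log-odds, that the logistic function 1 / (1 + e^(-x)) maps a running total to a probability, and that each bar of the waterfall chart is the difference between adjacent probabilities. These are assumptions for illustration; the feature names and values are hypothetical.

```python
import math


def logistic(x):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def waterfall_steps(intercept, contributions):
    """Convert log-odds contributions into probability-scale bar widths.

    Returns (start probability, [(feature, delta), ...], final probability).
    """
    running = intercept
    previous = logistic(running)
    start = previous
    steps = []
    for name, value in contributions:
        running += value
        current = logistic(running)
        steps.append((name, current - previous))   # relative value of this bar
        previous = current
    return start, steps, previous


# Hypothetical intercept and contributions, in log-odds.
start, steps, final = waterfall_steps(
    intercept=-0.5,
    contributions=[("feature_a", 1.2), ("feature_b", -0.4), ("Others", 0.3)],
)
print(start, steps, final)
```

Because each bar is a difference of adjacent probabilities, the start value plus all bar widths always adds up to the final likelihood score.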

The processor may select one or more of the plurality of features based on feature influence. The processor may sort the features in descending order of feature influence and select a predefined number of top-ranked features. The selected features may be shown on the display. For example, the number of features to be displayed may be set to 15, and the remaining 871 features may all be merged and shown together as "Others" in the explainer.
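The selection and merging step can be sketched as follows. Ranking by the absolute value of the influence is an assumption (the waterfall chart description orders boxes by absolute value); the function name and the toy values are hypothetical.

```python
def top_features_with_others(influences, k=15):
    """Keep the k most influential features and merge the rest into "Others"."""
    ranked = sorted(influences.items(), key=lambda item: abs(item[1]), reverse=True)
    top = ranked[:k]
    others_total = sum(value for _, value in ranked[k:])
    return top + [("Others", others_total)]


# Toy example with four features, keeping the top two.
influences = {"feature_a": 0.5, "feature_b": -0.7, "feature_c": 0.1, "feature_d": -0.05}
display_rows = top_features_with_others(influences, k=2)
print(display_rows)
```

With 886 features and 15 displayed, the "Others" row would aggregate the remaining 871 contributions, which is why it can be wider than the individually listed features.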

In FIG. 11, the x-axis of the plot is a score from 0 to 1, and the y-axis may show the contributions and the values that affected the likelihood score (e.g., the first likelihood score) at the prediction time point (e.g., the first time point). The intercept, shown as the plain diagonally hatched box at the bottom of the y-axis, may be a corrected value reflecting the imbalance in the number of each true label. The discharge probability, shown as the gray box at the top of the y-axis, may represent the likelihood score. The width of each box corresponding to a feature may represent the absolute value of its score. The actual score may be displayed on the right side of the plot. The absolute values may decrease from bottom to top, indicating that the contribution to the discharge probability also decreases. For reference, the "Others" box may be relatively wide because it is the sum of the scores of about 800 features, excluding the features below it. A dotted box may represent the score of a feature that contributed positively to the discharge probability, and a dotted box may shift the score to the right in the graph. Conversely, a diagonally hatched box may represent the score of a feature that contributed negatively, and may shift the score to the left in the graph.

In summary, the features that contributed to the prediction are listed from bottom to top on the y-axis, with dotted boxes extending to the right indicating positive contributions and diagonally hatched boxes extending to the left indicating negative contributions.

Graph 1110 may show the feature influence for the likelihood score of 0.004 obtained on day 7 (Date: 7), and graph 1120 may show the feature influence for the likelihood score of 0.811 obtained on day 12 (Date: 12). In graph 1110, (D)ARTERIAL MONITORING = 1.0 and (D)INFUSION PUMP = 3.0 may have a negative influence on the likelihood score. In contrast, in graph 1120, (D)INFUSION PUMP = 0 may have a positive influence on the discharge probability. Because arterial monitoring and infusion pumps are mainly prescribed for critically ill patients, both features may be mostly 0 in the data set. Displaying the value of each feature together with the feature may help medical staff interpret the plot intuitively.

An individual descriptor may or may not include the features that appear in the feature importance plot (described with reference to Fig. 9). This may suggest that it is necessary to identify the features that contributed to the prediction for an individual patient, rather than managing only the feature importance of the overall model.

Fig. 12 shows likelihood scores predicted at a plurality of time points during a patient's hospitalization period according to one embodiment.

The display may show a plurality of likelihood scores predicted for a hospitalized patient over time. The likelihood scores are predicted at different time points, and showing them over time may indicate how the patient's likelihood of discharge changes.

For example, as shown in Fig. 12, the sample data set may be the record of a patient with a PAID of 228,443 and an INNO of 2 who was hospitalized for 13 days and discharged on day 14. Fig. 12 plots this patient's likelihood scores. The x-axis of the plot may represent the days of the patient's hospitalization period, excluding the discharge date (day 14), and the y-axis may represent the likelihood score (e.g., the discharge probability). The optimal threshold score of the model may be 0.39, indicated by the horizontal dotted line. Circles and triangles may represent true labels 1 and 0, respectively, and the size of each circle or triangle may be proportional to the likelihood score. The fill pattern of each marker may represent the result predicted by the model: dots may represent a positive prediction (e.g., label 1; predicted to be discharged), and diagonal hatching may represent a negative prediction (e.g., label 0; predicted to remain hospitalized).

For the sample in Fig. 12, the model may correctly predict discharge within 3 days. However, adjusting the threshold score may change the prediction results for days 11 and 12. For example, if the threshold score is increased, label 1 may apply only to days 12 and 13. Increasing the threshold score may be useful when the aim is to reduce false positives (FP) even though false negatives (FN) increase.
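The effect of the threshold score described above can be illustrated with a small sketch. The threshold 0.39 and the scores for day 7 (0.004) and day 12 (0.811) are taken from the description; the scores for days 11 and 13, the stricter threshold 0.60, the use of `>=` for the comparison, and the helper names are assumptions made only for illustration. The true labels assume a 3-day target period and discharge on day 14, so days 11 to 13 carry label 1.

```python
from typing import Dict, Tuple


def predict_labels(scores: Dict[int, float], threshold: float) -> Dict[int, int]:
    """Label a day 1 (discharge predicted) when its score reaches the threshold."""
    return {day: int(score >= threshold) for day, score in scores.items()}


def count_errors(pred: Dict[int, int], truth: Dict[int, int]) -> Tuple[int, int]:
    """Return (false positives, false negatives) over the days in pred."""
    fp = sum(1 for d in pred if pred[d] == 1 and truth[d] == 0)
    fn = sum(1 for d in pred if pred[d] == 0 and truth[d] == 1)
    return fp, fn


# Days 7 and 12 use scores from the description; days 11 and 13 are made up.
scores = {7: 0.004, 11: 0.45, 12: 0.811, 13: 0.90}
# Assumed true labels: discharge on day 14 with a 3-day target period.
truth = {7: 0, 11: 1, 12: 1, 13: 1}

default = predict_labels(scores, threshold=0.39)   # threshold from Fig. 12
stricter = predict_labels(scores, threshold=0.60)  # hypothetical higher threshold

print("default :", default, count_errors(default, truth))
print("stricter:", stricter, count_errors(stricter, truth))
```

With these illustrative numbers, raising the threshold removes day 11 from the positive predictions, which is the trade-off described above: fewer positive calls overall, at the cost of an additional false negative.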

Fig. 13 shows a simulated impact on bed management when the prediction model and the individual descriptor are applied according to one embodiment. For each ward, the discharge probability of every patient may be checked daily, and the most important features affecting the discharge likelihood score, together with their values, may be identified at a glance. Because the individual descriptor implies inferences about not only discharge but also long-term discharge, it may be useful for interpreting both high and low discharge probabilities. Similarly, information such as bed capacity in the near future may be obtained based on the expected discharge date of each patient. With respect to using the hospital's human and material resources efficiently, such future bed information may help reduce hospital costs by improving bed management and admission scheduling.
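One simple way to turn per-patient likelihood scores into ward-level bed information is to count, for each ward, the patients predicted to be discharged within the target period. The sketch below is an assumption-laden illustration rather than the simulation of Fig. 13: the function name `predicted_discharges_by_ward`, the ward identifiers, and the sample scores are hypothetical, and the threshold default reuses the 0.39 value mentioned for Fig. 12.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def predicted_discharges_by_ward(
    patients: Iterable[Tuple[str, float]], threshold: float = 0.39
) -> Dict[str, int]:
    """Count patients per ward whose likelihood score reaches the threshold.

    Each item in `patients` is (ward_id, likelihood_score). Wards with no
    predicted discharge are still reported, with a count of 0.
    """
    freed: Dict[str, int] = defaultdict(int)
    for ward, score in patients:
        # Adding 0 keeps wards without any predicted discharge in the result.
        freed[ward] += int(score >= threshold)
    return dict(freed)


# Hypothetical census: (ward, likelihood score) for each hospitalized patient.
census = [("W1", 0.811), ("W1", 0.004), ("W2", 0.50), ("W3", 0.10)]
print(predicted_discharges_by_ward(census))
```

A bed manager could compare these counts against planned admissions for each ward; summing the scores instead of counting thresholded predictions would be an alternative design that gives an expected value rather than a hard count.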

Research on bed management, which must make use of hospital processes, and on biomarker detection for patient-tailored treatment may be actively conducted. The present invention may propose an ML-based prediction model that identifies discharge dates for better bed management and identifies risk factors related to discharge and CVD. However, because environmental variables differ from hospital to hospital, an algorithm that can take them into account comprehensively may be required. The present invention may contribute to improving such algorithms and to supporting medical services. The expectations for the prediction model are described below.

The model according to the present invention may be extended from ward-level to hospital-level bed management. The present invention may contribute to reducing the labor-intensive work of medical staff and the waiting time of patients.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but those of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may also be distributed over network-connected computer systems and stored or executed in a distributed manner. The software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set forth below.

Claims

A device for predicting patient discharge, comprising:
a processor configured to obtain, from raw medical data including medical information on a plurality of patients, medical data collected from an admission date to a first time point during a hospitalization period of a patient who is hospitalized at the first time point, apply a machine learning model to input data including the medical data to obtain a first likelihood score that the patient will be discharged within a target period from the first time point, and predict, based on the obtained first likelihood score, whether the patient will be discharged within the target period from the first time point,
wherein the processor is configured to:
obtain the raw medical data including a plurality of tables;
for the plurality of tables, in response to the number of values that each feature can take exceeding a predetermined number, keep values having a frequency greater than a threshold frequency and replace values having a frequency less than the threshold frequency with a common value;
merge the plurality of tables based on at least one feature among a patient ID, a patient encounter number, or an admission date included in each of the plurality of tables; and
obtain, from a result of merging the plurality of tables, the medical data aggregated by date for the patient.

The device of claim 1,
wherein the processor is configured to
obtain the first likelihood score by applying the machine learning model to input data including medical data, collected for the patient from the admission date to the first time point, regarding one of, or a combination of two or more of, operation, procedure, Picture Archiving and Communication System (PACS), diagnosis, medication, laboratory, and physical.

The device of claim 1,
wherein the processor is configured to:
in response to the patient being hospitalized at a second time point after the first time point, update the medical data collected during the hospitalization period based on medical data collected from the first time point to the second time point;
obtain a second likelihood score that the patient will be discharged within the target period from the second time point by applying the machine learning model to input data including the updated medical data; and
predict, based on the obtained second likelihood score, whether the patient will be discharged within the target period from the second time point.

The device of claim 1,
wherein the processor is configured to
obtain the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data collected during a predefined period prior to the hospitalization period.

The device of claim 1,
wherein the processor is configured to
obtain the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period and medical data, collected for the patient during a predefined period prior to the hospitalization period, regarding one of, or a combination of two or more of, diagnosis, medication, laboratory, physical, and length of stay (LOS) in an intensive care unit (ICU).

The device of claim 1,
wherein the processor is configured to:
select one or more features among the features of the collected data as features of the input data of the machine learning model, based on a feature importance of each feature for a temporary machine learning model trained on all features of the collected data;
train the machine learning model based on the selected features; and
obtain the first likelihood score by applying the machine learning model to the input data including the selected features.

The device of claim 1,
wherein the processor is configured to
select one or more features as inputs to the machine learning model by applying a recursive feature elimination with cross-validation (RFECV) technique to the features of the input data.

The device of claim 1,
wherein the processor is configured to
obtain the first likelihood score by applying an extreme gradient boost (XGB) model to the input data.

The device of claim 1,
wherein the processor is configured to
select one or more features among the features of the input data based on a feature influence corresponding to a score induced by each feature of the input data with respect to the obtained first likelihood score.

The device of claim 9, further comprising
a display configured to show the feature influence of the selected one or more features on the obtained first likelihood score.

The device of claim 1, further comprising
a display configured to show likelihood scores for a plurality of time points during the hospitalization period.

A method for predicting patient discharge, performed by a device for predicting patient discharge, the method comprising:
obtaining, from raw medical data including medical information on a plurality of patients, medical data collected from an admission date to a first time point during a hospitalization period of a patient who is hospitalized at the first time point;
obtaining a first likelihood score that the patient will be discharged within a target period from the first time point by applying a machine learning model to input data including the medical data; and
predicting, based on the obtained first likelihood score, whether the patient will be discharged within the target period from the first time point,
wherein obtaining the medical data from the raw medical data comprises:
obtaining the raw medical data including a plurality of tables;
for the plurality of tables, in response to the number of values that each feature can take exceeding a predetermined number, keeping values having a frequency greater than a threshold frequency and replacing values having a frequency less than the threshold frequency with a common value;
merging the plurality of tables based on at least one feature among a patient ID, a patient encounter number, or an admission date included in each of the plurality of tables; and
obtaining, from a result of merging the plurality of tables, the medical data aggregated by date for the patient.

The method of claim 12,
wherein obtaining the first likelihood score comprises
obtaining the first likelihood score by applying the machine learning model to input data including medical data, collected for the patient from the admission date to the first time point, regarding one of, or a combination of two or more of, operation, procedure, Picture Archiving and Communication System (PACS), diagnosis, medication, laboratory, and physical.

The method of claim 12, further comprising:
in response to the patient being hospitalized at a second time point after the first time point, updating the medical data collected during the hospitalization period based on medical data collected from the first time point to the second time point;
obtaining a second likelihood score that the patient will be discharged within the target period from the second time point by applying the machine learning model to input data including the updated medical data; and
predicting, based on the obtained second likelihood score, whether the patient will be discharged within the target period from the second time point.

The method of claim 12,
wherein obtaining the first likelihood score comprises
obtaining the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period together with medical data collected during a predefined period prior to the hospitalization period.

The method of claim 12,
wherein obtaining the first likelihood score comprises
obtaining the first likelihood score by applying the machine learning model to input data including the medical data collected during the hospitalization period and medical data, collected for the patient during a predefined period prior to the hospitalization period, regarding one of, or a combination of two or more of, diagnosis, medication, laboratory, physical, and length of stay (LOS) in an intensive care unit (ICU).

The method of claim 12, further comprising:
selecting one or more features among the features of the collected data as features of the input data of the machine learning model, based on a feature importance of each feature for a temporary machine learning model trained on all features of the collected data; and
training the machine learning model based on the selected features,
wherein obtaining the first likelihood score comprises
obtaining the first likelihood score by applying the machine learning model to the input data including the selected features.

The method of claim 17,
wherein selecting the one or more features as features of the input data of the machine learning model comprises
selecting one or more features as inputs to the machine learning model by applying a recursive feature elimination with cross-validation (RFECV) technique to the features of the input data.

The method of claim 12,
wherein obtaining the first likelihood score comprises
obtaining the first likelihood score by applying an extreme gradient boost (XGBoost) model to the input data.

The method of claim 12, further comprising
selecting one or more features among the features of the input data based on a feature influence corresponding to a score induced by each feature of the input data with respect to the obtained first likelihood score.

The method of claim 20, further comprising
displaying the feature influence of the selected one or more features on the obtained first likelihood score.

The method of claim 15, further comprising
displaying likelihood scores for a plurality of time points during the hospitalization period.

A computer program stored on a computer-readable recording medium and, in combination with hardware, executing the method of any one of claims 12 to 22.