A multi-source heterogeneous medical data enhancement framework based on lakehouse

Ming Sheng¹,
Shuliang Wang¹,
Yong Zhang²,
Rui Hao³ (ORCID: orcid.org/0009-0002-6634-3156),
Ye Liang⁴,
Yi Luo¹,
Wenhan Yang³,
Jincheng Wang⁵,
Yinan Li⁶,
Wenkui Zheng⁷ &
Wenyao Li⁷


Abstract

Obtaining high-quality datasets from raw data is a key step before data exploration and analysis. In the medical domain, a large amount of data needs quality improvement before it can be used to analyze patients' health conditions. Data extraction, data cleaning and data imputation have each been studied extensively, but few frameworks integrate all three techniques, so the resulting datasets suffer in accuracy, consistency and integrity. In this paper, we propose a multi-source heterogeneous data enhancement framework based on the lakehouse MHDP, which comprises three steps: data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique handles multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the state-of-the-art (SOTA) algorithm SAITS are applied to different situations. We evaluate the framework on three tasks: clustering, classification and strategy prediction. The experimental results demonstrate the effectiveness of our data enhancement framework.
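
To make the multiple imputation (MI) idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single numeric column with missing entries marked as `None`, fills each gap with a random draw from a normal distribution fitted to the observed values, and pools the per-dataset mean estimates with Rubin's rules. The function names, the `heart_rate` sample and the normal-draw model are all hypothetical simplifications; a full MI procedure would also account for parameter uncertainty and for the other variables in the record.

```python
import random
import statistics


def multiply_impute(values, m=5, seed=0):
    """Return m completed copies of `values`, replacing None with random draws.

    Assumption: missing entries are drawn from a normal distribution fitted
    to the observed values (a deliberately simple imputation model).
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    completed = []
    for _ in range(m):
        completed.append(
            [v if v is not None else rng.gauss(mu, sigma) for v in values]
        )
    return completed


def pool_means(completed):
    """Pool the mean estimated on each completed dataset using Rubin's rules."""
    m = len(completed)
    estimates = [statistics.mean(d) for d in completed]
    # Within-imputation variance: average squared standard error of the mean.
    within = statistics.mean([statistics.variance(d) / len(d) for d in completed])
    # Between-imputation variance: spread of the m point estimates.
    between = statistics.variance(estimates)
    pooled = statistics.mean(estimates)
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var


# Hypothetical readings with two missing entries.
heart_rate = [72.0, None, 88.0, 91.0, None, 79.0, 84.0]
datasets = multiply_impute(heart_rate, m=5, seed=42)
pooled_mean, total_variance = pool_means(datasets)
print(f"pooled mean = {pooled_mean:.2f}, total variance = {total_variance:.2f}")
```

The point of drawing several completed datasets rather than one is that the between-imputation term lets the pooled variance reflect the uncertainty introduced by the missing values.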

Data availability

The data used in this study consist of sensitive patient medical records, which include personal information that requires strict confidentiality. As a result, we are unable to share these data publicly. The restrictions are in place to protect the privacy and well-being of the research participants, ensuring that their personal health information remains secure and is not misused.

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Ming Sheng, Shuliang Wang & Yi Luo
BNRist, DCST, RIIT, Tsinghua University, Beijing, 100084, China
Yong Zhang
School of Computer Science, International School, Beijing University of Posts and Telecommunications, Haidian District, Beijing, 100876, China
Rui Hao & Wenhan Yang
School of Information Science and Technology, Beijing Foreign Studies University, Beijing, China
Ye Liang
School of Computer and Information Technology, Beijing Jiaotong University, Haidian District, Beijing, 100044, China
Jincheng Wang
Dam Safety Monitoring Center, Yellow River Engineering Consulting Co., Ltd, Zhengzhou, 450000, China
Yinan Li
School of Software, Henan University, Kaifeng, 475004, Henan, China
Wenkui Zheng & Wenyao Li

Corresponding authors

Correspondence to Shuliang Wang, Yong Zhang, Rui Hao or Ye Liang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheng, M., Wang, S., Zhang, Y. et al. A multi-source heterogeneous medical data enhancement framework based on lakehouse. Health Inf Sci Syst 12, 37 (2024). https://doi.org/10.1007/s13755-024-00295-6

Received: 31 August 2023
Accepted: 17 June 2024
Published: 05 July 2024
DOI: https://doi.org/10.1007/s13755-024-00295-6
