A Privacy-Preserving and Standard-Based Architecture for Secondary Use of Clinical Data
Figure 1. Proposed architecture.
Figure 2. Flow chart of an anonymization algorithm.
Figure 3. E-R schema of extracted data case study.
Figure 4. Example of statistics obtained through the secondary use of the data.
Abstract
1. Introduction
- The definition of an architecture able to provide the integration of heterogeneous health and clinical data according to the FHIR interoperable standard, which also includes the capability of automatically selecting and applying the most suitable pseudonymization algorithms, enabling the secondary use of clinical data;
- The proposal of a solution to perform the analysis of the obtained large collections of FHIR resources.
2. Background
2.1. International Regulation and Law
2.2. ETL Process
- Extraction: In this phase, the data is extracted from heterogeneous sources. The extracted data is managed in a staging area, such as a data lake;
- Transformation: In this phase, the collected data is converted into the correct format, which is defined by applying the following rules:
  - Standardization: Includes the selection of useful data, the methods, and the standard format;
  - Deduplication: Identifies useless duplicated data;
  - Verification: Eliminates incorrect data;
  - Sorting: Groups and sorts data;
  - Other activities: Depend on the context and the purposes of the ETL process (e.g., de-identification).
- Loading: In this phase, the extracted and transformed data is loaded into a new destination, such as a data management system, data lake, or any other kind of data repository, where it can be managed and analyzed. The load can be full or incremental.
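The three phases above can be illustrated with a minimal, in-memory sketch. Everything in it is an assumption made for illustration only: the `Repository` class, the `id` and `code` field names, and the specific cleaning rules are hypothetical and are not taken from the proposed architecture.

```python
from dataclasses import dataclass, field


def _text(value):
    """Normalize a raw value to a stripped string (None becomes '')."""
    return "" if value is None else str(value).strip()


def extract(sources):
    """Extraction: gather rows from heterogeneous sources into a staging list."""
    staging = []
    for source in sources:
        staging.extend(source)
    return staging


def transform(staging):
    """Transformation: standardize, verify, deduplicate, and sort the rows."""
    seen = set()
    cleaned = []
    for row in staging:
        # Standardization: keep only the useful fields, in a single format.
        std = {"id": _text(row.get("id")), "code": _text(row.get("code")).upper()}
        # Verification: discard rows with a missing identifier or code.
        if not std["id"] or not std["code"]:
            continue
        # Deduplication: skip rows that were already accepted.
        key = (std["id"], std["code"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(std)
    # Sorting: order the result by identifier.
    return sorted(cleaned, key=lambda r: r["id"])


@dataclass
class Repository:
    """Hypothetical in-memory stand-in for the destination data store."""
    records: dict = field(default_factory=dict)

    def load(self, rows, full=False):
        # A full load replaces the content; an incremental load upserts by id.
        if full:
            self.records.clear()
        for row in rows:
            self.records[row["id"]] = row
```

A call such as `Repository().load(transform(extract(sources)))` would then run the whole chain on a list of row lists.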
2.3. Data De-Identification
- Character/record masking [15]: An anonymization technique that removes the main personal identifiers, such as name and date of birth. It is used, for example, in legal databases.
- Shuffling [16]: An anonymization technique that breaks the relationship between the data and the person by replacing each sensitive value with a different value of the same type drawn from the same corpus. Numerous anonymization methodologies apply this technique at different levels of the information structure [17].
- Pseudonymization [13]: A technique that replaces a unique attribute of a record with another value. The person could, however, still be identified indirectly.
- Generalization [18]: An anonymization technique that generalizes the attributes associated with people. For example, a date of birth can be generalized by keeping only the year and omitting the day and month.
- K-anonymity [20]: This technique tries to prevent the identification of a person by aggregating them into a group of at least k people who share the same attribute values. When a value is shared by k people, it is more difficult to identify a specific person. Generalization therefore makes it possible to share a given attribute value among a larger number of people. The main flaw of the k-anonymity model is that it does not protect against deductive attacks. Furthermore, intersecting the groups defined by different attributes can make it even easier to identify the person.
- L-diversity [21]: This technique extends k-anonymity to make deterministic deduction attacks ineffective by ensuring that each equivalence class contains at least L different values of the sensitive attribute. L-diversity remains subject to probabilistic deduction attacks.
- T-closeness [20,22]: This technique is an evolution of L-diversity whose goal is to create equivalence classes in which the distribution of attribute values resembles the initial one. This is useful when the values obtained must stay as close as possible to the starting ones. It requires not only that at least L different values exist within each equivalence class, as in L-diversity, but also that each value is represented as many times as necessary to reflect the initial distribution of each attribute.
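As a rough illustration of how generalization, k-anonymity, and L-diversity relate, the sketch below checks both properties on a tiny invented data set. The records, the field names (`birth`, `zip`, `diagnosis`), and the choice of quasi-identifiers are assumptions made for this example; the checks follow the textbook definitions and are not the algorithms used by the proposed architecture.

```python
from collections import defaultdict


def generalize_birth_date(iso_date):
    """Generalization: keep only the year of an ISO 'YYYY-MM-DD' date."""
    return iso_date[:4]


def equivalence_classes(records, quasi_identifiers):
    """Group the records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[q] for q in quasi_identifiers)].append(rec)
    return classes


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers).values()
    return all(len(group) >= k for group in groups)


def is_l_diverse(records, quasi_identifiers, sensitive, min_distinct):
    """True if every class holds at least min_distinct sensitive values."""
    groups = equivalence_classes(records, quasi_identifiers).values()
    return all(len({rec[sensitive] for rec in group}) >= min_distinct
               for group in groups)


# Invented sample data, for illustration only.
records = [
    {"birth": "1980-03-12", "zip": "80100", "diagnosis": "flu"},
    {"birth": "1980-11-02", "zip": "80100", "diagnosis": "flu"},
    {"birth": "1975-06-30", "zip": "80121", "diagnosis": "asthma"},
    {"birth": "1975-01-15", "zip": "80121", "diagnosis": "diabetes"},
]
generalized = [{**r, "birth": generalize_birth_date(r["birth"])} for r in records]
```

With `["birth", "zip"]` as quasi-identifiers, the raw records are not 2-anonymous, while the generalized ones are. The generalized set still fails the L-diversity check for L = 2, because both people born in 1980 share the same diagnosis: knowing that someone belongs to that group reveals the diagnosis, which is the kind of deductive attack that k-anonymity alone does not prevent.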
2.4. HL7 CDA and HL7 FHIR Standards
3. Related Works
4. System Architecture
4.1. Extraction Module
4.2. Transformation Module
4.3. Loader Module
4.4. Data Retrieval Module
5. Implementation Details
- Read, to get the current state of a specific resource. It is used by information retrieval applications interested in a specific resource profile;
- Search, to find FHIR resources that match given criteria and obtain the information of interest;
- Update and patch, to modify an existing resource. They are used to modify some statistical data or to update specific clinical observations;
- Delete, to remove a specific resource.
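The operations listed above correspond to interactions of the FHIR RESTful API, each of which maps to an HTTP method and a URL pattern. The following sketch only builds the method and URL pair and sends nothing over the network. The server address `BASE_URL` and the helper `fhir_request` are hypothetical names introduced for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://fhir.example.org/baseR4"  # hypothetical server address

# HTTP method used by each instance-level interaction.
INSTANCE_METHODS = {
    "read": "GET",
    "update": "PUT",
    "patch": "PATCH",
    "delete": "DELETE",
}


def fhir_request(interaction, resource_type, resource_id=None, params=None):
    """Return the (HTTP method, URL) pair for a FHIR RESTful interaction."""
    type_url = f"{BASE_URL}/{resource_type}"
    if interaction == "search":
        # Search targets the resource type; criteria go in the query string.
        query = f"?{urlencode(params)}" if params else ""
        return "GET", type_url + query
    if interaction not in INSTANCE_METHODS:
        raise ValueError(f"unsupported interaction: {interaction}")
    if resource_id is None:
        raise ValueError(f"{interaction} requires a resource id")
    # Read, update, patch, and delete all address one resource instance.
    return INSTANCE_METHODS[interaction], f"{type_url}/{resource_id}"
```

For example, `fhir_request("read", "Patient", "123")` yields a `GET` on `.../Patient/123`, while `fhir_request("search", "Observation", params={"code": "8867-4"})` yields a `GET` on `.../Observation?code=8867-4`. An update or patch would additionally carry the resource, or the patch document, in the request body.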
6. Preliminary Tests and Discussion
- Lawfulness of processing: This depends on the type of processing performed by an authorized professional. For example, data could be collected on the basis of a patient’s consent or for reasons of public interest.
- Purpose limitation: The proposal is limited to the purpose of secondary use (research purposes, etc.), so any subsequent processing of the data is managed in compliance with this principle. In addition, the proposed architecture provides specific retrieval capabilities depending on the context of the request.
- Data minimization: The information retrieval functionalities return only the requested FHIR resources, which depend on the purpose of use and the type of operation.
- Accuracy: This principle is met by running the transformation and loading processes at a high frequency.
- Storage limitation: The proposed architecture stores only the data necessary for the type of processing.
- Integrity and confidentiality: Access to resources can be regulated by access control mechanisms to prevent unauthorized access and reduce the related cybersecurity vulnerabilities and risks.
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Teasdale, S.; Bates, D.; Kmetik, K.; Suzewits, J.; Bainbridge, M. Secondary uses of clinical data in primary care. J. Innov. Health Inform. 2007, 15, 157–166.
- Hutchings, E.; Loomes, M.; Butow, P.; Boyle, F.M. A systematic literature review of attitudes towards secondary use and sharing of health administrative and clinical trial data: A focus on consent. Syst. Rev. 2021, 10, 1–44.
- ICH Harmonised Guideline Integrated Addendum to ICH E6(R1): Guideline for Good Clinical Practice ICH E6(R2) ICH Consensus Guideline. Available online: https://ichgcp.net (accessed on 12 December 2021).
- European Commission. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46, General Data Protection Regulation; European Commission: Brussels, Belgium, 2016.
- Albrecht, J.P. How the GDPR will change the world. Eur. Data Prot. Law Rev. 2016, 2, 287–289.
- Carrion, I.; Aleman, J.L.F.; Toval, A. Assessing the HIPAA standard in practice: PHR privacy policies. In Proceedings of the Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, Boston, MA, USA, 30 August–3 September 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2380–2383.
- Summary of the HIPAA Privacy Rule. Available online: http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary (accessed on 14 October 2021).
- United States Congress. Health Insurance Portability and Accountability Act of 1996, Accountability Act; United States Congress: Washington, DC, USA, 1996.
- West, S.L.; Blake, C.; Zhiwen, L.; McKoy, J.N.; Oertel, M.D.; Carey, T.S. Reflections on the use of electronic health record data for clinical research. Health Inform. J. 2009, 15, 108–121.
- Katulic, T.; Katulic, A. GDPR and the reuse of personal data in scientific research. In Proceedings of the 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1311–1316.
- Tayefi, M.; Ngo, P.; Chomutare, T.; Dalianis, H.; Salvi, E.; Budrionis, A.; Godtliebsen, F. Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdiscip. Rev. Comput. Stat. 2021, 13, e1549.
- Silvestri, S.; Esposito, A.; Gargiulo, F.; Sicuranza, M.; Ciampi, M.; De Pietro, G. A Big Data Architecture for the Extraction and Analysis of EHR Data. In Proceedings of the 2019 IEEE World Congress on Services (SERVICES), Milan, Italy, 8–13 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 283–288.
- Bolognini, L.; Bistolfi, C. Pseudonymization and impacts of Big (personal/anonymous) Data processing in the transition from the Directive 95/46/EC to the new EU General Data Protection Regulation. Comput. Law Secur. Rev. 2017, 33, 171–181.
- Dankar, F.K.; El Emam, K.; Neisa, A.; Roffey, T. Estimating the re-identification risk of clinical data sets. BMC Med. Inform. Decis. Mak. 2012, 12, 1–15.
- Data Masking and Encryption are Different. Available online: https://www.iri.com/blog/data-protection/data-masking-and-data-encryption-are-not-the-same-things (accessed on 10 December 2021).
- Deleger, L.; Lingren, T.; Ni, Y.; Kaiser, M.; Stoutenborough, L.; Marsolo, K.; Kouril, M.; Molnar, K.; Solti, I. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J. Biomed. Inform. 2014, 50, 173–183.
- Tomashchuk, O.; Van Landuyt, D.; Pletea, D.; Wuyts, K.; Joosen, W. A data utility-driven benchmark for de-identification methods. In Proceedings of the International Conference on Trust and Privacy in Digital Business, Linz, Austria, 26–29 August 2019; Springer: Cham, Switzerland, 2019; pp. 63–77.
- Naldi, M.; D’Acquisto, G. Big Data and Privacy by Design. Anonymization, Pseudo-Anonymization and Security; Giappichelli, G.: Torino, Italy, 2019.
- Kayaalp, M. Modes of De-identification. In American Medical Informatics Association Annual Symposium (AMIA) 2017, Washington, DC, USA, 4–8 November 2017; AMIA: Bethesda, MD, USA, 2017; p. 1044.
- Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the 23rd International Conference on Data Engineering ICDE 2007, Istanbul, Turkey, 17–20 April 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 106–115.
- Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3.
- Dutta, A.; Bhattacharyya, A.; Sen, A. Comparative Analysis of Anonymization Techniques. In Privacy and Security Issues in Big Data. Services and Business Process Reengineering; Das, P.K., Tripathy, H.K., Mohd Yusof, S.A., Eds.; Springer: Singapore, 2021; pp. 69–78.
- HL7 Clinical Document Architecture (CDA). Available online: http://www.hl7.org/implement/standards/product_brief.cfm?product_id=7 (accessed on 8 December 2021).
- HL7 Fast Healthcare Interoperability Resources (FHIR). Available online: https://www.hl7.org/fhir/ (accessed on 8 December 2021).
- Ciampi, M.; Marangio, F.; Schmid, G.; Sicuranza, M. A Blockchain-based Smart Contract System Architecture for Dependable Health Processes. In Proceedings of the Italian Conference on Cybersecurity ITASEC 2021, Virtual Event, Italy, 7–9 April 2021; pp. 360–373. Available online: https://www.rheagroup.com/event/itasec-2021/ (accessed on 8 December 2021).
- Hripcsak, G.; Duke, J.D.; Shah, N.H.; Reich, C.G.; Huser, V.; Schuemie, M.J.; Suchard, M.S.; Park, R.W.; Wong, I.C.K.; Rijnbeek, P.R.; et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for observational researchers. Stud. Health Technol. Inform. 2015, 216, 574–578.
- Overhage, J.M.; Ryan, P.B.; Reich, C.G.; Hartzema, A.G.; Stang, P.E. Validation of a common data model for active safety surveillance research. J. Am. Med. Inform. Assoc. 2012, 19, 54–60.
- OHDSI FhirToCdm Github Repository. Available online: https://github.com/OHDSI/FhirToCdm (accessed on 18 January 2022).
- Pfaff, E.R.; Champion, J.; Bradford, R.L.; Clark, M.; Xu, H.; Fecho, K.; Krishnamurthy, A.; Cox, S.; Chute, C.G.; Overby Taylor, C.; et al. Fast Healthcare Interoperability Resources (FHIR) as a Meta Model to Integrate Common Data Models: Development of a Tool and Quantitative Validation Study. JMIR Med. Inform. 2019, 7, e15199.
- OMOP on FHIR. Available online: https://omoponfhir.org (accessed on 18 January 2022).
- Murphy, S.; Wilcox, A. Mission and Sustainability of Informatics for Integrating Biology and the Bedside (i2b2). EGEMS 2014, 2, 1074.
- Boussadi, A.; Zapletal, E. A Fast Healthcare Interoperability Resources (FHIR) layer implemented over i2b2. BMC Med. Inform. Decis. Mak. 2017, 17, 120.
- FHIR2TranSMART. Available online: https://github.com/thehyve/python_fhir2transmart (accessed on 18 January 2022).
- TranSMART Project. Available online: https://github.com/transmart (accessed on 18 January 2022).
- Berg, H.; Henriksson, A.; Fors, U.; Dalianis, H. De-identification of Clinical Text for Secondary Use: Research Issues. In HEALTHINF 2021; Online Streaming, 11–13 February 2021; SCITEPRESS; 2021; pp. 592–599. Available online: https://www.scitepress.org/Papers/2021/103187/103187.pdf (accessed on 23 December 2021).
- Somolinos, R.; Muñoz, A.; Hernando, M.E.; Pascual, M.; Cáceres, J.; Sánchez-de-Madariaga, R.; Fragua, J.A.; Serrano, P.; Salvador, C.H. Service for the Pseudonymization of Electronic Healthcare Records Based on ISO/EN 13606 for the Secondary Use of Information. IEEE J. Biomed. Health Inform. 2015, 19, 1937–1944.
- ISO 13606-1; Electronic Health Record Communication Part 1: Reference Model. International Organization for Standardization: Geneva, Switzerland, 2008.
- Hripcsak, G.; Mirhaji, P.; Low, A.F.H.; Malin, B.A. Preserving temporal relations in clinical data while maintaining privacy. J. Am. Med. Inform. Assoc. 2016, 23, 1040–1045.
- WhiteRabbit for ETL Design. Available online: https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design (accessed on 16 January 2022).
- OHDSI Usagi. Available online: http://ohdsi.github.io/Usagi (accessed on 16 January 2022).
- Park, J.; You, S.C.; Jeong, E.; Weng, C.; Park, D.; Roh, J.; Lee, D.Y.; Cheong, J.Y.; Choi, J.W.; Kang, M.; et al. A Framework (SOCRATex) for Hierarchical Annotation of Unstructured Electronic Health Records and Integration into a Standardized Medical Database: Development and Usability Study. JMIR Med. Inform. 2021, 9, e23983.
- Ciampi, M.; De Pietro, G.; Masciari, E.; Silvestri, S. Health Data Information Retrieval For Improved Simulation. In Proceedings of the 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Västerås, Sweden, 11–13 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 364–368.
- Bender, D.; Sartipi, K. HL7 FHIR: An agile and RESTful approach to healthcare information exchange. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical System, Porto, Portugal, 20–22 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 326–331.
- Kreimeyer, K.; Foster, M.; Pandey, A.; Arya, N.; Halford, G.; Jones, S.F.; Forshee, R.; Walderhaug, M.; Botsis, T. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. J. Biomed. Inform. 2017, 73, 14–29.
- Alicante, A.; Corazza, A.; Isgrò, F.; Silvestri, S. Unsupervised entity and relation extraction from clinical records in Italian. Comput. Biol. Med. 2016, 72, 263–275.
- Osmani, V.; Li, L.; Danieletto, M.; Glicksberg, B.; Dudley, J.; Mayora, O. Processing of electronic health records using deep learning: A review. arXiv 2018, arXiv:1804.01758.
- Azure Healthcare APIs, A Unified Solution That Helps Protect and Combine Health Data in the Cloud and Generates Healthcare Insights with Analytics. Available online: https://azure.microsoft.com/en-us/services/healthcare-apis/#overview (accessed on 1 December 2021).
- The HAPI FHIR Library, an Implementation of the HL7 FHIR Specification for Java. Available online: https://hapifhir.io (accessed on 1 December 2021).
- Ayaz, M.; Pasha, M.F.; Alzahrani, M.Y.; Budiarto, R.; Stiawan, D. Standard: Systematic Literature Review of Implementations, Applications, Challenges and Opportunities. JMIR Med. Inform. 2021, 9, e21929.
- Khalique, F.; Khan, S.A. An FHIR-based Framework for Consolidation of Augmented EHR from Hospitals for Public Health Analysis. In Proceedings of the 2017 IEEE 11th International Conference on Application of Information and Communication Technologies (AICT), Moscow, Russia, 20–22 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4.
- Jiang, G.; Xiao, G.; Kiefer, R.C.; Prud’hommeaux, E.; Solbrig, H.R. Building an FHIR Ontology based Data Access Framework with the OHDSI Data Repositories. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA) 2017, Washington, DC, USA, 4–8 November 2017; AMIA: Bethesda, MD, USA, 2017.
- Lee, Y.L.; Lee, H.A.; Hsu, C.Y.; Kung, H.H.; Chiu, H.W. Implement an international interoperable phr by FHIR—A Taiwan innovative application. Sustainability 2021, 13, 198.
- Hong, J.; Morris, P.; Seo, J. Interconnected Personal Health Record Ecosystem Using IoT Cloud Platform and HL7 FHIR. In Proceedings of the 2017 IEEE International Conference on Healthcare Informatics (ICHI), Park City, UT, USA, 23–26 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 362–367.
- Apache Spark. Available online: https://spark.apache.org/docs/latest (accessed on 10 December 2021).
- Gargiulo, F.; Silvestri, S.; Ciampi, M.; De Pietro, G. Deep neural network for hierarchical extreme multi-label text classification. Appl. Soft Comput. 2019, 79, 125–138.
- Scalar User Defined Functions (UDFs). Available online: https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html (accessed on 2 December 2021).
- MongoDB: The Application Data Platform. Available online: https://www.mongodb.com (accessed on 10 December 2021).
- Armbrust, M.; Xin, R.S.; Lian, C.; Huai, Y.; Liu, D.; Bradley, J.K.; Meng, X.; Kaftan, T.; Franklin, M.J.; Ghodsi, A.; et al. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD’ 15), New York, NY, USA, 31 May–4 June 2015; ACM: New York, NY, USA, 2015; pp. 1383–1394.
Attribute | De-Identification Technique |
---|---|
Address | L-diversity algorithm |
Date | L-diversity algorithm |
Identifier | Deletion of the value |
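
One possible way to read the table above is as a rule set that selects a technique for each attribute. The sketch below encodes the three rows as a lookup and applies only the deletion rule, since the L-diversity rows require the whole data set rather than a single record. The names `RULES`, `select_technique`, and `delete_identifiers`, as well as the sample record, are hypothetical and do not come from the described implementation.

```python
# Attribute-to-technique rules copied from the table.
RULES = {
    "Address": "L-diversity algorithm",
    "Date": "L-diversity algorithm",
    "Identifier": "Deletion of the value",
}


def select_technique(attribute):
    """Return the configured technique, or None if no rule applies."""
    return RULES.get(attribute)


def delete_identifiers(record):
    """Drop every attribute whose rule is 'Deletion of the value'."""
    return {name: value for name, value in record.items()
            if select_technique(name) != "Deletion of the value"}


# Invented sample record, for illustration only.
sample = {"Identifier": "ABC123", "Date": "2021-05-04", "Code": "I10"}
cleaned = delete_identifiers(sample)
```

Here `cleaned` keeps `Date` and `Code` and drops `Identifier`. Attributes mapped to the L-diversity algorithm would be handled in a later, data-set-level step.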
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ciampi, M.; Sicuranza, M.; Silvestri, S. A Privacy-Preserving and Standard-Based Architecture for Secondary Use of Clinical Data. Information 2022, 13, 87. https://doi.org/10.3390/info13020087