arXiv:2010.01165 (cs)

[Submitted on 2 Oct 2020 (v1), last revised 25 Mar 2021 (this version, v2)]

Title: Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Authors: Zeljko Kraljevic, Thomas Searle, Anthony Shek, Lukasz Roguski, Kawsar Noor, Daniel Bean, Aurelie Mascio, Leilei Zhu, Amos A Folarin, Angus Roberts, Rebecca Bendayan, Mark P Richardson, Robert Stewart, Anoop D Shah, Wai Keong Wong, Zina Ibrahim, James T Teo, Richard JB Dobson

Abstract: Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.

Comments:	Preprint: 27 Pages, 3 Figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2010.01165 [cs.CL]
	(or arXiv:2010.01165v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.01165

Submission history

From: Zeljko Kraljevic
[v1] Fri, 2 Oct 2020 19:01:02 UTC (1,758 KB)
[v2] Thu, 25 Mar 2021 13:21:50 UTC (1,868 KB)
