Dataset compiled for Natural Language Processing using a corpus of medical transcriptions and custom-generated clinical stop words and vocabulary.
Clone or download files for use in medical text Natural Language Processing (NLP) experiments.
mtsamples.csv. Compiled from Kaggle's medical transcriptions dataset by Tara Boyle, scraped from Transcribed Medical Transcription Sample Reports and Examples. See Kaggle repository.
clinical-stopwords.txt. Compiled from Dr. Kavita Ganesan's clinical-concepts repository. See the Discovering Related Clinical Concepts Using Large Amounts of Clinical Notes paper.
vocab.txt. Generated vocabulary text file for Natural Language Processing (NLP) using the Systematized Nomenclature of Medicine International (SNMI) data. See how to Generate your own vocab file.
X.csv. Fully processed dataset obtained from running the Data Modelling notebook. The dataset is simplified to 4 classes.
classes.txt. Text file describing the dataset's classes: Surgery, Medical Records, Internal Medicine and Other.
train.csv. Training data subset. Contains 90% of the processed X.csv file.
test.csv. Test data subset. Contains 10% of the processed X.csv file.
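
The 90%/10% relationship between X.csv, train.csv and test.csv can be illustrated with a short Python sketch. This is only an assumption about how such a split might be produced: the listing above does not say whether the shipped split was random, stratified, or seeded, and the helper names `read_csv_rows` and `split_rows` are hypothetical. Only the standard library is used.

```python
import csv
import random


def read_csv_rows(path):
    """Read a CSV file with a header row into a list of dicts (one per row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists.

    Assumption: a plain random split. The split actually used to
    produce train.csv and test.csv is not documented here.
    """
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Stand-in rows; with the real data this could be read_csv_rows("X.csv").
    demo_rows = [{"id": i} for i in range(100)]
    train, test = split_rows(demo_rows)
    print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs. Applying `split_rows` to X.csv will not necessarily reproduce the exact rows in the provided train.csv and test.csv.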