medical-nlp

A dataset compiled for Natural Language Processing (NLP), built from a corpus of medical transcriptions together with custom-generated clinical stop words and vocabulary.

Usage

Clone the repository or download the files for use in medical text Natural Language Processing (NLP) experiments.

mtsamples.csv. Compiled from Tara Boyle's medical transcriptions dataset on Kaggle, which was scraped from Transcribed Medical Transcription Sample Reports and Examples. See the Kaggle repository.
clinical-stopwords.txt. Compiled from Dr. Kavita Ganesan's clinical-concepts repository. See the paper Discovering Related Clinical Concepts Using Large Amounts of Clinical Notes.
vocab.txt. Generated vocabulary text file for Natural Language Processing (NLP), built from the Systematized Nomenclature of Medicine International (SNMI) data. See how to Generate your own vocab file.
X.csv. Fully processed dataset obtained by running the Data Modelling notebook, with the labels simplified to 4 classes.
classes.txt. Text file describing the dataset's classes: Surgery, Medical Records, Internal Medicine, and Other.
train.csv. Training data subset. Contains 90% of the X.csv processed file.
test.csv. Test data subset. Contains 10% of the X.csv processed file.
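
As a rough illustration of how these files might be used together, the sketch below loads a word list, filters stop words out of a piece of text, and holds out 10% of a set of rows. It rests on assumptions that this README does not confirm: that clinical-stopwords.txt and vocab.txt hold one entry per line, and that a simple shuffled split is acceptable. The helper names (load_word_list, remove_stop_words, split_rows) are hypothetical, and the split shown is not necessarily how train.csv and test.csv were produced.

```python
import random


def load_word_list(path):
    """Read a text file into a set of lowercase entries.

    Assumes one entry per line; blank lines are skipped.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop-word tokens."""
    return [token for token in text.lower().split() if token not in stop_words]


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle rows and return (train, test), holding out about test_fraction.

    An illustrative split only; the seed makes the shuffle repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, load_word_list("clinical-stopwords.txt") would return the stop words as a set, which can then be passed to remove_stop_words along with the text of a single transcription.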
