Notes on Basic NLP Methods for Processing Clinical Notes in the MIMIC-3 Database

Jeong Min Lee (jlee@cs.pitt.edu) / Jan 22, 2020


In this note, I briefly explain how to get started working with clinical notes in the MIMIC-3 database using Python-based NLP toolkits.



1. Clinical Notes Data in MIMIC-3 Database

MIMIC-3 contains clinical notes written by physicians and nurses as free text.

Official documentation is available for the note data, and it is useful for understanding the meaning of each column: NOTEEVENTS. In the NOTEEVENTS table, the TEXT column contains the note in free-text form.


Category of the notes

The table also has a CATEGORY column. Depending on which category you use, you could end up with a different analysis or task. So instead of using all notes indiscriminately, be specific about the kinds of notes you will use. The following table lists the number of notes by category.
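If the counts are not at hand, they can be recomputed from the NOTEEVENTS table itself. The following is a minimal sketch using Pandas, assuming the table has already been read into a DataFrame with a CATEGORY column. The rows below are invented placeholders so the example runs without the real data, and stripping whitespace from the category values is a precaution I am assuming is harmless.

```python
import pandas as pd

# Tiny stand-in for the NOTEEVENTS table. The rows are invented placeholders,
# not real MIMIC-3 records; only the column names follow the table.
notes = pd.DataFrame(
    {
        "CATEGORY": ["Discharge summary", "Nursing", "Radiology", "Nursing "],
        "TEXT": [
            "placeholder note a",
            "placeholder note b",
            "placeholder note c",
            "placeholder note d",
        ],
    }
)

# Precaution (an assumption): remove stray whitespace around category names
# so that "Nursing " and "Nursing" are counted together.
notes["CATEGORY"] = notes["CATEGORY"].str.strip()

# Number of notes per category, most frequent first.
category_counts = notes["CATEGORY"].value_counts()
print(category_counts)

# Keep only one kind of note for the downstream task.
discharge_notes = notes[notes["CATEGORY"] == "Discharge summary"]
print(len(discharge_notes))
```

Filtering to a single category early keeps the later preprocessing steps focused on one kind of note.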



Gaining access to MIMIC-3 Database

To gain access to the MIMIC-3 database, use this link to request it: Requesting access


For a brief exploration of the MIMIC-3 database, you can use the demo dataset, which does not require any request or permission: MIMIC-III Clinical Database Demo v1.4. Note, however, that the demo dataset does not include the notes. To work with the notes, you must have full access to the database.



2. Loading the note data into Python

As the text data resides in a CSV file, you can use Python-based data libraries such as Pandas. I recommend the following blog articles on learning how to use Pandas to load a CSV dataset into your Python session:
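Independently of those articles, the loading step itself can be sketched in a few lines. This is a minimal example, assuming the table was exported as a CSV file; the path `NOTEEVENTS.csv` and the helper `load_notes` are hypothetical names chosen for illustration, and the demo rows are invented so the code runs without the real data.

```python
import io

import pandas as pd

# Hypothetical location of the exported table; adjust to your own setup.
NOTEEVENTS_PATH = "NOTEEVENTS.csv"


def load_notes(source, columns=("SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT")):
    """Read only the needed columns of a NOTEEVENTS-style CSV.

    `source` may be a file path or any file-like object.
    """
    return pd.read_csv(source, usecols=list(columns))


# Demo on an in-memory CSV with invented rows, so the sketch runs
# without access to the real file.
sample_csv = io.StringIO(
    "ROW_ID,SUBJECT_ID,HADM_ID,CATEGORY,TEXT\n"
    '1,10,100,Nursing,"placeholder note, first"\n'
    '2,11,101,Radiology,"placeholder note, second"\n'
)
notes = load_notes(sample_csv)
print(notes.shape)
print(notes["TEXT"].iloc[0])
```

With the real file, the call would be `load_notes(NOTEEVENTS_PATH)`. If the file does not fit in memory, `pd.read_csv` also accepts a `chunksize` argument for reading it in pieces.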



3. Clean up your free text

Once you have loaded the text into Python, you should clean it up with basic processing methods. The basic methods I mention here are (1) tokenization, which divides a big chunk of text into smaller parts called tokens, and (2) stemming, which is a kind of normalization that removes suffixes. Also, at some point in your project, you will need to convert the original words into index numbers so that they can be fed into ML/NLP models.
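The word-to-index step can be illustrated with plain Python. This is only a sketch under my own assumptions: the helpers `build_vocab` and `encode` are hypothetical names, the input is assumed to be already tokenized, and reserving index 0 for unknown words is one common convention rather than a requirement.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Index 0 is reserved for unknown tokens (a convention chosen for this sketch).
    vocab = {"<unk>": 0}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(doc, vocab):
    """Replace each token by its index, falling back to the unknown index."""
    return [vocab.get(token, vocab["<unk>"]) for token in doc]


# Invented, already-tokenized example documents.
docs = [
    ["patient", "admitted", "with", "chest", "pain"],
    ["patient", "denies", "pain"],
]
vocab = build_vocab(docs)
encoded = encode(["patient", "reports", "pain"], vocab)
print(encoded)
```

Words that were never seen when building the vocabulary, such as "reports" here, are mapped to the unknown index instead of raising an error.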


There are many NLP packages out there, but the one I really recommend is a toolkit named NLTK (Natural Language Toolkit; see the NLTK 3.4.5 documentation). It is lightweight, very straightforward, and easy to start with! Another really good thing about NLTK is that it provides a detailed, book-like tutorial: NLTK Book. With it, you will find most of the techniques you need to get started with basic data preprocessing in NLP.


For tokenization and stemming with NLTK, I also found the following blog posts useful:
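As a small illustration of tokenization and stemming with NLTK, here is a sketch that uses `wordpunct_tokenize` and `PorterStemmer`. I chose these two because they work without downloading extra NLTK data; `word_tokenize` is a common alternative but needs the Punkt data to be downloaded first. The helper name `preprocess` and the sample sentence are made up for this example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def preprocess(text):
    """Lowercase, tokenize, keep alphabetic tokens, and stem them."""
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(text.lower())
    # Drop punctuation and numbers; whether to do so depends on the task.
    return [stemmer.stem(tok) for tok in tokens if tok.isalpha()]


# Invented sentence, not taken from MIMIC-3.
sample = "Patient was admitted with worsening chest pains."
stems = preprocess(sample)
print(stems)
```

Note that stemming is crude by design: it strips suffixes by rule, so some outputs are not dictionary words.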



4. Further steps towards training Word2Vec models

The same blog contains a good tutorial on getting started with word2vec models using another package named Gensim. Gensim is a Python-based library for training various NLP models, including word2vec (skip-gram and CBOW) and topic models (LDA).



Comprehensive tutorial on predicting readmission using discharge notes

I found a very nice tutorial on predicting hospital readmission with discharge summaries: Introduction to Clinical Natural Language Processing: Predicting Hospital Readmission with Discharge Summaries. It uses NLTK to preprocess the text and a logistic regression model from the Scikit-Learn package to train a classifier. I highly recommend following the step-by-step instructions in the post.
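For orientation before reading the post, the general shape of such a text classifier in Scikit-Learn looks roughly like the sketch below. This is not the tutorial's code: it uses `CountVectorizer` with its built-in tokenization instead of NLTK, and the texts and labels are invented toy examples rather than MIMIC-3 data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy texts and labels (1 = readmitted, 0 = not readmitted).
texts = [
    "patient discharged home in stable condition",
    "patient discharged with persistent chest pain",
    "uneventful recovery discharged home",
    "recurrent pain and shortness of breath at discharge",
]
labels = [0, 1, 0, 1]

# Bag-of-words features followed by logistic regression.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# Predicted probability of the positive class for a new, invented note.
probs = clf.predict_proba(["discharged home with mild pain"])[:, 1]
print(probs)
```

A real experiment would also need a held-out test set and an evaluation metric, which the tutorial covers step by step.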