For both the homework and the project, we will use the MIMIC-III Critical Care Database. This page describes the dataset and the procedure for obtaining access to it.
MIMIC-III is a large, openly-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
Among the types of data included are:
MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors:
During this course, we will be working with the MIMIC database. MIMIC, although de-identified, still contains detailed information regarding the clinical care of patients, and must be treated with appropriate care and respect.
You must finish CITI training first to get MIMIC access. Do NOT request access individually.
We will collect all student information and send a batch request to MIT. You will be notified once that is done, and only then should you send your own access request.
Throughout the training exercises on this site we will use a small sample data set. If you followed the instructions on the environment setup page to set up your environment, you will find the sample data in the `/bigdata-bootcamp/data` folder in the virtual environment.
There are two data files, named `case.csv` and `control.csv` respectively. For the purpose of these exercises we will define patients who developed heart failure (HF) at some time point as case patients, and those who didn't develop HF as control patients.
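
The case/control definition can be turned into per-patient labels in a few lines. The following is only a minimal sketch, not the course's own loader: it assumes that each row starts with the patient id (as in the tuple layout described below), and the helper name `label_patients` is hypothetical.

```python
def label_patients(case_lines, control_lines):
    """Map patient-id -> 1 (case, developed HF) or 0 (control, no HF).

    Both arguments are iterables of CSV lines, for example open file objects.
    Assumption: the patient id is the first comma-separated field of each row.
    """
    labels = {}
    for label, lines in ((1, case_lines), (0, control_lines)):
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            patient_id = line.split(",", 1)[0]
            labels[patient_id] = label
    return labels


# Hypothetical usage, assuming the files sit directly in the data folder:
# with open("/bigdata-bootcamp/data/case.csv") as case_f, \
#      open("/bigdata-bootcamp/data/control.csv") as control_f:
#     labels = label_patients(case_f, control_f)
```

Passing iterables of lines rather than file paths keeps the function easy to try on a handful of rows before pointing it at the real files.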
Each line of the sample data file consists of a tuple structured as `(patient-id, event-id, timestamp, value)`. Below are a few lines as an example:
020E860BD31CAC69,DRUG36987254604,968,30.0
020E860BD31CAC69,DRUG64158080642,974,30.0
020E860BD31CAC69,DRUG00440128228,976,60.0
020E860BD31CAC69,DIAG486,907,1.0
020E860BD31CAC69,DIAG7863,907,1.0
020E860BD31CAC69,DIAGV5866,907,1.0
020E860BD31CAC69,DIAG3659,907,1.0
020E860BD31CAC69,DIAGRG199,907,1.0
020E860BD31CAC69,PAYMENT,907,15000.0
020E860BD31CAC69,heartfailure,956,1.0
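
As an illustration of the tuple layout, rows like the ones above could be read into typed records as follows. This is a sketch under assumptions: the names `Event` and `parse_events` are hypothetical, and the timestamp is treated as an integer because every sample row shows one.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    patient_id: str
    event_id: str
    timestamp: int   # assumption: integer offset, as in the sample rows
    value: float


def parse_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one Event per non-empty CSV line."""
    for row in csv.reader(lines):
        if not row:
            continue  # csv.reader returns [] for blank lines
        patient_id, event_id, timestamp, value = row
        yield Event(patient_id, event_id, int(timestamp), float(value))


sample = [
    "020E860BD31CAC69,DRUG36987254604,968,30.0",
    "020E860BD31CAC69,DIAG486,907,1.0",
    "020E860BD31CAC69,heartfailure,956,1.0",
]
events = list(parse_events(sample))
```

Because `csv.reader` accepts any iterable of strings, the same function works on an open file object as well as on an in-memory list.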
`patient-id` is a patient identifier (id) used to differentiate records from different patients. For example, the portion of data shown above is all about the same patient, whose id is `020E860BD31CAC69`.
`event-id` encodes the clinical events that a patient has had. For example, `DRUG00440128228` indicates that the patient was taking a drug identified by the National Drug Code `00440128228`. The numbers in `DIAG486` are the first 3 digits of an ICD9 code, which in this case is the code for Pneumonia. For this data an event-id of `PAYMENT` means that the patient made a payment with the corresponding dollar amount.
`timestamp` indicates the date on which the event on that row happened. Here the timestamp is not formatted as a real date but rather as an offset from an unspecified start point. This is done both to simplify processing and to protect the privacy of the patients' data.
`value` is the associated value for an event. See the table below for a detailed description of the data in the value field.

event type | sample event-id | value meaning | example |
---|---|---|---|
diagnostic code | DIAG486 | Always 1.0 for diagnosis events | 1.0 |
drug consumption | DRUG00440128228 | Dosage of the drug | 30 |
payment | PAYMENT | Amount of payment made on timestamp date | 15000 |
heartfailure | heartfailure | Indicator of heart failure event | 1.0 |
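
The event types in the table above can be recovered from the `event-id` prefix, and the offset-style timestamps can be compared with plain subtraction. The sketch below is illustrative only: the function names are hypothetical, and reading the timestamp difference as a number of days is an assumption, since the source only says the value is an offset from an unspecified start point.

```python
from typing import Tuple


def event_type(event_id: str) -> Tuple[str, str]:
    """Return (event type, code) for an event-id, following the table above."""
    if event_id.startswith("DIAG"):
        return "diagnostic code", event_id[len("DIAG"):]
    if event_id.startswith("DRUG"):
        return "drug consumption", event_id[len("DRUG"):]
    if event_id == "PAYMENT":
        return "payment", ""
    if event_id == "heartfailure":
        return "heartfailure", ""
    return "unknown", event_id


def offsets_to_hf(rows):
    """For ONE patient's rows (patient_id, event_id, timestamp, value),
    return [(event_id, hf_timestamp - timestamp)] for every non-HF event,
    or None if the patient has no heartfailure row.

    A positive offset means the event happened before the first HF event.
    Assumption: offsets are in days.
    """
    rows = list(rows)
    hf_times = [ts for _, eid, ts, _ in rows if eid == "heartfailure"]
    if not hf_times:
        return None
    hf_time = min(hf_times)
    return [(eid, hf_time - ts) for _, eid, ts, _ in rows if eid != "heartfailure"]


rows = [
    ("020E860BD31CAC69", "DIAG486", 907, 1.0),
    ("020E860BD31CAC69", "DRUG36987254604", 968, 30.0),
    ("020E860BD31CAC69", "heartfailure", 956, 1.0),
]
print(event_type("DRUG00440128228"))  # ('drug consumption', '00440128228')
print(offsets_to_hf(rows))            # [('DIAG486', 49), ('DRUG36987254604', -12)]
```

Keeping the code as a string preserves leading zeros in drug codes such as `00440128228`, which would be lost if the code were converted to an integer.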