Develop models using a Python machine learning module.
In this section, you will learn how to build a heart failure (HF) predictive model. You should have finished the previous Spark Application section. You will first learn how to train a model using Spark MLlib and save it. Next, you will learn how to achieve the same goal using the Python scikit-learn machine learning module for verification purposes.
MLlib
You will first load data and compute some high-level summary
statistics, then train a classifier to predict heart failure.
Load Samples
Start by loading the samples that you saved in the previous Spark Application section.
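As a rough illustration of what loading involves, the sketch below parses samples from text lines in plain Python. It assumes the samples were saved in LIBSVM text format (a label followed by index:value pairs), which this section does not state explicitly. The helper names parse_libsvm_line and load_samples are hypothetical and are not part of Spark.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def load_samples(lines):
    """Parse every non-empty line; blank lines are skipped."""
    return [parse_libsvm_line(line) for line in lines if line.strip()]


# Two made-up lines standing in for the saved samples.
samples = load_samples(["1 3:0.5 10:2", "", "0 1:1"])
print(samples)
```

In Spark itself, MLlib's MLUtils.loadLibSVMFile reads this format into an RDD of labeled points, so hand-written parsing like this is only needed outside Spark.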
Basic Statistics
Spark MLlib provides various functions to compute summary statistics that are useful when doing machine learning and data analysis tasks.
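To make concrete what such summary statistics contain, here is a minimal plain-Python sketch that computes the per-column mean, sample variance, number of non-zeros, minimum and maximum of a small dense matrix. It is not the MLlib API; column_stats and the toy rows are assumptions made for illustration.

```python
import statistics


def column_stats(rows):
    """Compute simple per-column summary statistics for a list of equal-length rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        "count": len(rows),
        "mean": [statistics.mean(c) for c in columns],
        "variance": [statistics.variance(c) for c in columns],  # sample variance (n - 1)
        "num_nonzeros": [sum(1 for v in c if v != 0) for c in columns],
        "max": [max(c) for c in columns],
        "min": [min(c) for c in columns],
    }


# Three made-up samples with two features each.
rows = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0]]
stats = column_stats(rows)
print(stats["mean"], stats["num_nonzeros"])
```

The number of non-zeros per column is a quick way to see how sparse each feature is before training.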
Split data
In a typical machine learning problem, we need to split the data into a training set and a testing set. Here we use 60% of the samples for training and the remaining 40% for testing.
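The idea of a seeded random 60/40 split can be sketched in plain Python as follows. The function name random_split and the seed value are assumptions for illustration. Unlike Spark's randomSplit, which samples each element independently and so yields only approximately 60/40 sizes, this sketch shuffles once and cuts at an exact position.

```python
import random


def random_split(samples, train_fraction=0.6, seed=0):
    """Shuffle a copy of the samples and cut it into (training, testing) lists."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = random_split(range(10))
print(len(train), len(test))
```

Fixing the seed matters when you want to compare models: every run then trains and tests on the same samples.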
Train classifier
Let's train a linear SVM model using Stochastic Gradient Descent (SGD) on the training set to predict heart failure.
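To show what training a linear SVM with SGD means mechanically, here is a small self-contained sketch in plain Python. It is not MLlib's implementation: the function names, the hyperparameter values (num_epochs, step_size, reg_param) and the toy data are all assumptions chosen for illustration. Labels are assumed to be 0 or 1 and are mapped to -1 and +1 internally.

```python
import random


def train_linear_svm_sgd(samples, num_epochs=50, step_size=0.1, reg_param=0.01, seed=0):
    """Train a linear SVM (hinge loss + L2 penalty) by SGD.

    samples: list of (label, features) pairs with label in {0, 1}.
    Returns (weights, bias).
    """
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][1])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(num_epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            label, x = samples[i]
            y = 1.0 if label == 1 else -1.0  # map {0, 1} to {-1, +1}
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Gradient step on the L2 penalty: shrink the weights slightly.
            weights = [w * (1.0 - step_size * reg_param) for w in weights]
            # Gradient step on the hinge loss, active only when the margin is below 1.
            if margin < 1.0:
                weights = [w + step_size * y * v for w, v in zip(weights, x)]
                bias += step_size * y
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the sample falls on the positive side of the hyperplane, else 0."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Made-up, already centered toy data: label 1 stands for heart failure.
toy = [(0, [-2.0, -1.0]), (0, [-1.0, -2.0]), (0, [-2.0, -2.0]),
       (1, [2.0, 1.0]), (1, [1.0, 2.0]), (1, [2.0, 2.0])]
weights, bias = train_linear_svm_sgd(toy)
print(weights, bias)
```

Each step looks at a single sample, which is why SGD scales to data sets that do not fit on one machine: the per-sample updates can be computed in parallel and combined.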
Testing
For each sample in the testing set, output a (prediction, label) pair, and calculate the prediction accuracy. We use Spark's broadcast mechanism to ship the model to the workers once instead of copying it with every task.
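The accuracy computation over (prediction, label) pairs is simple enough to sketch in plain Python. The pairs below are made up, and the sketch runs on a single machine, so it does not involve broadcasting.

```python
def accuracy(prediction_and_label):
    """Fraction of (prediction, label) pairs where the two values agree."""
    pairs = list(prediction_and_label)
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# Four made-up test results: three correct, one wrong.
pairs = [(1, 1), (0, 1), (0, 0), (1, 1)]
print(accuracy(pairs))
```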
Save & load model
In a real-world setting, you may need to save the trained model. You can achieve that by directly serializing your model object with Java's ObjectOutputStream and saving it.
Scikit-learn
The data set is often small enough after the feature construction described in the previous Spark Application section. In that case, you may consider training and testing the predictive model with familiar tools such as scikit-learn in Python or some R packages. Here we show how to do that with scikit-learn, a Python machine learning library.
Fetch data
In order to work with scikit-learn, you will need to take the data out of HDFS and into the local file system. You can get the samples folder from your home directory in HDFS and merge its content into one single local file with the hdfs dfs -getmerge command.
Move on with Python
In later steps, you will use the Python interactive shell. To open it, just type python in bash. The prompt that appears shows the version and distribution of the Python installation you are using. Here we pre-installed Anaconda.
Load and split data
Now we can load the data and split it into training and testing sets in a similar way to the MLlib approach.
Train classifier
Let's train a linear SVM model again on the training set to predict heart failure.
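A minimal sketch of this step with scikit-learn's LinearSVC might look like the following. The toy data, the variable names and the choice of C=1.0 are assumptions for illustration; in practice X_train and y_train would come from the training split of your own samples.

```python
from sklearn.svm import LinearSVC

# Made-up, already centered toy features; label 1 stands for heart failure.
X_train = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
           [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Linear SVM; C controls regularization strength (smaller C means stronger regularization).
model = LinearSVC(C=1.0, random_state=0)
model.fit(X_train, y_train)

# Predict labels for two unseen points, one near each cluster.
predictions = model.predict([[-1.5, -1.5], [1.5, 1.5]])
print(predictions)
```

After fitting, the learned weights are available as model.coef_ and the bias as model.intercept_.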
Testing
We can then compute the prediction accuracy and the AUC (area under the ROC curve) on the testing set.
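scikit-learn provides accuracy_score and roc_auc_score in sklearn.metrics for these two metrics. To clarify what AUC measures, the sketch below computes it by hand in plain Python: it is the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The function name roc_auc and the toy labels and scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and decision scores for four test samples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

Note that AUC is computed from continuous scores (for a linear SVM, the output of decision_function), not from the thresholded 0/1 predictions used for accuracy.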
Save & load model
We can save and load the trained model with Python's pickle serialization module.
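A minimal sketch of the save and load round trip is shown here. To keep it self-contained, a plain dictionary of made-up weights stands in for the trained model, and the file name model.pkl in a temporary directory is an assumption. A fitted scikit-learn estimator can be passed to pickle.dump in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: made-up weights and intercept.
model = {"weights": [0.5, -1.25, 0.0], "intercept": -0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save: pickle files must be opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back from disk.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Only unpickle files that you trust, since loading a pickle can execute arbitrary code. A pickled scikit-learn model should also be loaded with the same library version that saved it.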
Sparsity and predictive features
Since we have limited training data but a large number of features, we may consider using an L1 penalty on the model to regularize the parameters. An L1 penalty drives many coefficients to exactly zero, which yields a sparse model.
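One way to sketch this with scikit-learn is to standardize the features and then fit LinearSVC with penalty="l1". The toy data, variable names and C=1.0 are assumptions for illustration. Note that LinearSVC supports the L1 penalty only together with dual=False.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Made-up toy data: only the first feature separates the two classes.
X_train = [[0.0, 1.0, 5.0], [0.2, 3.0, 4.0], [0.1, 2.0, 6.0],
           [1.0, 2.0, 5.0], [0.9, 1.0, 6.0], [1.1, 3.0, 4.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Scale each feature to zero mean and unit variance so coefficients are comparable.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# L1-regularized linear SVM; the L1 penalty requires dual=False in LinearSVC.
sparse_model = LinearSVC(penalty="l1", dual=False, C=1.0, random_state=0)
sparse_model.fit(X_scaled, y_train)

# One row of coefficients, one entry per feature.
print(sparse_model.coef_)
```

Lowering C strengthens the penalty and tends to push more coefficients to zero. The same fitted scaler must be applied to the testing data before prediction.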
Before fitting the model, we scale the data so that the weights of different features are comparable. With the sparse model from the previous example, we can identify predictive features according to their coefficients. Here we assume you did the last exercise of the previous Spark Application section. If not, please do that first.
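Ranking features by coefficient can be sketched as below. The coefficient values and feature names are made up, and top_features is a hypothetical helper. With a real model, the coefficients would come from the fitted estimator (for example the first row of coef_) and the names from your own feature mapping.

```python
def top_features(coefficients, feature_names, k=3):
    """Return the k features with the largest absolute non-zero coefficients."""
    nonzero = [(name, c) for name, c in zip(feature_names, coefficients) if c != 0.0]
    return sorted(nonzero, key=lambda pair: abs(pair[1]), reverse=True)[:k]


# Made-up coefficients of a sparse model and hypothetical feature names.
coefficients = [0.0, -2.0, 0.5, 1.5]
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
ranked = top_features(coefficients, feature_names, k=2)
print(ranked)
```

The sign of a coefficient indicates the direction of the association: positive values push the prediction toward heart failure and negative values away from it. Magnitudes are comparable only because the features were scaled first.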