Learn how to write Hadoop Streaming programs using python.
In this section, you will learn how to work with Hadoop Streaming, a tool to run any executable in Hadoop MapReduce. We will show how to count the frequency of different values of event-id for each patient event sequence file. The examples here are shown in Python code, but you will find that it's straightforward to adapt this concept to other languages.
Mapper and Reducer
Streaming works by passing a data mapper and reducer written in another programming language through standard input and output. Let's have a look at the source code1 for the mapper and reducer one at at time.
Mapper
The source code for the mapper is:
This script reads lines from standard input and with some simple processing outputs to standard output the event_name as the key and 1 as the value.
Reducer
The reducer is a little bit more complex. The output of the mapper will be shuffled by Hadoop framework's shuffle process (a part of MapReduce) and the reduder will get a list of key-value pairs. The MapReduce framework guarantees that all key-value pairs with the same key will go to same reducer instance.
The source code for the reducer is:
This script checks the boundaries of the sorted input and sums up values from same key.
How to run
Local test
Before running it in Hadoop, it's more convenient to test the code in shell using the cat and sort commands. You will need to navigate to the sample/hadoop-streaming folder. Then, run the below command in the shell:
You will see results like:
Now that we've verified that it works as expected, we can run it in Hadoop.
Hadoop
We first need to put the data into HDFS, then run hadoop:
You will need to update both the mapper and reducer, for the mapper:
and the reducer looks like:
Further reading
Streaming is a good machanism to reuse existing code. Wrapping existing code to work with Hadoop can be simplified with framework like mrjob and Luigi for Python. You can find more explanation and description of Streaming in its offical documentation.