Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed File System. Although Hadoop supports many other filesystems (e.g., Amazon S3), HDFS is the most popular choice and will be used throughout this bootcamp. Therefore, in this section, you will learn how to move data between your local filesystem and HDFS.
Hadoop provides a command-line utility, `hdfs`, to interact with HDFS. Basic filesystem operations live under the `hdfs dfs` subcommand. Let's play with some basic operations.
When you use HDFS for the first time, it's likely that your home directory in HDFS has not been created yet. Your home directory in HDFS is `/user/<username>/` by default. In the environment that we provide, there's a special user, `hdfs`, who is an HDFS administrator and has the permission to create new home directories.
First, `cd` into `bigdata-bootcamp/vm` and run `vagrant up` followed by `vagrant ssh`. You will then need to switch to the `hdfs` user.
Then, you can create the directory and change the ownership of the newly created folder.
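The standard commands look like this (run as the `hdfs` user; the `-p` flag creates parent directories if they don't exist yet):

```bash
hdfs dfs -mkdir -p /user/<username>
hdfs dfs -chown <username> /user/<username>
```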
Please remember to change `<username>` to your actual Linux user name (e.g., `user2`). Finally, switch back to your own user with `exit`.
Note: the username needs to correspond to your login on the machine (i.e., whatever appears before the `@` in your prompt). If your machine is `vagrant@bigtop1`, then set the username to `vagrant`.
Similar to creating a local directory with the Linux command `mkdir`, you can create a folder named `input` in HDFS with:
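```bash
hdfs dfs -mkdir input
```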
where `hdfs` is the HDFS utility program, `dfs` is the subcommand that handles basic HDFS operations, `-mkdir` means you want to create a directory, and the directory name is given as `input`. The command above creates the `input` directory inside your HDFS home directory; of course, you can create directories elsewhere by specifying an absolute or relative path.
Suppose you followed the previous instructions and created a directory named `input`; you can then copy data from the local filesystem to HDFS using `-put`.
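For example (the exact local paths here are assumptions; adjust them to wherever you keep the sample data files):

```bash
hdfs dfs -put case.csv input
hdfs dfs -put control.csv input
```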
You can find a detailed description of these two files in sample data.
Similar to `-put`, the `-get` operation copies data out of HDFS to the local filesystem. For example:
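```bash
hdfs dfs -get input/case.csv local_case.csv
```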
This copies the `input/case.csv` file out of HDFS into the current working directory under the new name `local_case.csv`. If you don't specify `local_case.csv`, the original name `case.csv` is kept. You can verify your copy with the `-ls` and `-cat` operations described below.
Just like the Linux `ls` command, `-ls` lists files and folders in HDFS. For example, the following command lists the items in your HDFS home directory (i.e., `/user/<username>/`):
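```bash
hdfs dfs -ls
```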
You should see the newly created `input` directory listed. You can also list the files inside a particular directory, such as the `input` folder created earlier:
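```bash
hdfs dfs -ls input
```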
Actually, you don't need to copy files out of HDFS to see their contents; you can use `-cat` directly to print the contents of files in HDFS. For example, the following command prints the contents of the file you just put into HDFS:
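```bash
hdfs dfs -cat input/case.csv
```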
You will find wildcard characters very useful, since the output of MapReduce and other Hadoop-based tools tends to be a directory. For example, to print the contents of all CSV files (`case.csv` and `control.csv`) in the `input` HDFS folder, you can run:
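```bash
hdfs dfs -cat input/*.csv
```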
For more detailed usage of the different commands and their parameters, you can type:
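```bash
hdfs dfs -help

# or get help for a single command, e.g.
hdfs dfs -help put
```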
To remove a directory from HDFS, use the `-rm` operation. You may miss the `-r` option and get an error; `-r` tells HDFS to remove recursively. This is similar to the Linux command `rm -r`.
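For example, to delete the `input` directory and everything inside it:

```bash
hdfs dfs -rm -r input
```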