This section shows the basic usage of Apache Hive. Hive runs on top of Hadoop and uses a SQL-like language called HiveQL. Instead of writing raw MapReduce programs, you can use Hive to perform data warehouse tasks in a simple and familiar query language. After completing this section, you will be able to use HiveQL to query big data.
In the sample code below, we continue to use the same patient event tuple data. First, start the Hive CLI interactive shell by typing hive on the command line.
Note: If you are using the Docker image, the sample code was pre-cloned to "/bootcamp", so you can simply go to "/bootcamp/sample/hive" instead.
Before loading data, we first need to define a table, just as we would when working with a relational database server.
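As a minimal sketch of such a table definition, the snippet below writes a CREATE TABLE statement to a file and runs it with hive -f when the Hive CLI is available. The column names and types are assumptions derived from the event tuple (patient, event, date offset, value), and the file name create_events.hql is hypothetical. The same statement can be pasted directly into the interactive shell.

```shell
# Write the DDL to a file so it can be replayed later with `hive -f`.
# Column names and types are ASSUMPTIONS based on the event tuple;
# adjust them to match your CSV files.
cat > create_events.hql <<'EOF'
CREATE TABLE events (
  patient_id  STRING,
  event_name  STRING,
  date_offset INT,
  value       INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
EOF

# Run the statement only when the hive CLI is on the PATH.
if command -v hive >/dev/null 2>&1; then
  hive -f create_events.hql
fi
```

ROW FORMAT DELIMITED with a comma terminator tells Hive to split each text line into columns at commas, which matches plain CSV files without quoted fields.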
If you encounter an error at this point that names a folder in HDFS, you can simply create that folder via hdfs and assign its ownership to your current user.
In the Docker image, the current user is 'root'.
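The fix described above might look like the following sketch. The folder /user/root is an assumption (a typical HDFS home directory for the root user); substitute the path named in your own error message. It also assumes that an hdfs superuser account exists and that sudo is available.

```shell
# TARGET_USER and the /user/<name> layout are ASSUMPTIONS; use the
# folder reported in your own error message.
TARGET_USER="root"
HDFS_DIR="/user/${TARGET_USER}"

if command -v hdfs >/dev/null 2>&1; then
  # Create the folder as the hdfs superuser, then hand it over.
  sudo -u hdfs hdfs dfs -mkdir -p "$HDFS_DIR"
  sudo -u hdfs hdfs dfs -chown "$TARGET_USER" "$HDFS_DIR"
else
  echo "hdfs not found; would create $HDFS_DIR for $TARGET_USER"
fi
```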
You can check the existing tables and the schema of a table with the commands SHOW TABLES; and DESCRIBE table_name; respectively.
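Both commands can also be issued without entering the interactive shell, using the CLI's -e option to pass a query string. A small sketch, assuming the table is named events:

```shell
TABLE="events"   # assumed table name
CHECK_SQL="SHOW TABLES; DESCRIBE ${TABLE};"

if command -v hive >/dev/null 2>&1; then
  # -e runs the quoted statements and then exits.
  hive -e "$CHECK_SQL"
else
  echo "hive not found; would run: $CHECK_SQL"
fi
```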
Next, we'll load data into the table.
This command tries to load every file in "data/" as CSV and saves the rows to the table events. Double-check the directory first: make sure it contains only *.csv files, and move any unrelated files elsewhere.
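The directory check and the load can be combined in one sketch. The helper non_csv_files is a hypothetical name introduced here, and the LOCAL and OVERWRITE keywords are assumptions: LOCAL reads from the local file system rather than HDFS, and OVERWRITE replaces any rows already in the table.

```shell
# Print every regular file under a directory that does not end in .csv.
non_csv_files() {
  find "$1" -type f ! -name '*.csv'
}

DATA_DIR="data"
if [ ! -d "$DATA_DIR" ]; then
  echo "no such directory: $DATA_DIR"
elif [ -n "$(non_csv_files "$DATA_DIR")" ]; then
  echo "move these files away before loading:"
  non_csv_files "$DATA_DIR"
elif command -v hive >/dev/null 2>&1; then
  # LOCAL and OVERWRITE are ASSUMPTIONS; drop OVERWRITE to append instead.
  hive -e "LOAD DATA LOCAL INPATH '${DATA_DIR}' OVERWRITE INTO TABLE events;"
fi
```

Hive does not validate the file contents at load time, so a stray non-CSV file would surface later as rows of NULL values rather than as a load error.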
With the data loaded you can run familiar SQL statements like:
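For instance, a per-patient event count could be sketched as follows. The column name patient_id is an assumption about the table schema, and explore_events.hql is a hypothetical file name; the query itself can be pasted into the interactive shell as well.

```shell
cat > explore_events.hql <<'EOF'
-- Ten patients with the most events (column names are assumed).
SELECT patient_id, COUNT(*) AS event_count
FROM events
GROUP BY patient_id
ORDER BY event_count DESC
LIMIT 10;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -f explore_events.hql
fi
```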
You can also save query results to a directory on the local file system:
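One way to do this is an INSERT OVERWRITE LOCAL DIRECTORY statement, sketched below. The output directory tmp_result, the file name export_events.hql, and the column patient_id are assumptions chosen for illustration. Note that OVERWRITE replaces whatever the target directory already contains.

```shell
cat > export_events.hql <<'EOF'
-- Write the result as comma-separated text files under tmp_result/.
-- The directory name and column names are ASSUMPTIONS.
INSERT OVERWRITE LOCAL DIRECTORY 'tmp_result'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT patient_id, COUNT(*)
FROM events
GROUP BY patient_id;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -f export_events.hql
fi
```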
You can learn more about Hive syntax from the language manual.
Besides running commands in the interactive shell, you can also run a script in batch mode. For example, in the sample/hive folder, you can run the entire sample.hql script with the command:
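Batch mode uses the CLI's -f option, which executes the statements in a file and then exits. A guarded sketch, assuming the current directory is sample/hive:

```shell
SCRIPT="sample.hql"

if command -v hive >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
  # -f runs every statement in the file, in order, then exits.
  hive -f "$SCRIPT"
else
  echo "skipping: hive or $SCRIPT not available"
fi
```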
This script simply contains all of the commands that we ran in the shell, with one additional statement to drop the existing table if necessary:
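Such a guard is typically written with IF EXISTS, so the statement succeeds whether or not the table is present. The sketch below writes it to a hypothetical file, drop_events.hql, and assumes the table is named events.

```shell
cat > drop_events.hql <<'EOF'
-- Makes a script safe to re-run: no error if the table is absent.
DROP TABLE IF EXISTS events;
EOF

if command -v hive >/dev/null 2>&1; then
  hive -f drop_events.hql
fi
```

Placing this statement ahead of the table definition lets the whole script be run repeatedly from a clean state.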
Furthermore, it's also possible to run Hive as a server and connect to the server with JDBC or with its Beeline client.
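As a rough sketch of the Beeline route, the command below connects over JDBC and runs a single statement. It assumes a HiveServer2 instance is already running on localhost at port 10000 (the usual default) and that the user is root; all three values are assumptions to adapt to your setup.

```shell
JDBC_URL="jdbc:hive2://localhost:10000"   # assumed host and port
HIVE_USER="root"                          # assumed user

if command -v beeline >/dev/null 2>&1; then
  # -u: JDBC URL, -n: user name, -e: statement to execute.
  beeline -u "$JDBC_URL" -n "$HIVE_USER" -e "SHOW TABLES;"
else
  echo "beeline not found; would connect to $JDBC_URL as $HIVE_USER"
fi
```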
Hive translates queries into a series of MapReduce jobs. Because each job carries startup and scheduling overhead, Hive is not suitable for real-time, low-latency use cases. Alternative tools inspired and influenced by Hive, such as Cloudera Impala and Spark SQL, have been getting more attention lately.