For the purpose of environment normalization, we provide a simple Docker image which contains most of the software required by this course. We also provide a few scripts to install some optional packages.
The whole process looks as follows:
Since this Docker image integrates a lot of services related to the course, it requires at least 4 GB of RAM for the virtual machine. If you cannot meet this minimum requirement, the system may randomly kill one or a few processes due to resource limits, which causes a lot of strange errors that can be impossible to reproduce.
DON'T TRY TO DO THAT.
You should have enough system resources if you are planning to start a container in your local OS.
You are supposed to reserve at least 4 GB of RAM for Docker, plus some extra memory for the host machine. That said, you can still start all the Hadoop-related services except Zeppelin, even if you only reserve 4 GB for the virtual machine.
Docker is a software technology providing operating-system-level virtualization also known as containers, promoted by the company Docker, Inc. Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces, and a union-capable file system such as OverlayFS and others to allow independent "containers" to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines (VMs). (from Wikipedia)
Basically, you can treat Docker as a lightweight virtual machine hosted on Linux, with pretty high performance.
The principle of setting up a Docker environment is pretty straightforward.
There is an official instruction at this link. You can check the official documentation for the latest news and some detailed explanations.
Once Docker is installed, you should have a few commands starting with docker, and be able to start your Docker service and launch your Docker container.
If we are using VirtualBox + Windows/macOS, the theory is pretty clear: we create a Linux instance as a "virtual remote", and control it using docker-machine. Since we are operating a "remote docker service", we need to prepare a set of environment variables. We can list them using this command:
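A minimal sketch, assuming the docker-machine instance is named default (the usual name created by Docker Toolbox):

```shell
# print the environment variables that point the docker CLI
# at the VirtualBox VM (DOCKER_HOST, DOCKER_CERT_PATH, ...)
docker-machine env default
```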
This is the reason why we have to execute the following command to access Docker:
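Assuming the machine is named default, the standard way to load those variables into the current shell is:

```shell
# evaluate the export statements printed by `docker-machine env`
eval "$(docker-machine env default)"
```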
If you are using docker-machine, you cannot reach ports on the virtual machine using the IP 127.0.0.1 (localhost). Instead, you should extract the IP using this command:
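Again assuming the machine name default:

```shell
# print the IP address of the docker-machine VM, e.g. 192.168.99.100
docker-machine ip default
```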
Then you should visit 192.168.99.100 instead of 127.0.0.1 to reach network services running in the virtual machine.
If these environment variables are unset, Docker will try to connect to the default Unix socket file /var/run/docker.sock.
For a Docker.app user, this file is:
As a Linux user, the situation is slightly different: a Linux user must add "sudo" before the docker command, since an ordinary user has no access to docker.sock.
The basic start command should be:
In general, the synopsis of docker run is:
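From docker run --help, the general form is:

```shell
docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
```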
Here is a case study of the options:
-p host-port:vm-port
This option is used to map the TCP port vm-port in the container to port host-port on the Docker host.
Currently, the vm-ports are reserved as follows:
Once you have started the Zeppelin service, it will keep listening on port 9530 inside Docker. You should be able to visit this service at http://127.0.0.1:9530 or http://DOCKER_HOST_IP:9530.
The remote IP depends on the Docker service you are running, which has already been described above.
host-port
-v, --volume=[host-src:]container-dest[:<options>]
This option is used to bind mount a volume.
Currently, we are using -v /:/mnt/host. In this case, you can visit the root of your host machine's file system. If you are using macOS, /mnt/host/Users/<yourname>/ would be the $HOME of your MacBook. If you are using Windows, you can reach your C: drive from /mnt/host/c in Docker.
The variable host-src accepts absolute paths only.
-it
This keeps STDIN open (-i) and allocates a pseudo-TTY (-t), giving you an interactive shell inside the container.
-h bootcamp1.docker
Once you enter this Docker environment, you can ping the container itself as bootcamp1.docker. This hostname is used in some configuration files for the Hadoop ecosystem.
-m 8192m
Memory limit (format: <number>[<unit>]). Number is a positive integer. Unit can be one of b, k, m, or g.
This Docker image requires at least 4 GB of RAM; 8 GB is recommended. However, if your local physical machine has ONLY 8 GB of RAM, you are recommended to reduce this number to 4G.
A local machine is not the same as a remote server. If you are launching a remote server with 8 GB of RAM, you can set this number to 7G.
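Putting the options above together, a full start command might look like the sketch below. The image name course/bootcamp is a placeholder (use the actual image name provided by the course), and the port list assumes only Zeppelin (9530) and Jupyter (8888) need to be exposed:

```shell
# -p maps Zeppelin (9530) and Jupyter (8888) to the host;
# -v mounts the host root file system at /mnt/host;
# -h sets the hostname expected by the Hadoop configuration files;
# -m caps memory (lower to 4096m if your physical machine has only 8 GB).
# "course/bootcamp" is a placeholder image name.
docker run -it \
  -p 9530:9530 -p 8888:8888 \
  -v /:/mnt/host \
  -h bootcamp1.docker \
  -m 8192m \
  course/bootcamp /bin/bash
```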
If you are interested in a detailed explanation of the args, please visit this link.
In general, when you are in front of the command-line interface, you will meet two kinds of prompt.
Of course, it is pretty easy to change: you can simply update the environment variable PS1.
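For example, assuming Bash, the following sets a prompt that shows the user, host, and current directory, ending with # for root and $ for an ordinary user:

```shell
# \u = user, \h = host, \W = basename of cwd,
# \$ = '#' when running as root, '$' otherwise
export PS1='[\u@\h \W]\$ '
```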
Assumption: every script is executed by root.
This script will help you start the services for the Hadoop ecosystem. You may meet a "Connection Refused" exception if you do anything else before starting these services.
If you wish to host Zeppelin, you should install it first using this command:
and start the service using this command:
Zeppelin will then listen on port 9530.
Note: please keep all the services running before installing/starting Zeppelin.
If you wish to host Jupyter, you can start it using this command:
Jupyter will listen on port 8888.
You can stop the services if you want:
To detach from the instance while keeping it up, press Ctrl-p followed by Ctrl-q.
To exit the container (stopping its main process), type exit or press Ctrl-d.
If you detached an instance and want to attach it again, check its CONTAINER ID or NAMES.
If the STATUS column shows something like "Exited (0) 10 hours ago", you are supposed to start the container again.
Then attach it by:
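The attach workflow can be sketched as follows, where the container ID or name is whatever docker ps shows for your container:

```shell
docker ps -a                  # find the CONTAINER ID or NAMES; check STATUS
docker start <container-id>   # only needed if STATUS is "Exited (...)"
docker attach <container-id>  # reattach your terminal to the running container
```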
Every time you restart your container, you are supposed to start all those services again before performing any HDFS-related operations.
If you want to permanently remove a container:
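Assuming the container is already stopped:

```shell
# remove a stopped container by CONTAINER ID or name
docker rm <container-id-or-name>
```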
If you want to permanently remove images, list the images first:
Then remove them by REPOSITORY or IMAGE ID using this command:
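For example:

```shell
docker images                        # list images: REPOSITORY, TAG, IMAGE ID, ...
docker rmi <repository-or-image-id>  # remove an image
```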
Please refer to this link for the introduction of images, containers, and storage drivers.
User hdfs is the superuser of the HDFS system. User root is the superuser of the Linux system.
In this case, user root has no permission to write data in /, but it can ask user hdfs to do it.
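For example, creating a directory under the HDFS root has to be done as user hdfs; the path /hw1 below is just an illustration:

```shell
# root cannot write under the HDFS root, so run the command as user hdfs
sudo -u hdfs hdfs dfs -mkdir /hw1
sudo -u hdfs hdfs dfs -chown root /hw1   # hand ownership back to root if needed
```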
An absolute or full path points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. (wiki)
When we talk about /mnt/host, it always points to the path /mnt/host. However, if a path does not start with "/", it starts from the "current working directory".
In a Linux system, you can get your "current working directory" using this command:
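For example:

```shell
pwd      # print the absolute path of the current working directory
cd /tmp
pwd      # now prints /tmp
```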
In the HDFS system, the "current working directory" is /user/<your-name>.
A relative path is the result of joining the cwd with your string. For example, if the cwd is /user/root, the relative path hw1/test.csv resolves to /user/root/hw1/test.csv.
When we are coding in Hadoop, we may be required to fill in a location pointing to the path of the input files. The synopsis of this path is:
[schema://]your-path
An HDFS path hdfs:///hw1/test.csv is the combination of hdfs:// and /hw1/test.csv. There are 3 slashes there. If you only fill in 2 slashes (hdfs://hw1/test.csv), it is equal to hdfs:///user/root/hw1/test.csv, which may not be what you expect.
Ditto for file:///path/to/your/local/file.csv.
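The schema can be exercised with the hdfs CLI; the paths below are only examples:

```shell
hdfs dfs -ls hdfs:///hw1/test.csv  # explicit HDFS schema, absolute path
hdfs dfs -ls /hw1/test.csv         # same file: the schema defaults to HDFS
hdfs dfs -ls hw1/test.csv          # relative: resolves under /user/<your-name>
hdfs dfs -ls file:///mnt/host/     # local file system through the file:// schema
```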
This environment is based on CentOS 7. This course does not require you to have much knowledge of Linux, but it would be better if you can use some basic commands.
If you are interested, please refer to this link for a Unix/Linux command cheat sheet.