For the purpose of the environment normalization, we provide a simple docker image for you, which contains most of the software required by this course. We also provide a few scripts to install some optional packages.
The whole progress would seem as follow:
Since this docker image integrated a lot of related services for the course, it requires at least 4GB RAM for this virtual machine. If your can meet the minimum requirement, the system could randomly kill one or a few process due to resource limitation, which causes a lot of strange errors which is even unable to reproduce.
DON'T TRY TO DO THAT.
You should have enough system resource if you are planning to start a container in your local OS.
You are supposed to reserve at least 4 GB RAM for Docker, and some other memory for the host machine. While, you can still start all the Hadoop related services except Zeppelin, even if you only reserve 4GB for the virtual machine.
Docker is a software technology providing operating-system-level virtualization also known as containers, promoted by the company Docker, Inc.. Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces, and a union-capable file system such as OverlayFS and others to allow independent "containers" to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines (VMs). (from Wikipedia)
Basically, you can treat docker as a lightweight virtual machine hosted on Linux with a pretty high performance.
The principle of setting up a docker environment is pretty straightforward.
There is an official instruction from this link. You can check the official documentation to get the latest news and some detail explanations.
Once the docker installed, you should get a few commands start from docker and able to start your docker service, and launch your docker container.
If we are using VirtualBox + Windows/macOS, the theory is pretty clear: we created a Linux instance in "virtual remote", and control it using docker-machine. If we are supposed to operate the "remote docker service", we are supposed to prepare a set of environment variables. We can list it using command:
docker-machine env default
This is the reason that why do we have to execute the follow command to access the docker.
eval $(docker-machine env default)
If you are using docker-machine, you can not reach the port from virtual machine using ip 127.0.0.1 (localhost). As replacement, you should extract the IP using this command:
$ printenv | grep "DOCKER_HOST"
DOCKER_HOST=tcp://192.168.99.100:2376
And then you should visit 192.168.99.100
instead of 127.0.0.1
to visit the network stream from virtual machine.
If these environment are unsetted, docker will try to connect to the default unix socket file /var/run/docker.sock
.
As a Docker.app user, this file is:
$ ls -alh /var/run/docker.sock
lrwxr-xr-x 1 root daemon 55B Feb 10 19:09 /var/run/docker.sock -> /Users/yu/Library/Containers/com.docker.docker/Data/s60
$ ls -al /Users/yu/Library/Containers/com.docker.docker/Data/s60
srwxr-xr-x 1 yu staff 0 Feb 10 19:09 /Users/yu/Library/Containers/com.docker.docker/Data/s60
As a Linux user, the situation is slightly different:
$ ls -al /var/run/docker.sock
srw-rw---- 1 root root 0 Feb 11 11:35 /var/run/docker.sock
A Linux user must add a "sudo" before command docker
since he has no access to docker.sock
as an ordinary user.
The basic start command should be:
docker run -it --privileged=true \
--cap-add=SYS_ADMIN \
-m 8192m -h bootcamp1.docker \
--name bigbox -p 2222:22 -p 9530:9530 -p 8888:8888\
-v /:/mnt/host \
sunlab/bigbox:latest \
/bin/bash
In general, the synopsis of docker run
is
docker run [options] image[:tag|@digest] [command] [args]
Here is a case study to the options:
-p host-port:vm-port
This option is used to map the TCP port vm-port
in the container to port host-port
on the Docker host.
Currently, vm-port
s are reserved to:
Once you started the Zeppelin service, this service will keep listening port 9530
in docker. You should able to visit this service using http://127.0.0.1:9530
or http://DOCKER_HOST_IP:9530
.
This remote IP depends on the Docker Service you are running, which has already described above.
host-port
-v, --volume=[host-src:]container-dest[:<options>]
This option is used to bind mount a volume.
Currently, we are using -v /:/mnt/host
. In this case, we can visit the root of your file system for your host machine. If you are using macOS, /mnt/host/Users/<yourname>/
would be the $HOME
of your MacBook. If you are using Windows, you can reach your C:
disk from /mnt/host/c
in docker.
Variable host-src
accepts absolute path only.
-it
-h bootcamp1.docker
Once you enter this docker environment, you can ping this docker environment itself as bootcamp1.docker
. This variable is used in some configuration files for Hadoop ecosystems.
-m 8192m
Memory limit (format: <number>[<unit>]
). Number is a positive integer. Unit can be one of b
, k
, m
, or g
.
This docker image requires at least 4G RAM, 8G RAM is recommended. However, if your local Physical Machine has ONLY 8G RAM, you are recommended to reduce this number to 4G.
Local machine is not the same as the remote server. If you are launching a remote server with 8G RAM, you can set this number as 7G.
If you are interested in the detail explanation of the args, please visit this link
In generally, when you are in front of the command line interface, you will meet 2 kinds of prompt.
# whoami # this prompt is '#'
#indices you are root aka the administrator of this environment now
root
$ whoami # this promot is '$' indices you are a ordinary user now
yu
Of course, it is pretty easy to change, you can simply update the environment variable PS1
.
Assumption: every script is executed by root
.
/scripts/start-services.sh
This script will help you start a the services for Hadoop ecosystems. You may meet "Connection Refused" exception if you did something else before started these services.
If you wish to host Zeppelin, you should install it first by using the command:
/scripts/install-zeppelin.sh
and start the service by using command:
/scripts/start-zeppelin.sh
then, Zeppelin will listen the port 9530
Note: Please keep all the service are running before installing/starting Zeppelin.
If you wish to host Jupyter, you can start it by using command:
/scripts/start-jupyter.sh
Jupyter will listen the port 8888
You can stop services if you want:
/scripts/stop-services.sh
To detach instance for keeping it up,
ctrl + p, ctrl + q
To exit,
exit
If you detached a instance and want to attach again,
check the CONTAINER ID
or NAMES
of it.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
011547e95ef5 sunlab/bigbox:latest "/tini -- /bin/bash" 6 hours ago Up 4 seconds 0.0.0.0:8888->8888/tcp, 0.0.0.0:9530->9530/tcp, 0.0.0.0:2222->22/tcp bigbox
If the "STATUS" column is similar to "Exited (0) 10 hours ago", you are supposed to start the container again.
$ docker start <CONTAINER ID or NAMES>
Then attach it by:
$ docker attach <CONTAINER ID or NAMES>
Every time you restart your container, you are supposed to start all those services again before any HDFS related operations.
If you want to permanently remove container
$ docker rm <CONTAINER ID or NAMES>
If you want to permanently remove images
List images first
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sunlab/bigbox latest bfd258e00de3 16 hours ago 2.65GB
Remove them by REPOSITORY or IMAGE ID using command:
$ docker rmi <REPOSITORY or IMAGE ID>
$ docker pull sunlab/bigbox
Please refer to this link for the introduction of images, containers, and storage drivers.
# cat /proc/meminfo | grep Mem # Current Memory
MemTotal: 8164680 kB # Note: This value shoud no less than 4GB
MemFree: 175524 kB
MemAvailable: 5113340 kB
# cat /proc/cpuinfo | grep 'model name' | head -1 # CPU Brand
model name : Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
# cat /proc/cpuinfo | grep 'model name' | wc -l # CPU Count
4
# df -h # List Current Hard Disk Usage
Filesystem Size Used Avail Use% Mounted on
overlay 32G 4.6G 26G 16% /
tmpfs 64M 0 64M 0% /dev
...
# ps -ef # List Current Running Process
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 01:38 pts/0 00:00:00 /tini -- /bin/bash
root 7 1 0 01:38 pts/0 00:00:00 /bin/bash
root 77 1 0 01:43 ? 00:00:00 /usr/sbin/sshd
zookeep+ 136 1 0 01:43 ? 00:00:14 /usr/lib/jvm/java-openjdk/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,ROLLINGFILE -cp /usr/lib/zookeeper/bin/../build/classes:/
yarn 225 1 0 01:43 ? 00:00:13 /usr/lib/jvm/java/bin/java -Dproc_proxyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-pr
...
# lsof -i:9530 # Find the Process Listening to Some Specific Port
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 3165 zeppelin 189u IPv4 229945 0t0 TCP *:9530 (LISTEN)
[root@bootcamp1 /]# su hdfs # This command is used to switch your current user to hdfs
# Note: switch user requires special permission
# You can not switch back using su root again
bash-4.2$ whoami # check current user
hdfs
bash-4.2$ exit # role is a stack, you can quit your role from hdfs to root
[root@bootcamp1 /]#
[root@bootcamp1 /]# sudo -u hdfs whoami # execute a command 'whoami' using user 'hdfs'
hdfs
[root@bootcamp1 /]#
User hdfs
is the super user in HDFS system. User root
is the super user in Linux system.
[root@bootcamp1 /]# sudo -u hdfs hdfs dfs -mkdir /tmp
In this case, user root has no permission to write data in /
, but it could ask user hdfs to process it.
An absolute or full path points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. wiki.
When we are talking about /mnt/host
, it always pointing to the path /mnt/host
. However, if the path is not startswith "/", it means to start from "current working path".
In Linux system, you can get your "current working path" using command
# pwd
/root
In HDFS system, the "current working path" would be /user/<your-name>
.
A relative path would be the result of cwd
plus your string.
When we are coding in hadoop, we may required to fill in a location pointing to the path of input files. The synopsis of this path is is
[schema://]your-path
An HDFS path hdfs:///hw1/test.csv
is combined by hdfs://
and /hw1/test.csv
. There are 3 slashes over there. If you only filled 2 slashes over there (hdfs://hw1/test.csv
), it is equal to hdfs:///user/root/hw1/test.csv
, which may not be expected.
Ditto for file:///path/to/your/local/file.csv
.
This environment is based on CentOS 7. This course does not requires you have too much knowledge in Linux, but if you can use some basic commands, that would be better.
If you are interested, please refer to this link for a Unix/Linux Command Cheat Sheet.