Hadoop 2. Running on Ubuntu for Single-Node Cluster - 2018
In the previous chapter (Setting up Hadoop on Ubuntu), we set up Hadoop on Ubuntu. Now we want to check that we did it correctly.
$ pwd
/usr/local/hadoop/sbin
$ ls
distribute-exclude.sh    start-all.cmd        stop-all.sh
hadoop-daemon.sh         start-all.sh         stop-balancer.sh
hadoop-daemons.sh        start-balancer.sh    stop-dfs.cmd
hdfs-config.cmd          start-dfs.cmd        stop-dfs.sh
hdfs-config.sh           start-dfs.sh         stop-secure-dns.sh
httpfs.sh                start-secure-dns.sh  stop-yarn.cmd
mr-jobhistory-daemon.sh  start-yarn.cmd       stop-yarn.sh
refresh-namenodes.sh     start-yarn.sh        yarn-daemon.sh
slaves.sh                stop-all.cmd         yarn-daemons.sh
Run the following command:
$ /usr/local/hadoop/sbin/start-all.sh
This will start up a NameNode, a DataNode, a SecondaryNameNode, a ResourceManager, and a NodeManager on our machine (in Hadoop 2 with YARN, the ResourceManager and NodeManager take over the roles of the old JobTracker and TaskTracker). My output looks like this:
$ /usr/local/hadoop/sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
14/02/24 22:17:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-K-PC.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-K-PC.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-K-PC.out
14/02/24 22:18:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-K-PC.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-K-PC.out
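As the deprecation notice above suggests, the same daemons can also be started with the two separate scripts from the sbin listing instead of start-all.sh. A minimal sketch, assuming the same /usr/local/hadoop install location:

$ # start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
$ /usr/local/hadoop/sbin/start-dfs.sh
$ # start the YARN daemons (ResourceManager, NodeManager)
$ /usr/local/hadoop/sbin/start-yarn.sh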
To check whether the expected Hadoop processes are running, we can use jps:
$ jps
5036 SecondaryNameNode
4869 DataNode
4183 ResourceManager
5573 Jps
5256 NodeManager
We can also check with netstat if Hadoop is listening on the configured ports:
$ netstat -plten | grep java
tcp   0  0 0.0.0.0:50010  0.0.0.0:*  LISTEN  1001  61428  4869/java
tcp   0  0 0.0.0.0:50075  0.0.0.0:*  LISTEN  1001  62496  4869/java
tcp   0  0 0.0.0.0:50020  0.0.0.0:*  LISTEN  1001  62313  4869/java
tcp   0  0 0.0.0.0:50090  0.0.0.0:*  LISTEN  1001  62866  5036/java
tcp6  0  0 :::8030        :::*       LISTEN  1001  56753  4183/java
tcp6  0  0 :::8031        :::*       LISTEN  1001  57468  4183/java
tcp6  0  0 :::8032        :::*       LISTEN  1001  57504  4183/java
tcp6  0  0 :::8033        :::*       LISTEN  1001  57544  4183/java
tcp6  0  0 :::8040        :::*       LISTEN  1001  65972  5256/java
tcp6  0  0 :::8042        :::*       LISTEN  1001  67666  5256/java
tcp6  0  0 :::34229       :::*       LISTEN  1001  65960  5256/java
tcp6  0  0 :::8088        :::*       LISTEN  1001  57456  4183/java
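Some of the ports in the netstat output are also web UIs, so another rough check is to request them over HTTP. This is only a sketch, assuming the default Hadoop 2 ports and daemons listening on localhost; each command should print 200 if the UI is up:

$ # ResourceManager web UI (port 8088 in the netstat output above)
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088
$ # DataNode web UI (port 50075)
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50075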
To stop all the daemons running on our machine, run the following command:
$ /usr/local/hadoop/sbin/stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
14/02/24 22:22:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: no namenode to stop
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
14/02/24 22:22:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
Now it's time to run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is a set of text files, and the output is a set of text files, each line of which contains a word and the number of times it occurred, separated by a tab.
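Once the input is in HDFS (we copy it there below), the WordCount job itself is typically launched from the examples jar that ships with Hadoop. This is only a sketch: the exact jar name depends on the Hadoop version, and the output directory gutenberg-output is a hypothetical name of my choosing; the input path matches the HDFS path used later in this chapter.

$ # run the bundled WordCount example; adjust the jar name to your Hadoop version
$ /usr/local/hadoop/bin/hadoop jar \
    /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /home/hduser/gutenberg /home/hduser/gutenberg-output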
We will use three ebooks from Project Gutenberg for this example, which can be downloaded from the following links:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- Ulysses by James Joyce
- The Notebooks of Leonardo Da Vinci
Download each ebook as a Plain Text UTF-8 file and store the files in a local temporary directory of your choice, for example /tmp/gutenberg:
$ ls -l
total 3512
-rw-r--r-- 1 hduser hadoop  661806 Feb 24 21:44 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1540091 Feb 24 21:44 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1391683 Feb 24 21:44 pg5000.txt
We need to restart our Hadoop cluster if it's not running already:
$ /usr/local/hadoop/sbin/start-all.sh
Before we run the actual MapReduce job, we need to copy the files from our local file system to Hadoop's HDFS:
$ pwd
/usr/local/hadoop/bin
$ ls
container-executor  hadoop  hadoop.cmd  hdfs  hdfs.cmd  mapred  mapred.cmd
rcc  test-container-executor  yarn  yarn.cmd
$ hadoop dfs -copyFromLocal /tmp/gutenberg /home/hduser/gutenberg
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/02/24 22:59:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
copyFromLocal: Call From K-PC/127.0.1.1 to localhost:54310 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
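As the DEPRECATED warning above indicates, the hdfs command is the preferred front end in Hadoop 2. The non-deprecated equivalent of the copy, using the same paths as above, would look like this:

$ # non-deprecated equivalent of 'hadoop dfs -copyFromLocal'
$ /usr/local/hadoop/bin/hdfs dfs -copyFromLocal /tmp/gutenberg /home/hduser/gutenberg
$ # list what landed in HDFS
$ /usr/local/hadoop/bin/hdfs dfs -ls /home/hduser/gutenberg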
I stopped here because of the connection error. I have to figure out what went wrong; I will come back to it later.
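For reference, "Connection refused" on localhost:54310 usually means the NameNode is not actually running; notice that the jps output earlier listed no NameNode, and stop-all.sh reported "no namenode to stop". A rough troubleshooting sketch follows. The log file name follows the pattern shown in the start-all.sh output, and reformatting the NameNode is only one possible fix; it erases everything currently stored in HDFS.

$ # check the NameNode log for the reason it did not start
$ less /usr/local/hadoop/logs/hadoop-hduser-namenode-K-PC.log
$ # if the log points to an unformatted or corrupted name directory, reformat it
$ # WARNING: this wipes all data currently stored in HDFS
$ /usr/local/hadoop/bin/hdfs namenode -format
$ # restart HDFS and confirm that a NameNode process now shows up in jps
$ /usr/local/hadoop/sbin/start-dfs.sh
$ jps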