Hadoop 1. Setting up on Ubuntu (Single-Node Cluster) - 2018
In this chapter, we'll set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu. So, what are the required steps?
Note: a newer version of this tutorial is available:
Hadoop Installation on Ubuntu 14.04.
The Hadoop framework is written in Java, so a working JDK is the first requirement.
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java

# Update the source list
$ sudo apt-get update

# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
$ sudo apt-get install openjdk-6-jdk

# Set default java
$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1071      auto mode
  1            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      manual mode
  2            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1071      manual mode

Press enter to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
We can check the java setup:
$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu2.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
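To see which JDK directory the selected java binary actually lives in (we'll need this path later for JAVA_HOME), we can resolve the alternatives symlink chain; the exact path may differ on your machine:

$ readlink -f /usr/bin/java
/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java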
Though not required, we're going to use a dedicated Hadoop user account for running Hadoop. It is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
This will add the user hduser and the group hadoop to our local machine.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
	Full Name []: K Hong
	Room Number []: 123
	Work Phone []: 123
	Home Phone []: 123
	Other []: 123
Is the information correct? [Y/n] Y
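As an optional sanity check that the account and group were created (the numeric uid/gid values will vary), we can inspect the new user and switch into the account:

$ id hduser
uid=1001(hduser) gid=1001(hadoop) groups=1001(hadoop)
$ su - hduser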
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user.
We need SSH up and running on our machine, configured to allow SSH public key authentication. If it is not installed yet:
$ sudo apt-get install openssh-server
$ sudo ufw allow 22
First, we have to generate an SSH key for the hduser user:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
3d:1f:31:b4:bf:72:91:40:60:1e:8e:a5:d7:e1:83:99 hduser@K
The key's randomart image is:
+--[ RSA 2048]----+
...
We created an RSA key pair with an empty passphrase. Generally, an empty passphrase is not recommended, but in this case it is needed so the key can be used without our interaction: we don't want to enter a passphrase every time Hadoop interacts with its nodes.
We have to enable SSH access to our local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
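If the key-based login below is refused, the usual culprit is file permissions: OpenSSH requires the .ssh directory and authorized_keys file to be accessible only by the owner. A quick, optional fix:

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys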
The final step is to test the SSH setup by connecting to our local machine with the hduser user. The step is also needed to save our local machine's host key fingerprint to the hduser user's known_hosts file. If we have any special SSH configuration for our local machine like a non-standard SSH port, we can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 18:3a:d8:a2:2a:12:0d:90:87:b6:36:0a:b1:96:d6:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
...
To disable IPv6 system-wide, open /etc/sysctl.conf and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
We have to reboot our machine in order to make the changes take effect.
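Alternatively, the new kernel settings can be applied without a full reboot by reloading /etc/sysctl.conf (a reboot still guarantees every running service picks them up):

$ sudo sysctl -p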
We can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled, which is what we want.
We can also disable IPv6 only for Hadoop. We can do so by adding the following line to /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to /usr/local/hadoop. Then, change the owner of all the files to the hduser user and hadoop group:
$ cd /usr/local
$ sudo tar xvzf hadoop-2.2.0.tar.gz
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
Let's work on the $HOME/.bashrc file of the hduser user.
Set Hadoop-related environment variables:
export HADOOP_HOME=/usr/local/hadoop
Set JAVA_HOME. We're going to configure JAVA_HOME directly for Hadoop later:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
Aliases and functions for running Hadoop-related commands:
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
If we have LZO compression enabled in our Hadoop cluster and compress job outputs with LZOP, we need lzop installed (sudo apt-get install lzop). The following helper decompresses an LZOP-compressed file in HDFS and pages through its first lines:
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
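A usage example (the HDFS path here is purely hypothetical):

$ lzohead /user/hduser/output/part-00000.lzo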
Add Hadoop bin/ directory to PATH:
export PATH=$PATH:$HADOOP_HOME/bin
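After saving .bashrc, reload it so the new variables and aliases take effect in the current shell. The hadoop version call is just an optional sanity check and its exact output will vary:

$ source $HOME/.bashrc
$ hadoop version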
Let's configure JAVA_HOME.
$ find / -name hadoop-env.sh -print
/usr/local/hadoop/etc/hadoop/hadoop-env.sh

Open hadoop-env.sh and check if the JAVA_HOME environment variable is set to:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
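If it is not, one way to set it in place is a sed one-liner; this is just a sketch, run as hduser (who owns the file after the chown above), and the JDK path should be adjusted to whatever your system reports:

$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64|' /usr/local/hadoop/etc/hadoop/hadoop-env.sh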
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc.
Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
We can leave the settings below "as is", with the exception of the hadoop.tmp.dir parameter, which we change to /app/hadoop/tmp for this tutorial. Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory for both the local file system and HDFS.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
etc/hadoop/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
etc/hadoop/mapred-site.xml (created from mapred-site.xml.template; see the copy step after the snippet):
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
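The Hadoop 2.2.0 tarball ships only the template file; a sketch of creating mapred-site.xml from it, assuming the layout found earlier:

$ cd /usr/local/hadoop/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml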
etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>