Hadoop 1. Setting up on Ubuntu (Single-Node Cluster) - 2018
In this chapter, we'll set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu. So, what are the required steps?
Note: a newer version of this tutorial is available:
Hadoop Installation on Ubuntu 14.04.
The Hadoop framework is written in Java, so a working JDK is the first requirement.
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java

# Update the source list
$ sudo apt-get update

# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
$ sudo apt-get install openjdk-6-jdk

# Set default java
$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1071      auto mode
  1            /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java   1061      manual mode
  2            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1071      manual mode

Press enter to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
We can check the java setup:
$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu2.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
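To see which JDK directory the selected java binary actually lives in (we'll need this path later for JAVA_HOME), we can resolve the alternatives symlink chain; the exact path may differ on your machine:

$ readlink -f /usr/bin/java
/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java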
Though not required, we're going to use a dedicated Hadoop user account for running Hadoop. It is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
This will add the user hduser and the group hadoop to our local machine.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
	Full Name []: K Hong
	Room Number []: 123
	Work Phone []: 123
	Home Phone []: 123
	Other []: 123
Is the information correct? [Y/n] Y
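As an optional sanity check that the account and group were created (the numeric uid/gid values will vary), we can inspect the new user and switch into the account:

$ id hduser
uid=1001(hduser) gid=1001(hadoop) groups=1001(hadoop)
$ su - hduser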
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user.
We need SSH up and running on our machine, configured to allow SSH public key authentication. If it is not installed yet:
$ sudo apt-get install openssh-server
$ sudo ufw allow 22
First, we have to generate an SSH key for the hduser user:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
3d:1f:31:b4:bf:72:91:40:60:1e:8e:a5:d7:e1:83:99 hduser@K
The key's randomart image is:
+--[ RSA 2048]----+
...
We created an RSA key pair with an empty passphrase. Generally, an empty passphrase is not recommended, but in this case it is needed so the key can be used without our interaction: we don't want to enter a passphrase every time Hadoop interacts with its nodes.
We have to enable SSH access to our local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
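If the key-based login below is refused, the usual culprit is file permissions: OpenSSH requires the .ssh directory and authorized_keys file to be accessible only by the owner. A quick, optional fix:

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys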
The final step is to test the SSH setup by connecting to our local machine with the hduser user. The step is also needed to save our local machine's host key fingerprint to the hduser user's known_hosts file. If we have any special SSH configuration for our local machine like a non-standard SSH port, we can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 18:3a:d8:a2:2a:12:0d:90:87:b6:36:0a:b1:96:d6:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
...
To disable IPv6 system-wide, open /etc/sysctl.conf and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
We have to reboot our machine in order to make the changes take effect.
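Alternatively, the new kernel settings can be applied without a full reboot by reloading /etc/sysctl.conf (a reboot still guarantees every running service picks them up):

$ sudo sysctl -p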
We can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled, which is what we want.
We can also disable IPv6 only for Hadoop. We can do so by adding the following line to /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to /usr/local/hadoop. Then, change the owner of all the files to the hduser user and hadoop group:
$ cd /usr/local
$ sudo tar xvzf hadoop-2.2.0.tar.gz
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
Let's work on the $HOME/.bashrc file of the hduser user.
Set Hadoop-related environment variables:
export HADOOP_HOME=/usr/local/hadoop
Set JAVA_HOME. We're going to configure JAVA_HOME directly for Hadoop later:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
Aliases and functions for running Hadoop-related commands:
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
If we have LZO compression enabled in our Hadoop cluster and compress job outputs with LZOP, we need lzop installed (sudo apt-get install lzop). The following helper decompresses an LZOP-compressed file in HDFS and pages through its first lines:
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
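A usage example (the HDFS path here is purely hypothetical):

$ lzohead /user/hduser/output/part-00000.lzo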
Add Hadoop bin/ directory to PATH:
export PATH=$PATH:$HADOOP_HOME/bin
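After saving .bashrc, reload it so the new variables and aliases take effect in the current shell. The hadoop version call is just an optional sanity check and its exact output will vary:

$ source $HOME/.bashrc
$ hadoop version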
Let's configure JAVA_HOME.
$ find / -name hadoop-env.sh -print
/usr/local/hadoop/etc/hadoop/hadoop-env.sh

Open hadoop-env.sh and check if the JAVA_HOME environment variable is set to:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
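If it is not, one way to set it in place is a sed one-liner; this is just a sketch, run as hduser (who owns the file after the chown above), and the JDK path should be adjusted to whatever your system reports:

$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64|' /usr/local/hadoop/etc/hadoop/hadoop-env.sh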
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc.
Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
We can leave the settings below "as is", with the exception of the hadoop.tmp.dir parameter, which we change to /app/hadoop/tmp for this tutorial. Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory for both the local file system and HDFS.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
etc/hadoop/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
etc/hadoop/mapred-site.xml (created from mapred-site.xml.template; see the copy step after the snippet):
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
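The Hadoop 2.2.0 tarball ships only the template file; a sketch of creating mapred-site.xml from it, assuming the layout found earlier:

$ cd /usr/local/hadoop/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml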
etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>