Apache Hadoop: Hive 2.1.0 install on Ubuntu 16.04
Apache Hive is considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop.
Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop's "schema on read" architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data, structured and unstructured, from a multitude of sources.
Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.
- from http://hortonworks.com/apache/hive/
Hive also provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers.
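For example, a custom mapper written as an external script can be streamed into a query with HiveQL's TRANSFORM clause. This is a minimal sketch; the table docs and the script my_mapper.py are hypothetical:

-- make the hypothetical script available to the job
ADD FILE /home/hduser/my_mapper.py;

-- stream each row through the script, which is expected to emit
-- tab-separated (word, count) pairs on stdout
SELECT TRANSFORM (line)
  USING 'python my_mapper.py'
  AS (word STRING, count INT)
FROM docs;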
We can download Hive from https://hive.apache.org/downloads.html:
hduser@laptop:/tmp$ wget http://www-us.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
hduser@laptop:/tmp$ sudo tar xvzf apache-hive-2.1.0-bin.tar.gz -C /usr/local
Open ~/.bashrc and set the HIVE_HOME environment variable to point to the installation directory, then add it to PATH:
export HIVE_HOME=/usr/local/apache-hive-2.1.0-bin
export HIVE_CONF_DIR=/usr/local/apache-hive-2.1.0-bin/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.1.0-bin/lib/*:.
Activate the new setting for Hive:
hduser@laptop:/usr/local/apache-hive-2.1.0-bin$ source ~/.bashrc
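As a quick sanity check (not part of the original steps), we can confirm the hive binary is now on the PATH; the output below simply follows from the exports above:

hduser@laptop:~$ which hive
/usr/local/apache-hive-2.1.0-bin/bin/hive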
Hive uses Hadoop, so we must have Hadoop in our path:
$ echo $HADOOP_HOME
/usr/local/hadoop
In addition, we must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) before we can create a table in Hive:
hduser@laptop:~$ hdfs dfs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-11-23 11:17 /hbase
drwx------   - hduser supergroup          0 2016-11-18 16:04 /tmp
drwxr-xr-x   - hduser supergroup          0 2016-11-18 09:13 /user
hduser@laptop:~$ hdfs dfs -mkdir -p /user/hive/warehouse
hduser@laptop:~$ hdfs dfs -chmod g+w /tmp
hduser@laptop:~$ hdfs dfs -chmod g+w /user/hive/warehouse
hduser@laptop:~$ hdfs dfs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-11-23 11:17 /hbase
drwx-w----   - hduser supergroup          0 2016-11-18 16:04 /tmp
drwxr-xr-x   - hduser supergroup          0 2016-11-23 17:18 /user
hduser@laptop:~$ hdfs dfs -ls /user
drwxr-xr-x   - hduser supergroup          0 2016-11-18 23:17 /user/hduser
drwxr-xr-x   - hduser supergroup          0 2016-11-23 17:18 /user/hive
The warehouse directory is where Hive stores its tables and related data, and the temporary directory tmp is where Hive stores the intermediate results of processing.
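The warehouse path itself is governed by the hive.metastore.warehouse.dir property in hive-site.xml. The sketch below shows the default value, so nothing needs to change for this install; it is only worth adding if you want a different location:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>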
To configure Hive with Hadoop, we need to edit the hive-env.sh file, which is located in the $HIVE_HOME/conf directory. The following commands change to the Hive conf directory and copy the template file:
hduser@laptop:~$ cd $HIVE_HOME/conf
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/conf$ sudo cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
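hive-env.sh.template also carries a few optional settings that can be uncommented in the same file if needed; none are required for this install, and the heap value below is just the template's illustrative default:

# Optional (from the template): heap size used by the Hive client, in MB
export HADOOP_HEAPSIZE=1024
# Optional (from the template): where the Hive configuration files live
export HIVE_CONF_DIR=/usr/local/apache-hive-2.1.0-bin/conf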
The Hive installation is now complete. Next, we need a database to back the Metastore; we use the Apache Derby database.
The following command is used to download Apache Derby:
$ cd /tmp
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.13.1.1/db-derby-10.13.1.1-bin.tar.gz
$ sudo tar xvzf db-derby-10.13.1.1-bin.tar.gz -C /usr/local
Let's set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
We need to create a directory named data in the $DERBY_HOME directory to store Metastore data.
$ sudo mkdir $DERBY_HOME/data
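To verify the Derby install, we can run the sysinfo utility that ships in $DERBY_HOME/bin once the new environment has been sourced; it prints Java and Derby version details if the install is healthy:

$ source ~/.bashrc
$ sysinfo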
Derby installation and environment setup are now complete.
Configuring the Metastore means telling Hive where the database is stored. We do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory.
Let's copy the template file using the following command:
hduser@laptop:~$ cd $HIVE_HOME/conf
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/conf$ sudo cp hive-default.xml.template hive-site.xml
Make sure the following lines are between the <configuration> and </configuration> tags of hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
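With this embedded-style URL, Hive uses Derby's embedded JDBC driver. If we want to be explicit, the driver class can also be pinned in the same file; this is an optional sketch, not a required step:

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>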
Create a file named jpox.properties and add the following lines to it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = true
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
We need to set permissions on the Hive folder:
hduser@laptop:/usr/local$ sudo chown -R hduser:hadoop apache-hive-2.1.0-bin
Starting with Hive 2.1, we need to run the schematool command below as an initialization step. In our case, we use derby as the db type:
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/bin$ schematool -dbType derby -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :    org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:       APP
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.derby.sql
Initialization script completed
schemaTool completed
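As a quick follow-up check, schematool can also report the schema version it just created, without modifying anything:

hduser@laptop:/usr/local/apache-hive-2.1.0-bin/bin$ schematool -dbType derby -info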
To use the Hive command line interface (CLI) from the shell, issue the bin/hive command to verify the Hive installation:
hduser@laptop:~$ echo $HIVE_HOME
/usr/local/apache-hive-2.1.0-bin
hduser@laptop:~$ $HIVE_HOME/bin/hive
We may get a couple of errors when we try to start Hive via the bin/hive command. The following are the errors and their corresponding fixes:
- Error #1:
Exception in thread "main" java.lang.RuntimeException: Couldn't create directory ${system:java.io.tmpdir}/${hive.session.id}_resources
Fix #1: edit hive-site.xml:

<property>
  <name>hive.downloaded.resources.dir</name>
  <!-- <value>${system:java.io.tmpdir}/${hive.session.id}_resources</value> -->
  <value>/home/hduser/hive/tmp/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
- Error #2:
java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
Fix #2: replace ${system:java.io.tmpdir}/${system:user.name} with /tmp/mydir in hive-site.xml (see Confluence - AdminManual Configuration):

<property>
  <name>hive.exec.local.scratchdir</name>
  <!-- <value>${system:java.io.tmpdir}/${system:user.name}</value> -->
  <value>/tmp/mydir</value>
  <description>Local scratch space for Hive jobs</description>
</property>
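Other defaults in hive-site.xml use the same ${system:...} variables and can raise the same URISyntaxException; hive.querylog.location is a common one. If it complains too, the same style of fix applies (the replacement path below is illustrative):

<property>
  <name>hive.querylog.location</name>
  <!-- <value>${system:java.io.tmpdir}/${system:user.name}</value> -->
  <value>/tmp/mydir</value>
  <description>Location of Hive run time structured log file</description>
</property>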
Now that we've fixed the errors, let's start the Hive CLI:
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/bin$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
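To act on that deprecation warning, the execution engine can be switched via hive.execution.engine in hive-site.xml, but only if Tez (or Spark) is actually installed and configured on the cluster; this is a sketch, not something required for this tutorial:

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
  <description>Execution engine: mr (deprecated in Hive 2), tez, or spark</description>
</property>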
To display all the tables:
hive> show tables;
OK
Time taken: 4.603 seconds
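To confirm the Metastore and the warehouse directory work end to end, we can create and drop a throwaway table (the table name test is arbitrary):

hive> CREATE TABLE test (id INT, name STRING);
hive> SHOW TABLES;
hive> DESCRIBE test;
hive> DROP TABLE test;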
We can exit the Hive shell by using the exit command:
hive> exit;
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/bin$