Apache Drill - Query Hive
Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive queries are compiled into a MapReduce plan, which is then submitted to the Hadoop cluster for execution.
The Hive table definitions and their mappings to the data are stored in a metastore. The metastore consists of:
- the metastore service.
- a database.
The metastore service provides the interface to Hive, and the database stores the data definitions and the mappings to the data.
We can run Hive queries from Drill by configuring the Hive storage plugin in one of three ways.
- Connect Drill to the Hive embedded metastore. This is the default Hive configuration (as shipped with Apache Hive, without any configuration changes). With this configuration, the Hive driver, the metastore interface, and the database (Derby) all run in the same JVM. This is good only for development and is mainly used for unit tests.
Since only one process can connect to the metastore database at a time, this is not a practical setup, and it won't scale to a production environment.
Derby is the default database for the embedded metastore, and the configuration looks like this:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}
- Connect Drill to the Hive local metastore, which uses a JDBC-compliant external database (see the hive-site.xml sketch after this list).
- Connect Drill to the Hive remote metastore. In this configuration, we use a traditional standalone RDBMS server. The Hive driver and the metastore interface run in different JVMs (which can also run on different machines). This way the database can be firewalled from the Hive user, and the database credentials are completely isolated from the users of Hive.
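As a minimal sketch of the local-metastore case, the JDBC connection properties go into Hive's conf/hive-site.xml. The MySQL host, database name, and credentials below are assumptions for illustration only; substitute your own:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- hypothetical MySQL instance; any JDBC-compliant database works -->
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>

The matching JDBC driver jar must also be on Hive's classpath (typically dropped into Hive's lib directory).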
In this tutorial, we'll learn how to create a Hive table and insert data that we can query using Drill.
For more info about Hive and how to install it, visit Install Apache Hive on Ubuntu 16.04.
We need Hadoop and Hive. Here are the tutorials on how to install and configure them:
- Hadoop 2.6.5 - Installing on Ubuntu 16.04 (Single-Node Cluster)
- Apache Hive 2.1.0 install on Ubuntu 16.04
To access Hive tables using custom SerDes (Serializer/Deserializer) or InputFormat/OutputFormat, all nodes running Drillbits must have the corresponding JAR files in the <drill_installation_directory>/jars/3rdparty folder.
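For example, if a table uses a custom SerDe, copying its jar might look like this (the jar name and path are hypothetical):

root@laptop:~# cp /tmp/my-custom-serde.jar /usr/local/apache-drill-1.8.0/jars/3rdparty/

Restart the Drillbit(s) after adding jars so they are picked up.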
To view or change the Hive storage plugin configuration, we use the Drill Web Console.
Run Drill:
root@laptop:/usr/local/apache-drill-1.8.0# bin/drill-embedded
0: jdbc:drill:zk=local>
In the Web Console (localhost:8047), select the Storage tab, and then click the Update button for the hive storage plugin configuration:
The following is the default Hive storage plugin, and we'll use it as it is:
Click "Enable" and "Back", then we can see the HBase plugin has been enabled:
After configuring a Hive storage plugin, we can query Hive tables.
Create a Hive table by following the steps below:
- We need to download customers.csv
- First, we need to start a Hive shell:
hduser@laptop:/usr/local/apache-hive-2.1.0-bin/bin$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
...
hive>
- Let's create a table schema:
hive> create table customers(
    FirstName string, LastName string, Company string, Address string,
    City string, County string, State string, Zip string,
    Phone string, Fax string, Email string, Web string)
    row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 10.599 seconds
hive>
- Load the customer data into the customers table:
hive> load data local inpath '/tmp/customers.csv' overwrite into table customers;
Loading data to table default.customers
OK
Time taken: 5.913 seconds
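To double-check that the file landed in the Hive warehouse, we can list the table's directory. This assumes the default warehouse location (/user/hive/warehouse); adjust the path if hive.metastore.warehouse.dir points somewhere else:

hduser@laptop:~$ hdfs dfs -ls /user/hive/warehouse/customers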
- The describe command provides information about the schema of the table:
hive> describe default.customers;
OK
firstname               string
lastname                string
company                 string
address                 string
city                    string
county                  string
state                   string
zip                     string
phone                   string
fax                     string
email                   string
web                     string
Time taken: 3.161 seconds, Fetched: 12 row(s)
Note that we can use just the table name without the "default" prefix: describe customers;
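If more detail is needed (storage location, SerDe, input/output formats), Hive also supports an extended form of the same command:

hive> describe formatted customers;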
- To see some of the contents of the table:
hive> select FirstName, LastName from customers limit 5;
OK
FirstName  LastName
Essie      Vaill
Cruz       Roudabush
Billie     Tinnes
Zackary    Mockus
Time taken: 11.308 seconds, Fetched: 5 row(s)
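Notice that the first row returned is the CSV header line ("FirstName LastName"): load data copies the file as-is, so the header is stored as a regular row. If that is a concern, one way to avoid it (available since Hive 0.13, so it applies to our 2.1.0 install) is to declare the header in the table properties when creating the table:

hive> create table customers(
    FirstName string, LastName string, Company string, Address string,
    City string, County string, State string, Zip string,
    Phone string, Fax string, Email string, Web string)
    row format delimited fields terminated by ','
    stored as textfile
    tblproperties("skip.header.line.count"="1");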
- Exit to leave the Hive shell:
hive> exit;
- Start the Drill shell:
root@laptop:/usr/local/apache-drill-1.8.0# bin/drill-embedded
...
apache drill 1.8.0
...
0: jdbc:drill:zk=local>
- Let's see how to use a file system and a Hive schema to query a file and a table in Drill.
Issue the show databases or show schemas command to see a list of the available schemas that we can use. Both commands return the same results:

0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hbase               |
| hive.default        |
| sys                 |
+---------------------+
8 rows selected (0.596 seconds)

0: jdbc:drill:zk=local> show schemas;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hbase               |
| hive.default        |
| sys                 |
+---------------------+
8 rows selected (1.494 seconds)
0: jdbc:drill:zk=local>
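These schemas are also exposed through Drill's INFORMATION_SCHEMA tables, so querying SCHEMATA directly should return the same list; this can be handy when filtering or joining against the metadata:

0: jdbc:drill:zk=local> select SCHEMA_NAME from INFORMATION_SCHEMA.SCHEMATA;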
- Type the command use hive to make Drill use the Hive schema:
0: jdbc:drill:zk=local> use hive;
+-------+-----------------------------------+
|  ok   |              summary              |
+-------+-----------------------------------+
| true  | Default schema changed to [hive]  |
+-------+-----------------------------------+
1 row selected (0.421 seconds)
0: jdbc:drill:zk=local>
- Type the command show tables to see the tables that exist within the schema:
0: jdbc:drill:zk=local> show tables;
+---------------+-------------+
| TABLE_SCHEMA  | TABLE_NAME  |
+---------------+-------------+
+---------------+-------------+
No rows selected (1.058 seconds)
0: jdbc:drill:zk=local>
Here, we're supposed to see the "customers" table, but there are no tables in "hive": Drill cannot reach the Hive metastore.
- Let's fix the metastore issue.
Add the following property to the conf/hive-site.xml file:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
Then, start the Hive Metastore Server using the hive --service metastore command:

hduser@laptop:/usr/local/apache-hive-2.1.0-bin$ hive --service metastore
Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
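To confirm that the metastore service is actually listening on the Thrift port we configured (9083), a quick check from another terminal might look like this:

hduser@laptop:~$ netstat -tln | grep 9083

If a LISTEN entry shows up for port 9083, the metastore is reachable.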
Note that we are running the Hive metastore as a separate service (outside the Hive shell).
While the Hive Metastore Server is running, let's configure Drill for Hive (http://localhost:8047/storage/hive). Note that fs.default.name must match our HDFS NameNode address (54310, as the getconf output below confirms):

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:54310/",
    "hive.metastore.sasl.enabled": "false"
  }
}
Click "Update".
Here, we can find the fs.default.name using either of the two commands shown below:

hduser@laptop:~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:54310
hduser@laptop:~$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:54310
Then, we can go back to our Drill shell and check that the hive schema is there, via either the show databases or the show schemas command:
0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hbase               |
| hive.default        |
| sys                 |
+---------------------+
- Now, it should work. Type the command show tables to see the tables that exist within the schema:
0: jdbc:drill:zk=local> use hive;
+-------+-----------------------------------+
|  ok   |              summary              |
+-------+-----------------------------------+
| true  | Default schema changed to [hive]  |
+-------+-----------------------------------+
1 row selected (0.44 seconds)

0: jdbc:drill:zk=local> show tables;
+---------------+-------------+
| TABLE_SCHEMA  | TABLE_NAME  |
+---------------+-------------+
| hive.default  | customers   |
+---------------+-------------+
1 row selected (5.71 seconds)
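As a side note, the use hive step is optional: we can also reference the table by its fully qualified name from any schema. Since default is a reserved word, it has to be enclosed in back-ticks:

0: jdbc:drill:zk=local> select FirstName, LastName from hive.`default`.customers limit 5;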
- Finally, we're ready to make a query with Drill:
0: jdbc:drill:zk=local> select FirstName, LastName from customers limit 5;
+------------+------------+
| FirstName  |  LastName  |
+------------+------------+
| FirstName  | LastName   |
| Essie      | Vaill      |
| Cruz       | Roudabush  |
| Billie     | Tinnes     |
| Zackary    | Mockus     |
+------------+------------+
5 rows selected (12.304 seconds)
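As in the Hive shell earlier, the first data row is the CSV header that was loaded along with the data (see the skip.header.line.count note above); it would also show up as a spurious group in aggregates. From here, any SQL that Drill supports can be run against the Hive table; for instance, a simple aggregate (the actual counts depend on your copy of customers.csv):

0: jdbc:drill:zk=local> select State, count(*) as customer_count from hive.`default`.customers group by State order by customer_count desc limit 5;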