Apache Hadoop HBase : Map, Persistent, Sparse, Sorted, Distributed and Multidimensional
HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have wildly varying columns, if the user likes.
HBase is an open source, non-relational, distributed database modeled after Google's BigTable. Here is the first sentence of the "Data Model" section of the BigTable paper:
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map"
So, the best way to get to know HBase is to understand the six concepts named in that sentence.
Map is the first concept that characterizes HBase, and one of the easiest. Simply put, HBase/BigTable is, at its core, a map (a key-value store), like this:
{ "xyz" : "Kingdom", "zzz" : "lucky", "aab" : "hello", "1" : "ooo", "aaa" : "UK" }
Understanding the meaning of persistence is important for evaluating different data store systems. Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended.
In HBase/BigTable, the key/value pairs are kept in strictly sorted order. In other words, the row for the key "aaa" should be right next to the row with key "aab" and very far from the row with key "zzz".
The sorted version looks like the following in JSON format:
{ "1" : "ooo", "aaa" : "UK", "aab" : "hello", "xyz" : "Kingdom", "zzz" : "lucky" }
This sorting feature is very important because these systems tend to be huge and distributed. Sorting ensures that when you must scan the table, the items that are closely related are near each other.
Note that the term "sorted" when applied to HBase/BigTable does not mean that "values" are sorted. The sorted ones are the keys.
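To make the benefit of sorted keys concrete, here is a minimal Python sketch. It is a plain in-memory model, not the HBase client API, and the helper name `scan_range` is hypothetical. It assumes ordinary string comparison as a stand-in for the way HBase orders row keys.

```python
from bisect import bisect_left

# Toy in-memory model of a sorted key/value store; not the HBase client API.
table = {"xyz": "Kingdom", "zzz": "lucky", "aab": "hello", "1": "ooo", "aaa": "UK"}

# Only the keys are sorted; the values play no part in the ordering.
sorted_keys = sorted(table)


def scan_range(start, stop):
    """Return (key, value) pairs with start <= key < stop, in key order."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    # Neighbouring keys sit next to each other, so a range is one contiguous slice.
    return [(key, table[key]) for key in sorted_keys[lo:hi]]


print(sorted_keys)               # ['1', 'aaa', 'aab', 'xyz', 'zzz']
print(scan_range("aaa", "aac"))  # [('aaa', 'UK'), ('aab', 'hello')]
```

Because related keys are adjacent, the range lookup touches only the slice it needs instead of the whole table.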
Now we need to introduce the concept of "columns". We may think of a "table" as a multidimensional map. In other words, a multidimensional map is a map of maps. To make it multidimensional, we can add another dimension to our JSON example like this:
{
  "1" : { "A" : "ooo", "B" : "uuu" },
  "aaa" : { "A" : "UK", "B" : "Britain" },
  "aab" : { "A" : "hello", "B" : "world" },
  "xyz" : { "A" : "Kingdom", "B" : "Dynasty" },
  "zzz" : { "A" : "lucky", "B" : "charm" }
}
Note that each key points to a map with exactly two keys: "A" and "B". From here on, we'll refer to each top-level key/map pair (keyed by "1", "aaa", "aab", "xyz", "zzz") as a row. Also, in BigTable/HBase nomenclature, the "A" and "B" mappings would be called Column Families.
The same data can be loaded through the HBase shell. Note that this session creates a single column family, 'cf', and stores "A" and "B" as columns inside it ('cf:A' and 'cf:B'):
$ hbase shell
hbase(main):001:0> create 'test','cf'
=> Hbase::Table - test
hbase(main):002:0> list 'test'
TABLE
test
=> ["test"]
hbase(main):003:0> put 'test', '1', 'cf:A', 'ooo'
hbase(main):004:0> put 'test', '1', 'cf:B', 'uuu'
hbase(main):005:0> put 'test', 'aaa', 'cf:A', 'UK'
hbase(main):006:0> put 'test', 'aaa', 'cf:B', 'Britain'
hbase(main):007:0> put 'test', 'aab', 'cf:A', 'hello'
hbase(main):008:0> put 'test', 'aab', 'cf:B', 'world'
hbase(main):009:0> put 'test', 'xyz', 'cf:A', 'Kingdom'
hbase(main):010:0> put 'test', 'xyz', 'cf:B', 'Dynasty'
hbase(main):011:0> put 'test', 'zzz', 'cf:A', 'lucky'
hbase(main):012:0> put 'test', 'zzz', 'cf:B', 'charm'
We can get data from HBase using scan. We can limit our scan, but for now, all data is fetched:
hbase(main):013:0> scan 'test'
ROW     COLUMN+CELL
 1      column=cf:A, timestamp=1431909058898, value=ooo
 1      column=cf:B, timestamp=1431909081541, value=uuu
 aaa    column=cf:A, timestamp=1431909109532, value=UK
 aaa    column=cf:B, timestamp=1431909155066, value=Britain
 aab    column=cf:A, timestamp=1431909196407, value=hello
 aab    column=cf:B, timestamp=1431909221376, value=world
 xyz    column=cf:A, timestamp=1431909244476, value=Kingdom
 xyz    column=cf:B, timestamp=1431909274177, value=Dynasty
 zzz    column=cf:A, timestamp=1431909317403, value=lucky
 zzz    column=cf:B, timestamp=1431909337133, value=charm
5 row(s) in 0.0310 seconds
A table's column families are specified when the table is created, and are almost immutable. It can also be expensive to add new column families, so it's a good idea to specify all the ones we'll need up front.
However, a column family may have any number of columns, denoted by a column "qualifier" or "label". Here's a subset of our JSON example again, this time with the column qualifier dimension built in:
{
  ...
  "aaa" : {
    "A" : { "foo" : "UK", "bar" : "uk" },
    "B" : { "" : "Britain" }
  },
  "aab" : {
    "A" : { "foo" : "world", "bar" : "hollo" },
    "B" : { "" : "world" }
  },
  ...
}
Notice that in the two rows shown above, the "A" column family has two columns: "foo" and "bar" while the "B" column family has just one column whose qualifier is the empty string ("").
When asking HBase for data, we must provide the full column name in the form "<family>:<qualifier>". So for example, both rows in the above example have three columns: "A:foo", "A:bar" and "B:".
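The "<family>:<qualifier>" naming can be illustrated with a small Python sketch over nested dictionaries. This is a toy model of the two rows above, not the HBase client API, and the helper name `get_cell` is hypothetical.

```python
# Toy model of the two rows above: row key -> column family -> qualifier -> value.
table = {
    "aaa": {"A": {"foo": "UK", "bar": "uk"}, "B": {"": "Britain"}},
    "aab": {"A": {"foo": "world", "bar": "hollo"}, "B": {"": "world"}},
}


def get_cell(row_key, column):
    """Look up a cell by row key and full "<family>:<qualifier>" column name."""
    # Split on the first colon only; the qualifier may be empty, as in "B:".
    family, _, qualifier = column.partition(":")
    return table.get(row_key, {}).get(family, {}).get(qualifier)


print(get_cell("aaa", "A:foo"))  # UK
print(get_cell("aaa", "B:"))     # Britain
print(get_cell("aab", "A:baz"))  # None (no such column in this row)
```

Each lookup walks one level of the map per dimension: row key, then family, then qualifier.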
Note that although the column families are static, the columns themselves are not. A row can hold columns that no other row has:
{ ... "zzz" : { "A" : { "cookies" : "lucky" } } }
Here, the "zzz" row has exactly one column, "A:cookies". Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows. To get that information, we need to do a full table scan. We can, however, query for a list of all column families, since these are fixed in the table's definition.
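The cost of listing every column can be seen in a short Python sketch. It uses a toy nested-dict model rather than the HBase client API, and the function name `all_columns` is hypothetical. The point is that the only way to collect the column names is to visit every row.

```python
# Toy model: row key -> column family -> qualifier -> value.
table = {
    "aaa": {"A": {"foo": "UK", "bar": "uk"}, "B": {"": "Britain"}},
    "aab": {"A": {"foo": "world", "bar": "hollo"}, "B": {"": "world"}},
    "zzz": {"A": {"cookies": "lucky"}},
}


def all_columns(rows):
    """Collect every "<family>:<qualifier>" name by visiting every row."""
    columns = set()
    for families in rows.values():  # one pass over all rows: a full table scan
        for family, qualifiers in families.items():
            for qualifier in qualifiers:
                columns.add(f"{family}:{qualifier}")
    return sorted(columns)


print(all_columns(table))  # ['A:bar', 'A:cookies', 'A:foo', 'B:']
```

The column "A:cookies" shows up only because the scan reached the "zzz" row; nothing in the other rows hints at its existence.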
Now, let's talk about time, which is another dimension. All data is versioned, either with an integer timestamp (by default, milliseconds since the epoch, as in the scan output above) or with another integer of our choice. The client may specify the timestamp when inserting data.
The following example uses arbitrary integer timestamps:
{
  ...
  "aaa" : {
    "A" : {
      "foo" : { 21 : "UK", 3 : "United Kingdom" },
      "bar" : { 9 : "uk" }
    },
    "B" : {
      "" : { 8 : "b", 7 : "b", 1 : "c" }
    }
  },
  ...
}
Each column family may have its own rules regarding how many versions of a given cell to keep (a cell is identified by its rowkey/column pair). In most cases, applications simply ask for a given cell's data without specifying a timestamp. In that common case, HBase returns the most recent version (the one with the highest timestamp), since it stores versions in reverse chronological order.
If an application asks for a given row at a given timestamp, HBase will return the most recent cell data whose timestamp is less than or equal to the one provided. Using our imaginary HBase table, querying for the row/column of "aaa"/"A:foo" will return "UK", while querying for the row/column/timestamp of "aaa"/"A:foo"/15 will return "United Kingdom". Querying for a row/column/timestamp of "aaa"/"A:foo"/1 will return a null result.
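The version-lookup rule described above can be sketched in a few lines of Python. This is a toy model of the imaginary table, not the HBase client API; the name `get_version` and the flat `(row, column)` keying are assumptions made for illustration.

```python
# Toy model of versioned cells: (row key, column) -> {timestamp: value}.
versions = {
    ("aaa", "A:foo"): {21: "UK", 3: "United Kingdom"},
    ("aaa", "A:bar"): {9: "uk"},
    ("aaa", "B:"): {8: "b", 7: "b", 1: "c"},
}


def get_version(row_key, column, timestamp=None):
    """Return the newest value whose timestamp is <= the requested one.

    With no timestamp, return the most recent version. Return None when
    the cell is missing or has no version old enough.
    """
    cell = versions.get((row_key, column), {})
    # Walk versions newest first, mirroring reverse chronological storage.
    for ts in sorted(cell, reverse=True):
        if timestamp is None or ts <= timestamp:
            return cell[ts]
    return None


print(get_version("aaa", "A:foo"))      # UK
print(get_version("aaa", "A:foo", 15))  # United Kingdom
print(get_version("aaa", "A:foo", 1))   # None
```

The three calls reproduce the three queries from the paragraph above: latest version, version as of timestamp 15, and a timestamp older than any stored version.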
Sparse! A given row can have any number of columns in each column family, or none at all.
HBase is built upon Hadoop's Distributed File System (HDFS) so that the underlying file storage can be spread among an array of independent machines. Data is replicated across a number of participating nodes in an analogous manner to how data is striped across discs in a RAID system.
I recommend the following video to anyone who is addicted to SQL.
I believe the title should be "Treating HBase like a relational database will lead to abject failure!"
Here is the summary:
- Architecting for an RDBMS is about relationships or normalizing data.
- Architecting for HBase is about access patterns or denormalizing data.
Questions to ask:
- How is data being accessed?
- What is the fastest way to read/write data?
- What is the optimal way to organize data?
Best practices of HBase schema design:
- Fewer, bigger (denormalized) tables.
- Spend more time on design up front.
- Use bulk loading for incremental or time series data.