People coming from an RDBMS background will know the benefit of indexing: indexes allow faster access to rows in a table. If we want to use indexing in Hive, the first expectation might be that an indexed query should fetch records faster and should not launch a MapReduce job. In practice, a MapReduce job is still launched for a Hive query even when an index exists on the table. The job runs against the table that holds the index data to collect all the relevant offsets into the main table, and then uses those offsets to figure out which blocks of the main table to read. So MapReduce does not go away even when you run queries on indexed tables. The biggest advantage of an index is that it avoids a full table scan: only the HDFS blocks that are actually required get read.
The difference between compact and bitmap indexes (both available as of Hive 0.8) is how they store the mapping from values to the rows in which each value occurs: a compact index stores (value, block-id) pairs, while a bitmap index stores (value, list of rows as a bitmap) pairs.
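As an illustrative sketch (the index name here is hypothetical), a bitmap index is created the same way as a compact one, just with a different handler class; bitmap indexes are generally suited to columns with relatively few distinct values:

```sql
-- Hypothetical example: a bitmap index on the same column.
-- WITH DEFERRED REBUILD delays building the index data until
-- an explicit ALTER INDEX ... REBUILD is issued.
CREATE INDEX test_column_bitmap_index
ON TABLE test_table (test_column)
AS 'org.apache.hadoop.hive.ql.index.bitmap.BitmapIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX test_column_bitmap_index ON test_table REBUILD;
```

The rest of this post uses the compact index handler, which is the one shown in the commands below.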
For example, on a Hadoop cluster, if we run the following query without an index:
select * from test_table where test_column=02196666033;
It takes 71 seconds to run because it performs a complete table scan. Note that this table is stored in RCFile format and holds ~1 billion records. The cluster is an 8-node Hadoop cluster.
Now we can create the index with the following commands:
hive> create INDEX test_column_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  WITH DEFERRED REBUILD;
OK
Time taken: 3.134 seconds
hive> ALTER INDEX  test_column_index  ON test_table REBUILD; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 26
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208160845_0121, Tracking URL = http://test-cluster-1:50030/jobdetails.jsp?jobid=job_201208160845_0121
Kill Command = /home/hadoop/hadoop-0.20.2-hdh3u3/bin/../bin/hadoop job  -Dmapred.job.tracker=test-cluster-1:50300 -kill job_201208160845_0121
2012-08-20 12:43:32,481 Stage-1 map = 0%,  reduce = 0%
2012-08-20 12:43:36,503 Stage-1 map = 1%,  reduce = 0%
…………………………………………………………………………………………….
2012-08-20 12:44:34,797 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160845_0121
Loading data to table default.default__test_table_test_column_index__
Deleted hdfs://test-cluster-1:54310/user/hive/warehouse/default__test_table_test_column_index__
Table default.default__test_table_test_column_index__ stats: [num_partitions: 0, num_files: 26, num_rows: 0, total_size: 179836098]
OK
Time taken: 68.077 seconds
Note that this has now created an index table named “default__test_table_test_column_index__” under the default database.
hive> INSERT OVERWRITE DIRECTORY "/tmp/index_test_result" SELECT `_bucketname` , `_offsets` FROM default__test_table_test_column_index__ where  test_column=02196666033 ;
……………..
Moving data to: /tmp/index_test_result
1 Rows loaded to /tmp/index_test_result
OK
Time taken: 2.683 seconds
hive> SET hive.index.compact.file=/tmp/index_test_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
hive> SELECT * FROM  test_table WHERE test_column=02196666033;
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836580000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
………………………………
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
Time taken: 2.479 seconds
Now you can see the difference: the same query takes about 2.5 seconds instead of 71 seconds because it is using the index.
hive> SELECT count(*) FROM  test_table WHERE test_column=02196666033;
602
Time taken: 3.417 seconds
What has happened is that the index table stores the locations of the HDFS blocks where the data for this query resides, and when the query is fired Hive reads the actual data only from those blocks:
hive> describe  default__test_table_test_column_index__;
OK
test_column  string  from deserializer
_bucketname     string  from deserializer
_offsets        array<bigint>   from deserializer
hive> select * from default__test_table_test_column_index__ where test_column=02196666033;
02196666033     hdfs://test-cluster-1:54310/user/hive/warehouse/test_table/000033_0       [1653776,9918864,6611866,8265443,28615496,18337446,30330621,21765257,25190947,14908809,26903328,16623458,20052481,194,4958659,3305997,23478272,11571563,13224916]
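Finally, once you are done experimenting, the index can be dropped, which also removes the index table. A hedged sketch, assuming the names used above (and note that resetting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat is one way to return to normal scans; your installation's default input format may differ):

```sql
-- Dropping the index also drops the
-- default__test_table_test_column_index__ table.
DROP INDEX IF EXISTS test_column_index ON test_table;

-- Reset the input format so subsequent queries perform a normal scan
-- instead of consulting the compact index result file.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```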
