Showing posts from September, 2013

Hadoop Versions

Let's take a moment to explore the different versions of Hadoop. The following are (or were) the best-known versions:
0.21 - Aug 2010 -- A widely used release. This eventually became the Hadoop 1.0 release.
0.23 - Feb 2012 -- Was a branch created to add new features. This branch eventually became Hadoop 2.0
1.0 -- Current production version of Hadoop
2.0 -- Current development version of Hadoop

19.1. Hadoop version 1.0

This is currently the production version of Hadoop. It has been in wide use for a while and has been proven in the field. The following distributions are based on Hadoop 1.0:
Cloudera's CDH 3 (Cloudera's Distribution of Hadoop) series
HortonWorks's HDP 1 (HortonWorks Data Platform) series

19.2. Hadoop version 2.0

This is a development branch of Hadoop. Hadoop 2 has significant new enhancements. It has been under development for a while…

Hive Indexing

People coming from an RDBMS background will know the benefit of indexing: indexes give faster access to rows in a table. When applying indexing in Hive, the first expectation might be that fetching records takes less time and that no MapReduce job is launched. In practice, a MapReduce job is still launched for a Hive query even when an index has been created on the Hive table. The job runs on the table that holds the index data to get all the relevant offsets into the main table, and then uses those offsets to figure out which blocks to read from the main table. So you will not see MapReduce go away even when you run queries on tables that have indexes. The biggest advantage of an index is that the query does not require a full table scan; it reads only the HDFS blocks it needs. The difference between compact and bitmap indexes (Hive 0.8) is how they store the mapping from values to the rows in which …
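
The two-step lookup described above (consult the index table for offsets, then read only the matching blocks) can be pictured with a toy model in Python. This is only a sketch and not how Hive implements indexes: the block size, the row layout, and the function names (`split_into_blocks`, `build_compact_index`, `query_with_index`, `query_full_scan`) are all assumptions made up for illustration, and "blocks" here are just small lists of rows rather than HDFS blocks.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # rows per simulated block (assumption; real HDFS blocks are sized in bytes)


def split_into_blocks(rows, block_size=BLOCK_SIZE):
    """Return {block_offset: rows_in_block}; the offset is the position of the block's first row."""
    return {i: rows[i:i + block_size] for i in range(0, len(rows), block_size)}


def build_compact_index(blocks, column):
    """Toy 'index table': map each column value to the sorted block offsets that contain it."""
    index = defaultdict(set)
    for offset, block in blocks.items():
        for row in block:
            index[row[column]].add(offset)
    return {value: sorted(offsets) for value, offsets in index.items()}


def query_with_index(blocks, index, column, value):
    """Step 1: look up offsets in the index. Step 2: read only those blocks."""
    offsets = index.get(value, [])
    matches = [row for off in offsets for row in blocks[off] if row[column] == value]
    return matches, len(offsets)


def query_full_scan(blocks, column, value):
    """Baseline without an index: every block is read."""
    matches = [row for block in blocks.values() for row in block if row[column] == value]
    return matches, len(blocks)


# Hypothetical table of 12 rows spread over 3 blocks.
rows = [{"id": i, "city": "Pune" if i in (1, 2, 9) else "Delhi"} for i in range(12)]
blocks = split_into_blocks(rows)
index = build_compact_index(blocks, "city")

indexed_rows, blocks_read = query_with_index(blocks, index, "city", "Pune")
scanned_rows, blocks_scanned = query_full_scan(blocks, "city", "Pune")
print(blocks_read, "blocks read with the index vs", blocks_scanned, "with a full scan")
```

Both queries return the same rows; the indexed one simply skips the block that holds no matching value. Note that the index lookup is itself a pass over data, which mirrors why a job is still launched in Hive even when an index exists.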