Showing posts from July, 2013

Copying between two clusters that are running different versions of Hadoop

While copying between two clusters that are running different versions of Hadoop, it is
generally recommended to use HftpFileSystem as the source. HftpFileSystem is
a read-only filesystem. The distcp command has to be run from the destination server:

hadoop distcp hftp://namenodeA:port/data/weblogs hdfs://namenodeB/data/

In the preceding command, port is defined by the dfs.http.address property in the
hdfs-site.xml configuration file.

How to transfer data between different HDFS clusters

Overview DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. Usage Basic The most common invocation of DistCp is an inter-cluster copy: bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths. One can also specify multiple source directories on the command line: bash$ hadoop distcp hdfs://nn…

Hadoop Hive Sample Reports with iReport

These sample reports are available to help users who are new to Jaspersoft technologies to be able to connect to data in Hadoop. Loading Sample Data to Hadoop Hive
Copy the data file to HDFS filesystem, change the paths according to your filesystem.
hadoop fs -copyFromLocal $LOCAL_PATH/accounts.csv/user/hdfs Start hive as hdfs user
sudo -u hdfs hive Create the table accounts on Hive
CREATE TABLE accounts ( id STRING, date_entered STRING, date_modified STRING, modified_user_id STRING, assigned_user_id STRING, created_by STRING, name STRING, parent_id STRING, account_type STRING, industry STRING, annual_revenue STRING, phone_fax STRING, billing_address_street STRING, billing_address_city STRING, billing_address_state STRING, billing_address_postalcode STRING, billing_address_country STRING, description STRING, rating STRING, phone_office STRING, phone_alternate STRING, email1 STRING, email2 STRING, website STRING, ownership STRING, employees STRING, sic_code STRING, ticker_symbol STRING, …


Hadoop-LZO is a project to bring splittable LZO compression to Hadoop. LZO is an ideal compression format for Hadoop due to its combination of speed and compression size. However, LZO files are not natively splittable, meaning the parallelism that is the core of Hadoop is gone. This project re-enables that parallelism with LZO compressed files, and also comes with standard utilities (input/output streams, etc) for working with LZO files.

Hadoop and LZO, Together at Last LZO is a wonderful compression scheme to use with Hadoop because it's incredibly fast, and (with a bit of work) it's splittable. Gzip is decently fast, but cannot take advantage of Hadoop's natural map splits because it's impossible to start decompressing a gzip stream starting at a random offset in the file. LZO's block format makes it possible to start decompressing at certain specific offsets of the file -- those that start new LZO block boundaries. In addition to providing LZO…

Performance tuning of hive queries

Hive performance optimization is a larger topic on its own and is veryspecific to the queries you are using. Infact each query in a query file needs separate performance tuning to get the most robust results.
I'll try to list a few approaches in general used for performance optimization

Limit the data flow down the queries When you are on a hive query the volume of data that flows each level down is the factor that decides performance. So if you are executing a script that contains a sequence of hive QL, make sure that the datafiltration happens on the first few stages rather than bringing unwanted datato bottom. This will give you significant performance numbers as the queries down the lane will have very less data to crunch on. This is a common bottle neck when some existing SQLjobs are ported to hive, we just try to execute the same sequence of SQL steps in hive as well which becomes a bottle neck on the performance. Understand the requirement or the existing SQL script and design …

How to recover deleted files from hdfs/ Enable trash in hdfs

If you enable thrash in hdfs, when an rmr is issued the file will be still available in trash for some period. There by you can recover accidentally deleted ones. To enable hdfs thrash
set fs.trash.interval > 1

This specifies the time interval a file deleted would be available in trash. There is a property (fs.trash.checkpoint.interval) that specifies the checkpoint interval NN checks the trash dir at every intervals and deletes all files older than specified fs.trash.interval . ie say you have your fs.trash.interval as 60 mins and fs.trash.checkpoint.interval as 30 mins, then in every 30 mins a check is performed and deletes all files that are more than 60 mins old.

fs.trash.checkpoint.interval should be equal to or less than fs.trash.interval

The value of fs.trash.interval  is specified in minutes.

fs.trash.interval should be enabled in client node as well as Name Node. Name Node it should be present for check pointing purposes. Based the value in client node it is decided whether to …

Hive Hbase integration/ Hive HbaseHandler : Common issues and resolution

It is common that when we try out hive hbase integration it leads to lots of unexpected errors even though hive and hbase are running individually without any issues. Some common issues are 1) jars not available The following jars should be available on hive auxpath usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar/usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar/usr/lib/hive/lib/zookeeper-3.3.1.jarThese jars vary with the hbase , hive and zookeeper version running on your cluster
2)zookeeper quorum not available for hive client
Should specify the zoo keeper server host names so that the leader hbase master server can be chosen for the hbase connection


Where zk1,zk2,zk3 should be the actual hostnames of the ZooKeeper servers

These values can be set either on your hive session or permanently on your hive configuration files

1) Setting in hive client session
$ hive -auxpath /usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar:/usr/lib/hive/lib/hbase-0.90.4…

Query Hbase tables using hive/ Mount an Hbase table into hive

You can use Hbase as the data store for your hive table. On hive table creation we need to specify a mapping for the same. What all needs to be provided 1.The hbase table name 2.The mapping between hbase Column Family:Qualifier to hive Columns If you are mapping a hbase column Family itself to a hive column then the data type of that hive column has to be Map. Also in the DDL the table has to me specified as External
Example of mapping a hbase table (employee_hbase) to a hive table employee_hive
CREATE EXTERNAL TABLE employee_hive(key INT, value1 STRING,value2 Map<STRING,STRING>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val,cf2:") TBLPROPERTIES("" = "employee_hbase");

Open source solutions for processing big data

Following are some of the open source solutions for processing big data. Hadoop : Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes these sub-projects Hadoop ecosystem consists. HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Map Reduce – MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Pig – Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data se…

How to decommission nodes?

You want to take out some data nodes from your cluster, what is the graceful way to remove nodes without corrupting file system. On a large cluster removing one or two data-nodes will not lead to any data loss, because name-node will replicate their blocks as long as it will detect that the nodes are dead. With a large number of nodes getting removed or dying the probability of losing data is higher.

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude. This file should have been specified during namenode startup. It could be a zero length file. You must use the full hostname, ip or ip:port format in this file. Then the shell command

bin/hadoop dfsadmin -refreshNodes

should be called, which forces the name-node to re-read the exclude file and start the decommission process.

Decommission does …

How to run multiple hadoop data nodes on one machine

Although Hadoop is designed and developed for distributed computing it can be run on a single node in pseudo distributed mode and with multiple data node on single machine . Developers often run multiple data nodes on single node to develop and test distributed features,data node behavior, Name node interaction with data node and for other reasons.

If you want to feel Hadoop's distributed data node - name node working and you have only one machine then you can run multiple data nodes on single machine. You can see how Name node stores it's metadata , fsimage,edits , fstime and how data node stores data blocks on local file system.


To start multiple data nodes on a single node first download / build hadoop binary.
Download hadoop binary or build hadoop binary from hadoop source.
Prepare hadoop configuration to run on single node (Change Hadoop default tmp dir location from /tmp to some other reliable location)
Add following script to the $HADOOP_HOME/bin dire…

Hadoop Cookbook : How to interview for hadoop admin job?

These are few problems whose solution a good hadoop admin should know.
List  3 hadoop fs shell commands to perform copy operation fs -copyToLocalfs -copyFromLocalfs -putHow to decommission nodes from HDFS cluster?  - Remove list of nodes from slaves files and execute -refreshNodes.
How to add new nodes to the HDFS cluster ? - Add new node hostname to slaves file and start data node & task tracker on new node.
How to perform copy across multiple HDFS clusters. - Use distcp to copy files across multiple clusters.

How to verify if HDFS is corrupt?  Execute Hadoop fsck to check for missing blocks.
What are the default configuration files that are used in Hadoop  As of 0.20 release, Hadoop supported the following read-only default configurations - src/core/core-default.xml - src/hdfs/hdfs-default.xml - src/mapred/mapred-default.xml
 How will you make changes to the default configuration files  Hadoop does not recommends changing the default configuration files, instead it recommends makin…

Namenode editlog ( multiple storage directories

Namenode editlog ( multiple storage directories Recently there was a question posted on the Apache Hadoop User mailing list regarding the behavior of specifying multiple storage directories fir property "". The documentation on the official site is not clear on what happens if one of the specified storage directories become inaccessible. While the documentation for other properties such as "" and "mapred.local.dir" clearly states that the system will ignore any directory that is inaccessible in case some of the specified directories are unavailable. See link below for reference

There were many comments on the post. See below the link for the complete post

In this post I am basically going to summarize my tests to prove that it …

Hadoop Compression Analysis Experimental Results

1. Filesystem counters Filesystem counters are used to analysis experimental results. The following are the typical built-in filesystem counters. Local file system FILE_BYTES_READ FILE_BYTES_WRITTEN HDFS file system HDFS_BYTES_READ  HDFS_BYTES_WRITTEN FILE_BYTES_READ is the number of bytes read by local file system. Assume all the map input data comes from HDFS, then in map phase FILE_BYTES_READ should be zero. On the other hand, the input file of reducers are data on the reduce-side local disks which are fetched from map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by reducers.

FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers. All the mappers will spill intermediate output to disk. All the bytes that mappers write to disk will be included in FILE_BYTES_WRITTEN. The second part comes from reducers. In the shuffle phase, all the reducers will fetch intermediate data from mappers and merge and spill to reducer-side disks. Al…