Showing posts from April, 2014

HBase – Overview of Architecture and Data Model

Introduction: HBase is a column-oriented database that’s an open-source implementation of Google’s Bigtable storage architecture. It can manage structured and semi-structured data and has built-in features such as scalability, versioning, compression and garbage collection. Since it uses write-ahead logging and distributed configuration, it can provide fault tolerance and quick recovery from individual server failures. HBase is built on top of Hadoop / HDFS, and the data stored in HBase can be manipulated using Hadoop’s MapReduce capabilities. Let’s now take a look at how HBase (a column-oriented database) differs from some other data structures and concepts that we are familiar with. Row-Oriented vs. Column-Oriented data stores: As shown below, in a row-oriented data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is stored together and hence quickly retrieved. Row-oriented data stores – Data is stor…
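The row-oriented versus column-oriented distinction in the excerpt above can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the record fields and the helper names `to_row_store` and `to_column_store` are made up for the example, and it does not reflect how HBase actually lays out data on disk.

```python
# Toy illustration of row-oriented vs. column-oriented layouts.
# The sample records and helper names are hypothetical; this is not HBase's storage format.
records = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 35},
]


def to_row_store(rows):
    """Row-oriented: all values of one record are kept together."""
    return [tuple(r.values()) for r in rows]


def to_column_store(rows):
    """Column-oriented: all values of one column are kept together."""
    columns = {}
    for r in rows:
        for key, value in r.items():
            columns.setdefault(key, []).append(value)
    return columns


row_store = to_row_store(records)
column_store = to_column_store(records)

# Reading one whole record is a single lookup in the row store.
second_record = row_store[1]

# Aggregating over one column only touches a single list in the column store.
average_age = sum(column_store["age"]) / len(column_store["age"])
```

In the row layout, fetching a whole record is one access, while computing the average age has to visit every record. In the column layout the trade-off is reversed.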

ElasticSearch 101 – a getting started tutorial

ElasticSearch is a highly scalable open source search engine with a REST API that is hard not to love. In this tutorial we'll look at some of the key concepts when getting started with ElasticSearch. Downloading and running ElasticSearch: ElasticSearch can be downloaded packaged in various formats such as ZIP and TAR.GZ. After downloading and extracting a package, running it couldn't be much easier, at least if you already have a Java runtime installed. Running ElasticSearch on Windows: To run ElasticSearch on Windows we run elasticsearch.bat, located in the bin folder, from a command window. This will start ElasticSearch running in the foreground in the console, meaning we'll see errors in the console and can shut it down using CTRL+C. If we don't have a Java runtime installed, or it isn't correctly configured, we won't see output like the one above but instead a message saying "JAVA_HOME environment variable must be set!". To fix…

How to run Hive queries through Hive Web Interface.

One of the things I really like about Hadoop and related projects is the web UI they provide. It makes our life a lot easier. Just point your web browser to the appropriate URL and quickly perform the desired action, be it browsing through HDFS files or glancing over HBase tables. Otherwise you need to go to the shell and issue the associated commands one by one for each action.

Hive is no exception and provides us with a web UI called the Hive Web Interface, or HWI for short. Somehow I feel it is less documented and talked about than the HDFS and HBase web UIs, but that doesn't make it any less useful. In fact, I personally find it quite helpful. With its help you can do various operations like browsing your DB schema, viewing your sessions, querying your tables, etc. You can also see system and user variables such as the Java runtime, your OS architecture, your PATH, etc.

OK, enough brand building. Let's get started and see how to use HWI. The process is quite …

Fun with HBase shell

The HBase shell is great, especially while getting yourself familiar with HBase. It provides lots of useful shell commands with which you can perform simple tasks like creating tables, putting some test data into them, scanning a whole table, fetching data from a specific row, etc. Executing help in the HBase shell will give you the list of all the HBase shell commands. If you need help on a specific command, type help "command". For example, help "get" will give you a detailed explanation of the get command.

But this post is not about any of that. We will try to do something fun here, something which is available but less known. So, get ready, start your HBase daemons, open the HBase shell and get your hands dirty.

For those of us who are unaware, HBase shell is based on JRuby, the Java Virtual Machine-based implementation of Ruby. More specifically, it uses the Interactive Ruby Shell (IRB), which is used to enter Ruby commands and get an immediate resp…


Monitoring plays an important role in running our systems smoothly. It is always better to diagnose problems and take measures as early as possible, rather than waiting for things to get worse.

Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. For detailed information on Nagios you can visit the official documentation page here. I'll just cover the steps to install and get Nagios working on your Ubuntu box.

First of all, install Nagios on your Ubuntu box using the following command:
$ sudo apt-get install -y nagios3

It will go through, and ask you about what mail server you want to use. You'll see something like this on your screen.

Pick one as per your requirements.

It will then ask you about the domain name you want to have email sent from. Again, fill that out based upon your needs.

It will ask you what password you want to use - put in a secure …

Flume-ng configuration with an HDFS sink

I’ve been playing around with flume-ng and its HDFS sink recently to try to understand how I can stream data into HDFS and work with it using Hadoop. The documentation for flume-ng is unfortunately lacking, so I’ve typed up some quick notes on how to configure and test the HDFS sink.

This document assumes that you have Hadoop installed and running locally, with flume-ng version 1.2.0 or above.

In this example, the name of our agent is just agent. First, let’s define a channel for agent named memory-channel of type memory.

# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory

Next, let’s configure a source for agent, called tail-source, which watches the system.log file. Let us also assign it to the memory-channel.

# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel

Now, configure t…


The first HBase sink was committed to the Flume 1.2.x trunk a few days ago. In this post we'll see how we can use this sink to collect data from a file stored in the local filesystem and dump this data into an HBase table. We need Flume built from the trunk in order to achieve that. If you haven't built it yet and are looking for some help, you can visit my other post that shows how to build and use Flume-NG at this link :

First of all, we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the HBase table. A simple configuration file might look like this:

hbase-agent.sources = tail
hbase-agent.sinks = sink1
hbase-agent.channels = ch1
hbase-agent.sources.tail.type = exec
hbase-agent.sources.tail.command = tail -F /home/mohammad/demo.txt
hbase-agent.sources.tail.channels = ch1
hbase-agent.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
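
The naming pattern in the lines above is `<agent>.<component type>.<name>.<property>`, with the source's `channels` property pointing at a channel declared on `hbase-agent.channels`. As a rough sketch of that wiring, the Python below parses such a properties file and reports any source that refers to an undeclared channel. The helper names `parse_properties` and `undeclared_channels` are invented for this illustration and are not part of Flume, and only the lines shown in the excerpt are checked.

```python
# Minimal sketch: check that every source's channels are declared on the agent.
# parse_properties and undeclared_channels are hypothetical helpers, not Flume APIs.

def parse_properties(text):
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent):
    """Return (source, channel) pairs whose channel is not declared on the agent."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for key, value in props.items():
        parts = key.split(".")
        # Matches keys of the form <agent>.sources.<name>.channels
        if (len(parts) == 4 and parts[0] == agent
                and parts[1] == "sources" and parts[3] == "channels"):
            for channel in value.split():
                if channel not in declared:
                    problems.append((parts[2], channel))
    return problems


CONFIG = """
hbase-agent.sources = tail
hbase-agent.sinks = sink1
hbase-agent.channels = ch1
hbase-agent.sources.tail.type = exec
hbase-agent.sources.tail.command = tail -F /home/mohammad/demo.txt
hbase-agent.sources.tail.channels = ch1
"""

props = parse_properties(CONFIG)
problems = undeclared_channels(props, "hbase-agent")
```

For the excerpt's configuration `problems` is empty, since the `tail` source refers only to `ch1`, which is declared.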

HDFSEventSink. process failed

Using Hadoop 2.2 as a sink in Flume 1.4

Google really screwed the pooch with their protobuf 2.5 release. Code generated with protobuf 2.5 is binary incompatible with older protobuf libraries (I guess Google missed the semantic versioning boat on this release). Unfortunately the current stable release of Flume 1.4 packages protobuf 2.4.1 and if you try and use HDFS on Hadoop 2.2 as a sink you’ll be smacked with the following exception:
java.lang.VerifyError: class$GetDelegationTokenRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(
    ...
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
    at org.apache.hadoop.ipc.RPC.getProtocolProxy(
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(
    at org…