Posts

Showing posts from 2013

Apache Oozie 3.3.1 installation on Apache Hadoop 0.23.0

Hi
I have been trying to install Apache Oozie 3.3.1 on Hadoop 0.23.0 for the last few days.
The documentation on the Apache website is not very clear, and very little of it covers Hadoop with the new (MRv2 / YARN) architecture. So I hope this blog helps to some extent with configuring Oozie 3.3.1 on Hadoop 0.23.0.

Here we go,

Link to the Apache Oozie quick start: http://oozie.apache.org/docs/3.3.1/DG_QuickStart.html
and to Apache Hadoop: http://hadoop.apache.org/docs/r0.23.0
My testing environment:
- 4-node cluster (1 master, 3 slaves)
- Apache Hadoop 0.23.0
- Apache Oozie 3.3.1
- Java 1.6.0_26
- Maven 3.0.4
Oozie server installation
Download oozie-3.3.1.tar.gz from the nearest mirror site under apache/oozie/3.3.1 (I downloaded from the nus.edu.sg mirror).
Unpack the oozie-3.3.1.tar.gz file under a directory such as /home/srikanth.
The following two properties are required in Hadoop core-site.xml:
<!-- OOZIE --> <property> <name>hadoop.proxyuser.[OOZIE_SERVER_USER].hosts</name> &l…
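The snippet above is cut off in this excerpt. For reference, a typical proxyuser block in core-site.xml looks roughly like the sketch below; the user name oozie and the host/group values are placeholders, so substitute the user and host your Oozie server actually runs as:

<!-- OOZIE -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-server-host</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value>
</property>

Restart the Hadoop daemons after editing core-site.xml so that the proxyuser settings take effect.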

50 Top Open Source Tools for Big Data

Big Data Analysis Platforms and Tools

1. Hadoop

You simply can't talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that often the terms "Hadoop" and "big data" are used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.

2. MapReduce

Originally developed by Google, MapReduce is described on its website as "a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes." It's used by Hadoop, as well as many other data processing applications. Operating System: OS Independent.

3. GridGain

GridGain offers an alternative to Hadoop's MapReduce that is compatible with the…

Hadoop Versions

Let's take a moment to explore the different versions of Hadoop. The following are/were the best-known versions:
- 0.21 - Aug 2010 -- A widely used release. This eventually became the Hadoop 1.0 release.
- 0.23 - Feb 2012 -- A branch created to add new features. This branch eventually became Hadoop 2.0.
- 1.0 -- Current production version of Hadoop.
- 2.0 -- Current development version of Hadoop.

19.1. Hadoop version 1.0
This is currently the production version of Hadoop. It has been in wide use for a while and has been proven in the field. The following distributions are based on Hadoop 1.0:
- Cloudera's CDH 3 (Cloudera's Distribution of Hadoop) series
- HortonWorks's HDP 1 (HortonWorks Data Platform) series
19.2. Hadoop version 2.0
This is a development branch of Hadoop. Hadoop 2 has significant new enhancements. It has been under development for a while…

Hive Indexing

People coming from an RDBMS background might know the benefit of indexing. Indexes are useful for faster access to rows in a table. If we apply indexing in Hive, the first expectation might be that with an index it should take less time to fetch records and it should not launch a MapReduce job. In practice, a MapReduce job is still launched for a Hive query even when an index has been created on the Hive table. The MapReduce job runs on the table that holds the index data to get all the relevant offsets into the main table, and then uses those offsets to figure out which blocks to read from the main table. So you will not see MapReduce go away even when you are running queries on tables with indexes on them. The biggest advantage of having an index is that it does not require a full table scan; it reads only the HDFS blocks required. The difference between compact and bitmap indexes (Hive 0.8) is how they store the mapping from values to the rows in which …
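As a quick illustration (not from the original post), this is roughly how a compact index is created and built in HiveQL; the table and column names are made up:

CREATE INDEX sales_customer_idx
ON TABLE sales (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX sales_customer_idx ON sales REBUILD;

The REBUILD step is the MapReduce job that populates the index table; queries filtering on customer_id can then use that index table to narrow down which blocks of sales to read.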

MySQL Applier for Hadoop

The MySQL Applier for Hadoop enables real-time replication of events from MySQL to Hive / HDFS. This video tutorial demonstrates how to install, configure, and use the Hadoop Applier.

Video Tutorial : http://www.youtube.com/watch?v=mZRAtCu3M1g&feature=youtu.be

CENTOS 6 - XEN installation

Install Xen 4 with Libvirt / XL on CentOS 6 (2013)
Update: Xen is now part of CentOS 6, as part of the Xen4CentOS6 project.
It can be installed on your CentOS 6 machine via running the following commands:
yum install centos-release-xen && yum install xen libvirt python-virtinst libvirt-daemon-xen
sh /usr/bin/grub-bootxen.sh
reboot
The above commands will install the official Xen 4 packages along with the libvirt toolstack, load the correct kernel into your GRUB boot-loader, and reboot into your Xen kernel.
Once your system boots, ensure that you are running the Xen 4 Kernel via:
uname -r
Now that Xen 4 has been installed, you can skip to section 6 at the bottom of this guide for installing your first Virtual Machine (VM) on CentOS Xen.

This article will guide you through the successful installation of the latest Xen on CentOS 6.x.
First things first, update your CentOS install via the following command:
yum -y update
1. Disable SELinux
SELinux can really interfere with X…
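The excerpt is cut off here. For what it's worth, disabling SELinux on CentOS 6 usually comes down to the two commands below (the sed edit assumes the default /etc/selinux/config with SELINUX=enforcing):

setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

setenforce 0 switches the running system to permissive mode; the config edit keeps SELinux disabled across reboots.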

How Scaling Really Works in Apache HBase

This post was originally published via blogs.apache.org; we republish it here in a slightly modified form for your convenience:
At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.
Regions and Region Servers HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS, and it can handle petabytes of data. One of the interesting capabilities in HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large.
The basic unit of horizontal scalability in HBase is called a Region. Regions are a subset of the table’s data and they are essentially a contiguous, sorted range of rows that are stored together.
Initially, there is only one region…
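As a side illustration (not part of the original article), the effect of regions is easy to see when pre-splitting a table from the HBase shell; the table name, column family, and split keys below are made up:

create 'mytable', 'cf', SPLITS => ['g', 'n', 't']

Instead of starting with a single region, the table begins with four regions covering the row-key ranges up to 'g', 'g' to 'n', 'n' to 't', and beyond 't', and these regions are distributed across the region servers.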

Oracle Big Data Connectors 2.1

Oracle Big Data Connectors 2.1 is now available.  
Oracle Loader for Hadoop and Oracle SQL Connector for HDFS add certification with CDH 4.2 and Apache Hadoop 1.1.1 in this release.
Enhancements to Oracle Loader for Hadoop: 
 - Ability to load from Hive partitioned tables
 - Improved usability and error handling
 - Sort by user-specified key before load

Hadoop Default Ports Quick Reference

Define your choice of ports by setting the properties dfs.http.address for the Namenode and mapred.job.tracker.http.address for the Jobtracker in conf/core-site.xml:
<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>50070</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>50030</value>
  </property>
</configuration>
Web UIs for the Common User
The default Hadoop ports are as follows:

Daemon                        Default Port   Configuration Parameter
HDFS  Namenode                50070          dfs.http.address
      Datanodes               50075          dfs.datanode.http.address
      Secondarynamenode       50090          dfs.secondary.http.address
      Backup/Checkpoint node* 50105          dfs.backup.http.address
MR    Jobtracker              50030          mapred.job.tracker.http.address
      Tasktrackers            50060          mapred.task.tracker.http.address
* Replaces the secondarynamenode in 0.21.

Hadoop daemons expose some information over HTTP. All Hadoop daemons expose the following:
/logs - Expo…
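As a quick illustration (not from the original reference), these HTTP endpoints can be fetched with any client; the hostname below is a placeholder and 50070 assumes the default Namenode port:

curl http://namenode-host:50070/logs/

The same path works against the other daemons on their respective default ports.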

Copying between two clusters that are running different versions of Hadoop

While copying between two clusters that are running different versions of Hadoop, it is
generally recommended to use HftpFileSystem as the source. HftpFileSystem is
a read-only filesystem. The distcp command has to be run from the destination server:

hadoop distcp hftp://namenodeA:port/data/weblogs hdfs://namenodeB/data/weblogs

In the preceding command, port is defined by the dfs.http.address property in the
hdfs-site.xml configuration file.
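For example, assuming the source cluster uses the default dfs.http.address port of 50070 (the NameNode names are placeholders), the command becomes:

hadoop distcp hftp://namenodeA:50070/data/weblogs hdfs://namenodeB/data/weblogs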

How to transfer data between different HDFS clusters

Overview
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution.
Usage
Basic
The most common invocation of DistCp is an inter-cluster copy:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths. One can also specify multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn…
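The excerpt is cut off mid-example; as a rough sketch under the same conventions (paths and NameNode addresses are illustrative), a multi-source invocation typically looks like this:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo

Each listed source tree is copied under /bar/foo on the destination cluster.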

Hadoop Hive Sample Reports with iReport

These sample reports are available to help users who are new to Jaspersoft technologies connect to data in Hadoop.
Loading Sample Data to Hadoop Hive
Copy the data file to the HDFS filesystem, changing the paths according to your filesystem:
hadoop fs -copyFromLocal $LOCAL_PATH/accounts.csv /user/hdfs
Start Hive as the hdfs user:
sudo -u hdfs hive
Create the accounts table in Hive:
CREATE TABLE accounts ( id STRING, date_entered STRING, date_modified STRING, modified_user_id STRING, assigned_user_id STRING, created_by STRING, name STRING, parent_id STRING, account_type STRING, industry STRING, annual_revenue STRING, phone_fax STRING, billing_address_street STRING, billing_address_city STRING, billing_address_state STRING, billing_address_postalcode STRING, billing_address_country STRING, description STRING, rating STRING, phone_office STRING, phone_alternate STRING, email1 STRING, email2 STRING, website STRING, ownership STRING, employees STRING, sic_code STRING, ticker_symbol STRING, …
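The CREATE TABLE statement above is truncated in this excerpt. Assuming the full statement declares ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' for the CSV data, the next step would be to load the file copied earlier into the table, roughly like this:

LOAD DATA INPATH '/user/hdfs/accounts.csv' INTO TABLE accounts;

Note that LOAD DATA INPATH moves the file from /user/hdfs into the table's warehouse directory, after which the sample reports can query it.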

Hadoop-LZO

Hadoop-LZO is a project to bring splittable LZO compression to Hadoop. LZO is an ideal compression format for Hadoop due to its combination of speed and compression size. However, LZO files are not natively splittable, meaning the parallelism that is the core of Hadoop is gone. This project re-enables that parallelism with LZO compressed files, and also comes with standard utilities (input/output streams, etc) for working with LZO files.

Hadoop and LZO, Together at Last
LZO is a wonderful compression scheme to use with Hadoop because it's incredibly fast, and (with a bit of work) it's splittable. Gzip is decently fast, but cannot take advantage of Hadoop's natural map splits because it's impossible to start decompressing a gzip stream starting at a random offset in the file. LZO's block format makes it possible to start decompressing at certain specific offsets of the file -- those that start new LZO block boundaries. In addition to providing LZO…
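As a rough sketch (not from the original post), wiring Hadoop-LZO into a cluster generally involves registering the codec classes in core-site.xml once the hadoop-lzo jar and native libraries are installed; treat the values below as the commonly cited ones rather than a definitive recipe:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Even with the codec registered, .lzo files generally need to be indexed with the project's indexer before MapReduce can split them across mappers.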