Posts

Showing posts from May, 2013

Sqoop with PostgreSQL

Download the PostgreSQL JDBC connector jar and store it in the lib directory of the Sqoop home folder:
http://jdbc.postgresql.org/download.html
List the databases:
sqoop-list-databases --connect jdbc:postgresql://localhost:5432/ --username postgres --password ""
List the tables from a database:
sqoop-list-tables --connect jdbc:postgresql://localhost:5432/ --username postgres --password ""
Import a PostgreSQL table into Hive:
root@mahesh:/data# sqoop-import --connect jdbc:postgresql://localhost:5432/db1 --username postgres --password "" --table tb1 --hive-table tablename --create-hive-table --hive-import -m 1
List the data in the file:
root@mahesh:/data# hadoop fs -cat /user/hive/warehouse/tb1/part-m-00000
* If your PostgreSQL database has tables in multiple schemas, you need to set your search_path in Postgres:
ALTER ROLE sqoopuser SET search_path TO customschema,public;
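
After the import completes, a quick sanity check from the Hive shell (a minimal sketch; tablename here is whatever was passed to --hive-table above):

hive> SELECT COUNT(*) FROM tablename;
hive> SELECT * FROM tablename LIMIT 10;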


Hive Tutorial - Part 2 (Internal Table and External Table)

Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
•  The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
•  Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
•  You want to use a custom location such as ASV.
•  Hive should not own the data and control settings, dirs, etc.; you have another program or process that will do those things.
•  You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
•  The data is temporary.
•  You want Hive to completely manage the lifecycle of the table and data.
Load some data (hadoop fs -put) and then verify it loaded (hadoop fs -lsr):

r…
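
For illustration, here is how the two flavors look in HiveQL (a minimal sketch; the table names, columns, and HDFS location are assumptions, not from the original post):

-- Internal (managed) table: Hive owns the data, so DROP TABLE
-- removes both the metadata and the files.
CREATE TABLE logs_managed (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive only tracks metadata, so DROP TABLE
-- leaves the files under /user/data/logs intact.
CREATE EXTERNAL TABLE logs_external (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/logs';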

Hive Tutorial

Data Types in Hive
Hive data types are categorized into two types: primitive and complex. The primitive data types include Integers, Boolean, Floating point numbers and Strings. The table below lists the size of each data type:
Type       Size
--------   ------------------------------------------------
TINYINT    1 byte
SMALLINT   2 bytes
INT        4 bytes
BIGINT     8 bytes
FLOAT      4 bytes (single-precision floating point numbers)
DOUBLE     8 bytes (double-precision floating point numbers)
BOOLEAN    TRUE/FALSE value
STRING     Max size is 2 GB

The complex data types include Arrays, Maps and Structs. These data types are built using the primitive data types.
Arrays: Contain a list of elements of the same data type. These elements are accessed by using an index. For example, in an array “fruits” containing the elements [‘apple’, ’mango’, ‘orange’], the element “apple” can be accessed by specifying fruit…
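
As a sketch of declaring and accessing the complex types (the table and column names are illustrative assumptions, not from the post):

CREATE TABLE employee (
  name    STRING,
  salary  FLOAT,
  skills  ARRAY<STRING>,                  -- list of same-typed elements, indexed from 0
  scores  MAP<STRING, INT>,               -- key/value pairs, accessed by key
  address STRUCT<city:STRING, zip:STRING> -- named fields, accessed with dot notation
);

SELECT skills[0], scores['hive'], address.city FROM employee;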

HIVE Introduction and Install

What Hive Does
Hadoop was built to organize and store massive amounts of data. A Hadoop cluster is a reservoir of heterogeneous data, from multiple sources and in different formats. Hive allows the user to explore and structure that data, analyze it, and then turn it into business insight.
How Hive Works
The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are composed of tables, which are made up of partitions. Data can be accessed via a simple query language, called HiveQL, which is similar to SQL. Hive supports overwriting or appending data, but not updates and deletes.
Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions…
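
To make the taxonomy concrete, a partitioned table maps directly onto HDFS directories (a minimal sketch; the table and partition names are illustrative):

CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- Each partition becomes a sub-directory of the table's HDFS directory:
--   /user/hive/warehouse/page_views/dt=2013-05-01/
--   /user/hive/warehouse/page_views/dt=2013-05-02/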

SQOOP Import and Export

Sqoop is a tool used to transfer data between Hadoop and relational databases. It reads the column information from the database and generates Java classes that represent that data for you to use in Map/Reduce jobs.

Prerequisites
•  GNU/Linux
•  Java 1.6.x (preferred)
•  SSH
Download the latest version of Sqoop for the version of Hadoop you downloaded.

http://www.apache.org/dyn/closer.cgi/sqoop/

Also, download the JDBC drivers for your database. You will need these later.
Oracle: http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
MySQL: http://dev.mysql.com/downloads/connector/j/

Download the MySQL Connector/J jar and store it in the lib directory of the Sqoop home folder.

Test your installation by typing:

$ sqoop help
List the MySQL databases in Sqoop:
> bin/sqoop list-databases --connect jdbc:mysql://localhost/ --username dbusername --password ""
List the MySQL tables in Sqoop:
> bin/sqoop list-tables --connect jdbc:mysql://localhost/databasename --username dbusername …
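
The excerpt cuts off before the import and export commands themselves; as a hedged sketch (the database, table, and directory names are assumptions), a basic MySQL import and export look like this.
Import a MySQL table into HDFS with a single mapper:
> bin/sqoop import --connect jdbc:mysql://localhost/databasename --username dbusername --password "" --table tb1 -m 1
Export data from an HDFS directory back into a MySQL table:
> bin/sqoop export --connect jdbc:mysql://localhost/databasename --username dbusername --password "" --table tb1 --export-dir /user/root/tb1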

Hadoop Introduction

What is Hadoop?
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Hadoop - Overview
•  Hadoop includes:
    –  Distributed File System - distributes data
    –  Map/Reduce - distributes application
•  Open source from Apache
•  Written in Java
•  Runs on
    –  Linux, Mac OS/X, Windows, and Solaris
    –  Commodity hardware

Hadoop Distributed File System
•  Designed to store large files
•  Stores files as large blocks (64 to 128 MB)
•  Each block stored on multiple servers
•  Data is automatically re-replicated as needed
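
As a quick sketch of these properties from the command line (the file path and replication factor are illustrative assumptions):

Put a file into HDFS, then inspect its blocks and replica locations:
hadoop fs -put bigfile.log /user/root/bigfile.log
hadoop fsck /user/root/bigfile.log -files -blocks -locations
Change the replication factor of a file:
hadoop fs -setrep 3 /user/root/bigfile.log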