The following are some of the open source solutions for processing big data.
Hadoop: The Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop ecosystem includes these sub-projects:
HDFS – The Hadoop Distributed File System (HDFS)
is the primary storage system used by Hadoop applications. HDFS creates
multiple replicas of data blocks and distributes them across the compute
nodes of a cluster, which makes computations both reliable and fast.
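The replication idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not how HDFS is implemented: the round-robin placement, the tiny block size, and the names `split_into_blocks`, `place_replicas`, and `node-a` through `node-d` are all hypothetical. It only shows data being cut into blocks and each block being assigned to several distinct nodes.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes starting at a rotating offset are always distinct.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster and data, sized so the result is easy to inspect.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block available.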
MapReduce – MapReduce is a software framework, based on a programming model introduced by Google, that supports distributed computing over large data sets on clusters of computers.
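As a rough illustration of the programming model, the following single-process Python sketch counts words using separate map, shuffle, and reduce steps. It does not use the Hadoop API; the function names and sample documents are assumptions made for this example, and a real framework would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents.
counts = word_count(["big data big clusters", "data on clusters"])
```

The map and reduce functions never share state, which is what allows a framework to distribute them across a cluster.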
Pig – Pig is
a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of
Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data
sets.
Hive – Hive is
a data warehouse infrastructure built on top of Hadoop that provides
tools to enable easy data summarization, ad hoc querying and analysis of
large datasets stored in Hadoop files. It provides a mechanism to
put structure on this data and it also provides a simple query language
called HiveQL, which is based on SQL and enables users familiar
with SQL to query this data. At the same time, this language also allows
traditional map/reduce programmers to plug in their custom
mappers and reducers to do more sophisticated analysis which may not be
supported by the built-in capabilities of the language.
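One way such a custom mapper can be plugged in is as an external script: Hive's TRANSFORM clause streams rows to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The Python sketch below shows what the body of such a script might look like. The columns (`user_id`, `url`), the function name `transform_lines`, and the sample rows are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse


def transform_lines(lines):
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank input lines
        user_id, url = line.split("\t", 1)
        # Fall back to a placeholder when the value has no network location.
        domain = urlparse(url).netloc or "unknown"
        yield f"{user_id}\t{domain}"


# In a streaming script, the rows would come from sys.stdin and each
# result would be printed to stdout. Here two sample rows stand in.
sample_rows = ["u1\thttp://example.com/a/b\n", "u2\tnot-a-url\n"]
results = list(transform_lines(sample_rows))
```

Assuming the script is saved as `extract_domain.py` and a table named `page_views` exists (both hypothetical), a query along the lines of `SELECT TRANSFORM (user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views` would apply it to every row.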
HBase – HBase is
the Hadoop database. Use it when you need random, realtime read/write
access to your Big Data. This project’s goal is the hosting of very
large tables (billions of rows by millions of columns) atop clusters of
commodity hardware.
Voldemort – Voldemort is a distributed key-value storage system.
Cassandra – The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.