Showing posts from January, 2014

Big Web Analytic

I started a Hadoop-based open source web analytics project some time ago. Recently I did some more work on it and decided to blog about that development. The project is called visitante and it’s available on GitHub. Its goal is twofold. First, there is a set of MR jobs for various descriptive analytics metrics, e.g., bounce rate and checkout abandonment. I find Avinash Kaushik's blog to be the best resource for web analytics. It’s better than reading a book. I am implementing many of the metrics defined in this post of Avinash. Second, I will have a set of MR jobs for predictive analytics on web log data, e.g., predicting user conversion and making product recommendations. In this post I will start off with some simple session-based analytics. In follow-up posts I will address more complex metrics, including predictive metrics.

Log Input

The input to most of the MR jobs in visitante is W3C compliant web server log data. Some of the MR …
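To make the idea of a session-based metric concrete, here is a minimal Python sketch of bounce rate. It is not visitante's code. It assumes a bounce is a session with exactly one page view, and that the log has already been reduced to (session id, URL) pairs. The function name `bounce_rate` and the sample data are made up for illustration. The counting step mirrors what an MR job would do: the map side emits one count per page view keyed by session, and the reduce side sums them.

```python
from collections import Counter


def bounce_rate(page_views):
    """Return the fraction of sessions that have exactly one page view.

    page_views: iterable of (session_id, url) pairs, one per page view.
    Assumption: a "bounce" is a session with a single page view.
    """
    # Equivalent of map (emit session_id -> 1) followed by reduce (sum).
    views_per_session = Counter(session_id for session_id, _url in page_views)
    if not views_per_session:
        return 0.0
    bounced = sum(1 for count in views_per_session.values() if count == 1)
    return bounced / len(views_per_session)


# Hypothetical sample: sessions s1 and s3 view a single page each.
sample = [
    ("s1", "/home"),
    ("s2", "/home"),
    ("s2", "/cart"),
    ("s3", "/product/42"),
    ("s4", "/home"),
    ("s4", "/checkout"),
]
rate = bounce_rate(sample)
```

With this sample, two of the four sessions have a single page view, so `rate` is 0.5.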

Different ways of configuring Hive metastore

Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into an MR plan, which is then submitted to the Hadoop cluster for execution.

The Hive table definitions and their mappings to the data are stored in a metastore. The metastore consists of (1) the metastore service and (2) the database. The metastore service provides the interface to Hive, and the database stores the data definitions, the mappings to the data, and other metadata.

The metastore (service and database) can be configured in different ways. In the default Hive configuration (Apache Hive as shipped, without any configuration changes), the Hive driver, the metastore interface and the database (Derby) all run in the same JVM. This configuration is called an embedded metastore. It is good for development and unit testing, but won't scale to a production environment as only a single user can conn…