I started on a Hadoop-based open source web analytics project some time ago. Recently I did some work on it and decided to blog about the development I did on the project. The project is called visitante and it's available on github. Its goal is twofold. First, there is a set of MR jobs for various descriptive analytic metrics, e.g., bounce rate, checkout abandonment, etc. I find the blog site of Avinash Kaushik to be the best resource for web analytics; it's better than reading a book. I am implementing many of the metrics defined in this post of Avinash. Second, I will have a set of MR jobs for predictive analytics on web log data, e.g., predicting user conversion and making product recommendations.
In this post I will start off with some simple session-based analytics. In future follow-up posts I will address more complex metrics, including predictive ones.

Log Input

The input to most of the MR jobs in visitante is W3C-compliant web server log data. Some of the MR jobs consume the raw log data and some consume the output of other MR jobs. Here is some sample data; it happens to be from an IIS web server.
2012-04-27  00:07:40  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /product/N19C4MX1 http://www.healthyshopping.com/product/T0YJZ1QH
2012-04-27  00:08:21  __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G  /product/FPR477BM http://www.google.com
2012-04-27  00:08:24  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /product/NL0ZJO2L http://www.healthyshopping.com/product/N19C4MX1
2012-04-27  00:09:31  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /addToCart/NL0ZJO2L /product/NL0ZJO2L
2012-04-27  00:09:35  __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G  /addToCart/FPR477BM /product/FPR477BM
2012-04-27  00:09:45  __RequestVerificationToken_Lw__=UJAQ1TQWAGVL;+.ASPXAUTH=C142FL33KKCV603E  /product/7Y4FP655 http://www.twitter.com
The input consists of the following fields. The cookie field carries both the sessionID and the userID.
  1. Date
  2. Time
  3. Cookie containing sessionID and userID
  4. Page URL
  5. Referrer URL
The goal of the MR jobs is to produce structured tabular data. Once we have tabular data, we can wrap it in a Hive table and generate whatever aggregate metrics we are interested in through Hive queries.

Session Extractor Map Reduce

This MapReduce job processes the log data and groups it by sessionID, with secondary sorting on visit time. In the reducer output we get the data for each session, with the pages visited in the session in the order of visit. It generates the following output fields:
  1. SessionID
  2. UserID
  3. Session start date-time (Unix epoch time)
  4. Page visit time
  5. Visited page URL
  6. Time spent on a page
You can think of this MR job as an ETL process. Interesting metrics can be derived by encapsulating the output in a Hive table.
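To make this concrete, here is what a couple of reducer output records might look like for the first session in the sample log above. The comma delimiter, the epoch values and the exact layout are illustrative assumptions on my part, not the project's fixed format.

3GQ426510U4H,DJ4XNH6EMMW5CCC5,1335485260,1335485260,/product/N19C4MX1,44
3GQ426510U4H,DJ4XNH6EMMW5CCC5,1335485260,1335485304,/product/NL0ZJO2L,67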

Hive in Action

To wrap the output in a Hive table, you can either take the data as is in the HDFS output directory and define a Hive external table over it, or use a Hive managed table and load the data into the Hive warehouse location with the Hive LOAD command.
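As an illustration, an external table over the session extractor output might be defined as follows. The table name, column names and HDFS path are my own assumptions; adjust the delimiter to whatever the job actually emits.

CREATE EXTERNAL TABLE session_page_visit (
  session_id    STRING,
  user_id       STRING,
  session_start BIGINT,
  visit_time    BIGINT,
  page_url      STRING,
  time_on_page  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/visitante/session/output';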
You could partition the data by visit date if you are planning to run date-specific aggregate queries, e.g., the most popular page on a given date. The details of Hive DDL can be found in the Hive wiki.
Partitioning in Hive gives you an index-like feature, as in an RDBMS, and can give a significant boost to Hive query performance. In your processing pipeline, you could process every 24 hours' worth of log data after midnight and store the result in a new Hive date partition.
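A partitioned variant of the table above could carry a visit_date partition column, and the nightly run could register each day's output directory as a new partition. Again, the names and paths are assumptions, not part of visitante.

CREATE EXTERNAL TABLE session_page_visit_daily (
  session_id    STRING,
  user_id       STRING,
  session_start BIGINT,
  visit_time    BIGINT,
  page_url      STRING,
  time_on_page  INT
)
PARTITIONED BY (visit_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/visitante/session/daily';

-- run after the nightly MR job finishes writing the day's output
ALTER TABLE session_page_visit_daily
ADD PARTITION (visit_date = '2012-04-27')
LOCATION '/user/hadoop/visitante/session/daily/2012-04-27';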
Sometimes there may be a mismatch between the actual data and the Hive schema. For example, I have the session start time as a long (Unix epoch time), while your Hive table may have date and time as strings. The mismatch can be handled in the following way, sketched in the example after the list:
  1. Load data into a Hive staging table.
  2. Define a Hive view on the staging table.
  3. The view should match the schema of the target Hive table.
  4. In the view, use a UDF to map the Unix epoch time to date and time strings.
  5. Read from the view and INSERT OVERWRITE into the target Hive table.
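Here is a minimal sketch of steps 2 to 5, assuming the staging table is shaped like the external table above and the target table wants date and time as strings. from_unixtime and to_date are built-in Hive UDFs; the table and column names are mine.

-- the view presents epoch times as date/time strings,
-- matching the schema of the target table
CREATE VIEW session_page_visit_v AS
SELECT
  session_id,
  user_id,
  to_date(from_unixtime(session_start))    AS session_date,
  from_unixtime(session_start, 'HH:mm:ss') AS session_time,
  from_unixtime(visit_time, 'HH:mm:ss')    AS visit_time,
  page_url,
  time_on_page
FROM session_page_visit;

-- session_page_visit_target is the pre-created target table
INSERT OVERWRITE TABLE session_page_visit_target
SELECT * FROM session_page_visit_v;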

How About Some Metrics

There are various metrics you could calculate by executing Hive queries on the data. Here are some examples; illustrative queries follow the list.
Most popular page in a day or month
Distribution of visit time by hour of day
Distribution of visits by referrer
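As an illustration, the three metrics above could be expressed against the tables sketched earlier as follows. The referrer query assumes a table over the raw log that retains the referrer URL field, which the session extractor output does not carry; all names here are assumptions.

-- most popular page on a given day
SELECT page_url, COUNT(*) AS visits
FROM session_page_visit_daily
WHERE visit_date = '2012-04-27'
GROUP BY page_url
ORDER BY visits DESC
LIMIT 1;

-- distribution of visits by hour of day
SELECT hour(from_unixtime(visit_time)) AS hour_of_day, COUNT(*) AS visits
FROM session_page_visit
GROUP BY hour(from_unixtime(visit_time));

-- distribution of visits by referrer domain
SELECT parse_url(referrer_url, 'HOST') AS referrer, COUNT(*) AS visits
FROM raw_page_visit
GROUP BY parse_url(referrer_url, 'HOST');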

Wrapping Up

In future posts I will talk about some more complex metrics. It will get really interesting when we get into predictive analytics, applying data mining algorithms to web log data using Hadoop.