I started a Hadoop based open source web analytics project some time ago. Recently I did some more work on it and decided to blog about the progress. The project is called visitante and it's available on github.
Its goal is twofold. First, it provides a set of MR jobs for various descriptive analytics metrics, e.g., bounce rate, checkout abandonment etc. I find the blog site of Avinash Kaushik to be the best resource for web analytics. It's better than reading a book. I am implementing many of the metrics defined in this post of Avinash. Second, it will have a set of MR jobs for predictive analytics on web log data, e.g., predicting user conversion and making product recommendations.
In this post I will start off with some simple session based analytics. In follow-up posts I will address more complex metrics, including predictive ones.
Log Input
The input to most of the MR jobs in visitante is W3C compliant web server log data. Some of the MR jobs consume the raw log data and some consume the output of other MR jobs. Here is some sample data; it happens to be from an IIS web server.
2012-04-27 00:07:40 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /product/N19C4MX1 http://www.healthyshopping.com/product/T0YJZ1QH
2012-04-27 00:08:21 __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G /product/FPR477BM http://www.google.com
2012-04-27 00:08:24 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /product/NL0ZJO2L http://www.healthyshopping.com/product/N19C4MX1
2012-04-27 00:09:31 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /addToCart/NL0ZJO2L /product/NL0ZJO2L
2012-04-27 00:09:35 __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G /addToCart/FPR477BM /product/FPR477BM
2012-04-27 00:09:45 __RequestVerificationToken_Lw__=UJAQ1TQWAGVL;+.ASPXAUTH=C142FL33KKCV603E /product/7Y4FP655 http://www.twitter.com
The input consists of the following fields:
- Date
- Time
- Cookie containing sessionID and userID
- Page URL
- Referrer URL
The goal of the MR jobs is to produce structured tabular data. Once we have tabular data, we can wrap it in a Hive table and generate whatever aggregate metrics we are interested in through Hive queries.
Session Extractor MapReduce
This MapReduce job processes the log data, grouping by sessionID with secondary sorting on visit time. In the reducer output we get the data for each session, with the pages visited in the order of the visit. It generates the following output fields:
- SessionID
- UserID
- Session start date time (unix epoch time)
- Page visit time
- Visited page URL
- Time spent on a page
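Assuming tab separated reducer output, the records for one session might look like this (field values are made up for illustration, reusing the IDs from the sample log; the epoch value assumes UTC):

```
3GQ426510U4H  DJ4XNH6EMMW5CCC5  1335485260  00:07:40  /product/N19C4MX1  44
3GQ426510U4H  DJ4XNH6EMMW5CCC5  1335485260  00:08:24  /product/NL0ZJO2L  67
3GQ426510U4H  DJ4XNH6EMMW5CCC5  1335485260  00:09:31  /addToCart/NL0ZJO2L  0
```

The time spent on a page is simply the gap until the next page visit in the same session; for the last page of a session there is no next visit, so some convention is needed there.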
You can think of this MR job as an ETL process. Interesting metrics can be computed by wrapping the output in a Hive table.
Hive in Action
To wrap the output in a Hive table, you can either define a Hive external table over the data as is in the HDFS output directory, or use a Hive managed table and the Hive load command to load the data into the managed table's warehouse location.
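As a sketch, an external table over the session extractor output might look like the following. The table name, column names, and HDFS path are my assumptions for illustration, not taken from the project.

```sql
-- Illustrative schema matching the session extractor output fields;
-- the table name, column names, and HDFS location are hypothetical.
CREATE EXTERNAL TABLE session_page_visit (
  session_id          STRING,
  user_id             STRING,
  session_start_time  BIGINT,   -- unix epoch time
  visit_time          STRING,
  page_url            STRING,
  time_spent          INT       -- seconds spent on the page
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/visitante/session';
```

With an external table, dropping the table leaves the MR output in HDFS untouched, which is convenient when the data is produced by an upstream job.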
You could partition the data by visit date if you are planning to run date specific aggregate queries, e.g., the most popular page on a given date. The details of Hive DDL can be found in the Hive Wiki.
Partitioning in Hive gives you an index-like feature, as in an RDBMS, and can give a significant boost to Hive query performance. In your processing pipeline you could process every 24 hours' worth of log data after midnight and store the result in a new Hive date partition.
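A nightly job could register each day's output as a new partition. The sketch below assumes a hypothetical partitioned session table and per-day HDFS output directories; none of the names come from the project.

```sql
-- Hypothetical partitioned table; the nightly job adds one partition per day.
CREATE EXTERNAL TABLE session_page_visit_daily (
  session_id STRING,
  user_id    STRING,
  page_url   STRING,
  time_spent INT
)
PARTITIONED BY (visit_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Point the new partition at the day's MR output directory.
ALTER TABLE session_page_visit_daily
  ADD PARTITION (visit_date = '2012-04-27')
  LOCATION '/user/hadoop/visitante/session/2012-04-27';
```

Queries that filter on `visit_date` then only scan the matching partition directories.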
Sometimes there may be a mismatch between the actual data and the Hive schema. For example, I have the session start time as a long (unix epoch time), while your Hive table may have date and time as strings. The mismatch could be handled in the following way:
- Load the data into a Hive staging table.
- Define a Hive view on the staging table that matches the target Hive table.
- In the view, use a UDF to map the unix epoch time to date and time strings.
- Read from the view and insert overwrite into the target Hive table.
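The steps above can be sketched as follows. The table and column names are hypothetical; the epoch-to-string mapping uses Hive's built-in `from_unixtime` UDF, and the target table `session_visit` is assumed to have string date and time columns.

```sql
-- Staging table keeps the start time as unix epoch seconds.
CREATE TABLE session_visit_stage (
  session_id         STRING,
  user_id            STRING,
  session_start_time BIGINT,
  page_url           STRING,
  time_spent         INT
);

-- The view maps epoch time to date and time strings with the
-- built-in from_unixtime UDF, matching the target table's schema.
CREATE VIEW session_visit_view AS
SELECT
  session_id,
  user_id,
  from_unixtime(session_start_time, 'yyyy-MM-dd') AS start_date,
  from_unixtime(session_start_time, 'HH:mm:ss')   AS start_time,
  page_url,
  time_spent
FROM session_visit_stage;

-- Populate the target table from the view.
INSERT OVERWRITE TABLE session_visit
SELECT * FROM session_visit_view;
```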
How About Some Metrics
There are various metrics you could calculate by executing Hive queries on the data. Here are some examples:
- Most popular page in a day or month
- Distribution of visit time by hour of day
- Distribution of visits by referrer
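Two of these as Hive query sketches, assuming a hypothetical session table `session_page_visit` with the session extractor's output columns and a `visit_date` partition column:

```sql
-- Most popular page on a given date (visit_date is an assumed partition column).
SELECT page_url, COUNT(*) AS visits
FROM session_page_visit
WHERE visit_date = '2012-04-27'
GROUP BY page_url
ORDER BY visits DESC
LIMIT 1;

-- Distribution of visits by hour of day, counting each session once.
SELECT hour(from_unixtime(session_start_time)) AS hour_of_day,
       COUNT(DISTINCT session_id) AS visits
FROM session_page_visit
GROUP BY hour(from_unixtime(session_start_time));
```

Note that older Hive versions do not allow column aliases in GROUP BY, hence the repeated expression in the second query.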
Wrapping Up
In future posts I will talk about some more complex metrics. It will get really interesting when we get into predictive analytics, applying data mining algorithms to web log data using Hadoop.