Friday, October 14, 2011

E-Commerce Analytics with Open Source

BI for e-commerce is straightforward. Only a handful of metrics are truly useful, although there are many advanced visualizations one can do with heatmaps and tracking on your main page. But without getting all fancy (in a later post perhaps), there are a couple of steps that should be taken first in order to build a solid foundation for later work.

This short article discusses ways in which you can develop an in-house analytic solution in a short period of time, using widely available open source tools on the market today. At a later date, I will follow up with some additional guides using some of these products to illustrate these points.

The following products are mentioned: Pentaho, Palo, Rapid-I, and Hadoop. Databases which can be used include Infobright, Postgres and MySQL.

The only difference between a pilot project (proof of concept) and a full-scale BI engagement is the amount and sources of data involved.

1 - Identify available data sources 
2 - Identify key metrics & objectives
3 - Build a data repository
4 - Run analyses
5 - Review the results, and repeat

1 - Data sources
For e-commerce, the main source here is web server logs. The more detailed they are, the more useful they are. It's especially useful to tie in any campaigns, promotions or new product launches with date ranges. Server logs are best left in their original format; they lose much of their usefulness if they are converted into a relational database.

Secondary sources could include tracking cookies tied to user accounts, pixel tracking, and session durations. All these are data, and if they are in text format, even better.

The solution is to put the data into Hadoop, which excels at text analytics on very large data sets and can store the data in its original format. By adjusting Hadoop MapReduce programs, analysts can discover something new simply by changing the algorithm, without worrying about data management or "re-loading" the data warehouse.
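
To make the MapReduce idea concrete, here is a minimal sketch in plain Python of a map step and a reduce step that count successful page hits straight from raw log lines. It is not a Hadoop job, only the same logic run locally. The log layout (Apache Common Log Format), the function names and the sample lines are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: logs are in Apache Common Log Format. Adjust the pattern to your logs.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def map_line(line):
    """Map step: emit (path, 1) for each 2xx request; skip lines that do not parse."""
    match = LOG_PATTERN.match(line)
    if match and match.group("status").startswith("2"):
        yield match.group("path"), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_hits(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


# Hypothetical sample data, including one 404 and one unparseable line.
SAMPLE_LINES = [
    '10.0.0.1 - - [14/Oct/2011:10:00:00 +0000] "GET /product/42 HTTP/1.1" 200 512',
    '10.0.0.2 - - [14/Oct/2011:10:00:05 +0000] "GET /product/42 HTTP/1.1" 200 512',
    '10.0.0.1 - - [14/Oct/2011:10:01:00 +0000] "GET /cart HTTP/1.1" 200 128',
    '10.0.0.3 - - [14/Oct/2011:10:02:00 +0000] "GET /missing HTTP/1.1" 404 0',
    'not a log line',
]

hits = count_hits(SAMPLE_LINES)
print(hits)
```

Asking a different question (hits per IP, hits per hour) only means changing what map_line emits; the raw logs stay untouched.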

2 - Metrics, objectives
E-commerce analytics can become useless very fast if too many angles are considered. Many BI projects start off with someone saying they don't know anything about BI, but if you give them all the data they will run some reports by themselves. This is definitely the wrong track to be on!

E-commerce typically wants to look at only the following metrics:

- Customers, demographics 
- How long do people stay?
- Duration of site visit before a purchase
- Shopping carts that have been idle - providing targeted incentives dramatically increases the chances of a sale
- Top purchases, bottom purchases
- Campaigns and promotions 
- Web traffic sources i.e. where are our users coming from?
- Retargeting - show your users the same ads or same information the next time they come back
- Social - Trends are always trending - can we ensure that our site is keeping up? 
- SEM analytics i.e. is our SEM effective? Is it matching expectations?
- Are our competitors taking away our business? i.e. aggressive analytics - scraping competitor websites and doing price/product comparisons including ad placement 
- Showing personalized views of items + what's trending + demographics (very key to allowing users to discover more on your site)
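
As one example of turning a metric from this list into something measurable, the idle shopping cart check could be sketched as below. The record fields (cart_id, items, checked_out, last_updated), the 24-hour threshold and the sample carts are hypothetical; your own cart data will look different.

```python
from datetime import datetime, timedelta


def find_idle_carts(carts, now, idle_after=timedelta(hours=24)):
    """Return ids of carts that hold items, were never checked out,
    and have not been touched for at least idle_after."""
    return [
        cart["cart_id"]
        for cart in carts
        if cart["items"]
        and not cart["checked_out"]
        and now - cart["last_updated"] >= idle_after
    ]


# Hypothetical sample data.
NOW = datetime(2011, 10, 14, 12, 0)
CARTS = [
    {"cart_id": "c1", "items": ["camera"], "checked_out": False,
     "last_updated": datetime(2011, 10, 12, 9, 0)},   # idle for over two days
    {"cart_id": "c2", "items": ["tripod"], "checked_out": True,
     "last_updated": datetime(2011, 10, 12, 9, 0)},   # already purchased
    {"cart_id": "c3", "items": ["case"], "checked_out": False,
     "last_updated": datetime(2011, 10, 14, 11, 0)},  # touched an hour ago
    {"cart_id": "c4", "items": [], "checked_out": False,
     "last_updated": datetime(2011, 10, 10, 9, 0)},   # empty cart
]

idle = find_idle_carts(CARTS, NOW)
print(idle)
```

The resulting list is what a targeted-incentive campaign would start from.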

What are the objectives?
This is perhaps the most important question you should be asking.

- Increased traffic? (SEM analytics, traffic sources, campaigns)
- Increased conversion? (getting people to buy more stuff)
- Increased sales? (shopping cart analysis, campaigns, promotions, retargeting)
- Competitive analysis (are we being undercut? do users leave our site for others?)

3 - Build a data repository
DO NOT THROW THINGS INTO A DATABASE! This is one of the *last* steps in a BI engagement, not the first! Hadoop is a much better choice. You can put your logs, text data and customer data, unedited, unaltered and unformatted, into an HDFS cluster. Then, using a combination of MapReduce/Hive programs, you can start summarizing and analyzing your data effectively.

A - Build a Hadoop cluster (it should have at least 5-10 nodes to be of any use)
B - Copy in all the logs that you have
C - Write some initial queries with Hive 
D - Load these summarized results into a database so you can benefit from BI platforms like Pentaho
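
Step D can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Postgres, MySQL or Infobright, and it assumes the summarized results were exported as tab-separated text with three columns (day, path, hits). The table name, column names and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical summary export: day, path, hits (tab-separated).
SUMMARY_TSV = (
    "2011-10-13\t/product/42\t120\n"
    "2011-10-13\t/cart\t45\n"
    "2011-10-14\t/product/42\t150\n"
)


def load_summary(conn, tsv_text):
    """Create the summary table if needed and load the rows; returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_page_hits ("
        "day TEXT NOT NULL, path TEXT NOT NULL, hits INTEGER NOT NULL, "
        "PRIMARY KEY (day, path))"
    )
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(day, path, int(hits)) for day, path, hits in reader]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO daily_page_hits (day, path, hits) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_summary(conn, SUMMARY_TSV)

# The kind of query a reporting tool would run against the summary table.
top_pages = conn.execute(
    "SELECT path, SUM(hits) AS total FROM daily_page_hits "
    "GROUP BY path ORDER BY total DESC"
).fetchall()
print(loaded, top_pages)
```

Because the load replaces rows by (day, path), re-running it after adjusting a summary query overwrites the old numbers instead of duplicating them.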

4 - Run your analyses
Having identified the top 3-5 things you want to look at, you now have a data repository (Hadoop), a summarized database that loads from Hadoop (Postgres, Infobright, MySQL etc.), and objectives. Create your reports!

Ad-hoc tools like Pentaho Analyzer, the FOSS alternative Saiku or Palo MOLAP are extremely useful at this stage. Data discovery comes naturally now. If you need to revise your strategy, this is where the tools can help you do that. If you find you are missing data, you can always go back and adjust your MapReduce programs or Hive queries without waiting for a warehouse reload.

Data mining tools like Pentaho Data Mining or Rapid-I (both open source) allow for quick visualizations of your data. I would stay away from predictive analysis like neural networks, machine learning, linear regression or behavioral analysis. The reason is that e-commerce is poorly suited to it: competition is simply a click away, unlike in the credit card or telecommunications industries. Predictive data mining is usually associated with a "confidence" level, meaning that it is not an exact science, and a high confidence does not guarantee results. Fraud detection is a key example: predictive analysis suggests fraud patterns, but someone must then follow up with direct action. Predicting buying behavior is only realistic when something like a loyalty program is implemented.

Useful data mining algorithms include clustering, association, time series, classification and decision models. These can be used effectively and quickly, with a very low learning curve.
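
To give a feel for how simple the association case can be, here is a minimal sketch that counts which pairs of items are bought together and keeps the pairs that appear in a minimum share of orders (their "support"). It is a bare-bones illustration in plain Python, not how Pentaho Data Mining or Rapid-I implement it; the function name, threshold and sample orders are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(orders, min_support=0.5):
    """Return {(item_a, item_b): support} for item pairs that occur together
    in at least min_support (a fraction between 0 and 1) of all orders."""
    pair_counts = Counter()
    for order in orders:
        # De-duplicate and sort so each pair is counted once per order
        # and always appears in the same (alphabetical) orientation.
        for pair in combinations(sorted(set(order)), 2):
            pair_counts[pair] += 1
    total = len(orders)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical purchase history: one list of items per order.
ORDERS = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["camera", "case"],
    ["memory card", "tripod"],
]

pairs = frequent_pairs(ORDERS, min_support=0.5)
print(pairs)
```

Pairs that clear the threshold are natural candidates for "bought together" suggestions or bundled promotions.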

5 - Reviewing results
BI is an iterative process. You never get the right analyses in the first round, or you might not get them all. Now it is time to look at what you have, and what you want to do next (step #2) and go back and do it.

The most important thing is data. If you don't have the data, you can't do analysis. For a good pilot with useful results, consider the following categorization of data:

Easy data:
- Web logs
- Session cookies
- Shopping carts
- Purchase history
- Campaigns/promotions
- Geospatial (based on IP lists)

Harder data to get:
- Competitive scraping (warning: legal minefield)
- Retargeting
- Heatmaps (where does your user move their mouse? is there a part of your main page which the users spend a lot of time on? this requires installing some special software & code on your site)
- Social - very difficult to retrieve social information which can be utilized properly, and usually at a heavy cost

All of the above depends on knowing what you have to do. BI tools are just tools. Choosing the right tool can save time and money, but nothing compares to having a clear and defined BI strategy in place.