Tuesday, October 25, 2011

Visit our ProcessMaker booth at the VMware vForum in Beijing!



Our booth setup is finished, so come drop by! We are right next to the friendly Zimbra folks.

Tuesday, October 18, 2011

See the all-new ProcessMaker and Zimbra Integration - VMware vForum 2011, Beijing, October 26-27




We will be presenting the all-new ProcessMaker Zimlet at the vForum in Beijing, China on October 26-27. This seamless integration allows users to run and manage business workflows entirely from within Zimbra. A live demo of the integration will be performed at the conference. Hope to see you there!

We will launch the all-new ProcessMaker Zimlet at this forum. The seamless integration allows users to run and manage business workflows entirely within Zimbra.

A live demo will be given on site. We look forward to seeing you there!

Registration: info.vmware.com
Info: ProcessMaker


Friday, October 14, 2011

E-Commerce Analytics with Open Source


BI for e-commerce is straightforward. Only a handful of metrics are truly useful, and there are many advanced visualizations you can build with heatmaps and tracking on your main page. But before getting fancy (perhaps in a later post), there are a few steps that should be taken first to build a solid foundation.

This short article discusses ways to develop an in-house analytics solution in a short period of time, using widely available open source tools. At a later date, I will follow up with guides that use some of these products to illustrate these points.

The following products are mentioned: Pentaho, Palo, Rapid-I, and Hadoop. Databases which can be used include Infobright, Postgres and MySQL.

The only difference between a pilot project (proof of concept) and a full-scale BI engagement is the amount and sources of data involved.

1 - Identify available data sources 
2 - Identify key metrics & objectives
3 - Build a data repository
4 - Run analyses
5 - Review the results, and repeat


1 - Data sources
For e-commerce, the main data source is web server logs. The more detailed they are, the more useful they are. It is especially useful to tie campaigns, promotions, and new product launches to date ranges in the logs. Server logs are best left in their original format; they lose much of their usefulness once converted into a relational database.

Secondary sources could include tracking cookies tied to user accounts, pixel tracking, and session durations. All of this is data, and if it is in text format, even better.

The solution is to load it all into Hadoop, which excels at text analytics on very large data sets, and works on the data in its original format. By adjusting their MapReduce programs, analysts can discover something new simply by changing the algorithm, without worrying about data management or "re-loading" the data warehouse.
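
As a minimal sketch of the idea (classic Hadoop Java MapReduce API; the Apache-style log layout and the HDFS paths are assumptions for illustration), here is a job that counts hits per URL directly against the raw logs:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts hits per requested URL from raw Apache-style access logs kept in HDFS.
public class PageHitCount {

    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Common log format: host - - [date] "GET /path HTTP/1.1" status bytes
            String[] quoted = line.toString().split("\"");
            if (quoted.length > 1) {
                String[] request = quoted[1].split(" ");
                if (request.length > 1) {
                    url.set(request[1]);   // the requested path
                    context.write(url, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(url, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "page hit count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/raw
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/hits-by-url
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Changing the analysis later is just a matter of editing the map logic and re-running the job over the same untouched files.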


2 - Metrics, objectives
E-commerce analytics can become useless very fast if too many angles are considered. Many BI projects start with someone saying they don't know anything about BI, so just give them all the data and they will run some reports themselves. This is definitely the wrong track to be on!

E-commerce typically wants to look at only the following metrics:

- Customers, demographics 
- How long do people stay?
- Duration of site visit before a purchase
- Idle shopping carts - providing targeted incentives dramatically increases the chances of a sale (see the query sketch after this list)
- Top purchases, bottom purchases
- Campaigns and promotions 
- Web traffic sources i.e. where are our users coming from?
- Retargeting - show your users the same ads or same information the next time they come back
- Social - Trends are always trending - can we ensure that our site is keeping up? 
- SEM analytics i.e. is our SEM effective? Is it matching expectations?
- Are our competitors taking away our business? i.e. aggressive analytics - scraping competitor websites and doing price/product comparisons including ad placement 
- Showing personalized views of items + what's trending + demographics (key to helping users discover more on your site)
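
As one concrete example from the list above, here is a minimal sketch of the idle-cart check, run with plain JDBC against the summarized database built in step 3. The connection details and the shopping_carts schema are hypothetical; adapt them to your own.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Lists shopping carts untouched for 48+ hours so that targeted incentives
// can be sent. Connection details and the shopping_carts table are hypothetical.
public class IdleCarts {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/ecommerce", "analyst", "secret");
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT cart_id, customer_id, last_updated "
                + "FROM shopping_carts "
                + "WHERE purchased = false "
                + "  AND last_updated < now() - interval '48 hours' "
                + "ORDER BY last_updated");
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            System.out.printf("cart %d (customer %d) idle since %s%n",
                    rs.getLong("cart_id"), rs.getLong("customer_id"),
                    rs.getTimestamp("last_updated"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```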

What are the objectives?
This is perhaps the most important question you should be asking.

- Increased traffic? (SEM analytics, traffic sources, campaigns)
- Increased conversion (getting people to buy more stuff) 
- Increased sales? (shopping cart analysis, campaigns, promotions, retargeting)
- Competitive analysis (are we being undercut? do users leave our site for others?)


3 - Build a data repository
DO NOT THROW THINGS INTO A DATABASE! This is one of the *last* steps in a BI engagement, not the first! Hadoop is a much better choice. You can put your logs, text data and customer data, unedited, unaltered and unformatted, into an HDFS cluster. Then, using a combination of MapReduce/Hive programs, you can start summarizing and analyzing your data effectively.

A - Build a Hadoop cluster (it should have at least 5-10 nodes to be of any use)
B - Copy in all the logs that you have
C - Write some initial queries with Hive (see the sketch after this list)
D - Load these summarized results into a database so you can benefit from BI platforms like Pentaho
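
A sketch of step C, assuming the logs are already in HDFS: connect to Hive through its JDBC driver (the original HiveServer on its default port 10000), lay an external table over the raw files, and summarize. The paths, table names and the regular expression are illustrative assumptions; the summarized result would then be exported into Postgres, Infobright or MySQL, for example with a Pentaho Data Integration job (step D).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Lays a Hive external table over the raw logs in HDFS (the data is never
// moved or altered), then summarizes hits per URL. Paths, names and the
// regular expression are illustrative assumptions.
public class HiveLogSummary {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // One raw string column per log line; the files stay where they are.
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (line STRING) "
                + "LOCATION '/logs/raw'");

        // Pull the requested URL out of each line and count hits per URL.
        ResultSet rs = stmt.executeQuery(
                "SELECT regexp_extract(line, '\"[A-Z]+ ([^ ]+)', 1) AS url, "
                + "COUNT(1) AS hits "
                + "FROM raw_logs "
                + "GROUP BY regexp_extract(line, '\"[A-Z]+ ([^ ]+)', 1)");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```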


4 - Run your analyses
Having identified the top 3-5 things you want to look at, you now have a data repository (Hadoop), a summarized database loaded from Hadoop (Postgres, Infobright, MySQL, etc.), and objectives - so create your reports!

Ad-hoc tools like Pentaho Analyzer, its FOSS alternative Saiku, or Palo MOLAP are extremely useful at this stage. Data discovery comes naturally now, and if you need to revise your strategy, these tools can help you do that. If you find you are missing data, you can always go back and adjust your MapReduce programs or Hive queries without waiting for a data warehouse reload.

Data mining tools like Pentaho Data Mining or Rapid-I (both open source) allow for quick visualizations of your data. I would stay away from predictive analysis like neural networks, machine learning, linear regression or behavioral analysis. The reason is that e-commerce is poorly suited to prediction - competition is simply a click away, unlike in the credit card or telecommunications industries. Predictive data mining usually comes with a "confidence" level, meaning it is not an exact science - a high confidence does not guarantee results. Fraud detection is a key example: predictive analysis suggests possible fraud patterns, but someone must then follow up with direct action. Predicting buying behavior is only realistic when something like a loyalty program is in place.

Useful data mining algorithms include clustering, association, time series, classification, and decision models. These can be applied quickly and effectively, with a very low learning curve - see the clustering sketch below.
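
Here is a minimal sketch of the clustering case using Pentaho Data Mining (Weka) and its SimpleKMeans implementation. The customers.csv file of numeric customer features (visits, average order value, days since last purchase, ...) is a hypothetical export from the summarized database:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Segments customers into k groups with k-means using Weka, the engine
// behind Pentaho Data Mining. customers.csv is a hypothetical export of
// numeric customer features.
public class CustomerSegments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.csv"); // hypothetical file
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);       // pick k to match your segments
        kmeans.buildClusterer(data);
        System.out.println(kmeans);     // centroids describe each segment
        for (int i = 0; i < data.numInstances(); i++) {
            int cluster = kmeans.clusterInstance(data.instance(i));
            System.out.println("customer " + i + " -> segment " + cluster);
        }
    }
}
```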


5 - Reviewing results
BI is an iterative process. You rarely get the right analyses in the first round, and you might not get them all. Now it is time to look at what you have, decide what you want to do next (back to step #2), and do it.

The most important thing is data. If you don't have the data, you can't do analysis. For a good pilot with useful results, consider the following categorization of data:

Easy data:
- Web logs
- Session cookies
- Shopping carts
- Purchase history
- Campaigns/promotions
- Geospatial (based on IP range lists - see the sketch after this list)
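
The geospatial item is simpler than it sounds: convert each IPv4 address to a number and look it up in a sorted list of range starts. A minimal sketch follows; the sample ranges are made up, and real lists (from the regional registries or MaxMind's free GeoLite data) are far larger and carry explicit range ends as well.

```java
import java.util.Map;
import java.util.TreeMap;

// Maps an IPv4 address to a country by searching a sorted list of range
// starts. The two sample ranges below are made up for illustration.
public class IpToCountry {
    private final TreeMap<Long, String> rangeStarts = new TreeMap<Long, String>();

    public void addRange(String startIp, String country) {
        rangeStarts.put(toLong(startIp), country);
    }

    public String lookup(String ip) {
        // Simplification: assumes ranges are contiguous, so the nearest
        // start at or below the address decides the country.
        Map.Entry<Long, String> e = rangeStarts.floorEntry(toLong(ip));
        return e == null ? "unknown" : e.getValue();
    }

    private static long toLong(String ip) {
        String[] o = ip.split("\\.");
        return (Long.parseLong(o[0]) << 24) | (Long.parseLong(o[1]) << 16)
                | (Long.parseLong(o[2]) << 8) | Long.parseLong(o[3]);
    }

    public static void main(String[] args) {
        IpToCountry geo = new IpToCountry();
        geo.addRange("1.0.1.0", "CN");  // made-up sample data
        geo.addRange("1.0.16.0", "JP");
        System.out.println(geo.lookup("1.0.1.25")); // prints CN
    }
}
```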

Harder data to get:
- Competitive scraping (warning: legal minefield)
- Retargeting
- Heatmaps (where do your users move their mouse? is there a part of your main page where users spend a lot of time? this requires installing special software and code on your site)
- Social - it is very difficult to retrieve social information that can be used properly, and it usually comes at a heavy cost


Conclusion
All of the above depends on knowing what you have to do. BI tools are just tools. Choosing the right tool can save time and money, but nothing compares to having a clear, well-defined BI strategy in place.




Tuesday, September 27, 2011

Introduction to Palo

Hardly anyone has heard of a powerful open source BI solution called Palo. Started in 2001, it was in quiet development for almost 7 years, and only started gaining significant adoption in Western Europe and Australia about 3 years ago.

That's all about to change - you will hear more about Palo in the coming months, especially if you are living in the Greater China region.


This is a series of introductory posts about Palo, starting from an introduction (this particular post) and moving into more functional detail in later posts.


So what is Palo?

Palo's Web Interface
Palo is a Business Intelligence solution created by a German company called Jedox. It is currently at version 3.2 SR3 and has multiple interfaces, including an iPad application that can be downloaded for free from the iTunes App Store.

It is a MOLAP technology, meaning that it operates on pre-computed results stored in a multidimensional array. There are several advantages of MOLAP over ROLAP, but that's a subject best explained in another context.


I still don't get what Palo really is...

The best way to frame Palo is by listing some competing solutions on the market today: Oracle Hyperion Essbase, IBM Cognos TM1 and Microsoft Analysis/Reporting Services.

Surprisingly enough, the above 3 products are really the only major MOLAP tools on the market. There's not a lot of choice. Penetration is high in finance, budgeting, planning, marketing and data analytics. Chances are that if you work at a company of 200-300+ people, you have one of these solutions in-house.

Note that 2 of the 3 solutions above were acquisitions in the last 5 years alone.


So how does Palo stack up?

In the partner community, we like to say that Oracle, Microsoft and IBM are our best salespeople. After a year with one of the above proprietary solutions, end-users and IT departments are often desperately searching for an alternative: they are expensive, riddled with performance issues, and very heavy on maintenance (to the delight of BI consultants like us...).

According to our customers, Palo's best features are:

  • VERY FAST! Palo is all about speed. The first time I tried the solution with a dataset of around 200GB of raw data, the report came back in less than one second. Drill-downs, pivots, ad-hoc calculations are similarly speedy.

    After dealing with ROLAP data warehouses for years, I could feel the wind in my hair...
  • Total integration with Excel - Most BI tools have somewhat of a learning curve: you have to learn the interface, and there's a methodology to creating a report. If you have been creating reports in Excel, then Palo will feel like an old glove. You can even use native Excel formulas alongside Palo.

    Many BI tools are marketed as "user friendly". Well, there's really nothing more friendly than Excel...

    Keeping true to its open source roots, Palo also works with OpenOffice.
    Excel and Palo operations side by side
  • Designed for concurrency - Anyone who's ever dealt with Essbase or IBM Clarity will know what I'm talking about. Put ~5 people in the same room and have them start entering their budgets into the solution. Make sure everyone has their smartphones with them, because soon enough the solution will lock up and everyone will have a lot of free time to do something else, like play Angry Birds.

    Palo was designed from the ground up to handle concurrency; add to that the speed at which Palo processes data, and we've had up to 25 people entering data simultaneously without a single hiccup.

One more thing...

Many BI solutions today have some sort of iPad application or mobile interface. These iPad applications are typically a marketing gimmick (I admit that I love doing iPad demos in front of a projector with a black turtleneck and blue jeans...).

With Palo Mobile, the ability to write back from an iPad elevates the app from pretty charts and grids to something genuinely functional. You can change data directly on the iPad and it goes right back to the Palo server in real time. Your colleagues at the office will never know you were out!

Palo on the iPad
Download the Palo app today from the iTunes store for free and see for yourself: http://itunes.apple.com/us/app/palo-mobile/id429176062?mt=8


Note: Write-back capability is to be released in late Q3 of 2011. 

Knowledge Sharing in Greater China - September 2011

This month we traveled to China and Taiwan to deliver two customized training engagements on several Pentaho topics. One of the benefits of a customized agenda is that we get to go beyond the basics and really dig into topics that aren't generally covered.

Because Pentaho is open source, many of its users already have some exposure to the platform and need less beginner material. That lets us jump right into the subjects! However, if you are new to Pentaho, Pentaho Bootcamps are still a great way to get deep knowledge and walk away with actionable information.

Special thanks to our partners for helping us organize and deliver these engagements: Omniwaresoft in Taiwan and EmbraceSoft in China. 

Pentaho Hadoop & Data Integration - Hsinchu, Taiwan
In Taiwan, we focused on some very new technology with Pentaho and Hadoop. The information isn't widely circulated yet, and there is still a bit of confusion about what Pentaho and Hadoop do, so drop us a line if you would like to know more. Here was our one-day agenda:

AM:
- Overview
- Pentaho BI Platform and Data Integration Server
- Pentaho Hadoop Enterprise
PM:
- Pentaho Data Integration & ETL labs - ETL from a log4j log file over FTP
- Pentaho Report Designer
- Labs Reporting with Hive
- Labs Reporting with MySQL
- Conclusion and further discussion

Pentaho Agile BI, Reporting & CDF - Beijing, China
Agile BI is a huge buzzword in BI these days, but in most cases it means next to nothing. Pentaho's approach to Agile BI is about delivering results and data visualizations as quickly as possible, with your users sitting right beside you. No more long turnaround times for reports and analytics!

Finally, Pentaho's report creation tools are second to none. The technical flexibility of the Pentaho platform and its integration with PDI mean there's nothing we can't do with Pentaho Reporting. We also covered integration topics such as using YUI and FusionCharts with the Community Dashboard Framework.

Here was our two day agenda:

Day 1
- Getting started
- Agile BI
- Workshop: Designing Reports, Pentaho Report Designer

Day 2
- Parametrization
- Authoring Dynamic Reports
- Data Integration for Reporting (seamless integration of ETL and reporting)
- Yahoo User Interface Library + FusionCharts
- Community Dashboard Framework

And here are some pictures from our trip:

ITRI in Hsinchu, Taiwan


Lido Place, Beijing


The new Taipei terminal!


Landing in Beijing