Business Intelligence

by Tony Hepo

Every single day of our lives, most of us generate data.

These days, unless you live on a ranch in the middle of nowhere and deal only in cash, it's unavoidable.  Most of this data ends up being personally identifiable: your ISP logging your Internet activity, an appointment at the doctor's, a purchase on a credit card, a phone call, and so on.

For the past 10 years I've been working as a consultant in the area known as "Business Intelligence" (it's marketing-speak for reporting).  I thought I'd share some of the lingo and techniques of the trade because, whether you like it or not, there are hundreds of thousands of people out there analyzing your data, and this article might at least give you some insight into how and why.

My aim for this article is to explain the process to someone with no technical background.  Even so, I'll leave out the project-management aspects, such as requirements gathering and workshops, and focus on the technology-related areas.  I also want to point out that most people analyzing your personal data are not evil.

The first step in any Business Intelligence project is to design the data warehouse.

This is a large database that stores all of your data.  It is not just a dumping ground; it must be designed correctly to fit the business and be efficient for reporting.  The most common design for a data warehouse is called a star schema, which has a central "fact table" containing the business "facts" (e.g. units sold, revenue, page views, transactions, calls made) surrounded by several "dimension" tables containing the descriptive information (e.g. dates, financial quarters, customer names, products, geographical locations).
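To make that concrete, here's a minimal sketch of a star schema.  I'm using Python with SQLite purely for illustration (real warehouses run on heavyweight databases), and the table and column names are invented for this example:

    import sqlite3

    # An in-memory database stands in for the real warehouse
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension: one row per product, descriptive attributes only
        CREATE TABLE dim_product (
            product_key   INTEGER PRIMARY KEY,   -- surrogate key
            product_name  TEXT,
            category      TEXT
        );

        -- Dimension: one row per calendar date
        CREATE TABLE dim_date (
            date_key           INTEGER PRIMARY KEY,   -- surrogate key
            calendar_date      TEXT,
            financial_quarter  TEXT
        );

        -- Fact table: one row per sale; nothing but measures and
        -- integer keys pointing at the dimensions
        CREATE TABLE fact_sales (
            product_key  INTEGER REFERENCES dim_product,
            date_key     INTEGER REFERENCES dim_date,
            units_sold   INTEGER,
            revenue      REAL
        );
    """)

Notice the fact table holds nothing but numbers: the measures themselves, plus integer keys pointing at the dimensions.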

For those of you who are database-savvy, the main aim is to reduce the number of joins for any given query, at the expense of allowing duplicate data in certain columns of the dimension tables (known as de-normalization).  If you're used to developing data-driven applications, this is counter-intuitive because you would normally try to reduce duplication and make each transaction as efficient as possible in storage terms.

For data warehousing you do the opposite.  The joins used are generally "inner" joins, should be made on integer key fields, and should not use any ID columns that relate to the business (e.g. Customer ID).  Instead, you build a unique key in each table, known as a surrogate key.  This method is known as dimensional modeling and is the standard approach in the data warehousing and Business Intelligence industry.  It is also known as the Kimball method, after Ralph Kimball, one of the early leaders in the field.
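Continuing the sketch above, a typical report query then looks like this: inner joins on the integer surrogate keys only, with the measures aggregated (again, the names are my invention):

    # Aggregate revenue by quarter and product category
    query = """
        SELECT d.financial_quarter,
               p.category,
               SUM(f.units_sold) AS total_units,
               SUM(f.revenue)    AS total_revenue
        FROM fact_sales f
        INNER JOIN dim_product p ON p.product_key = f.product_key
        INNER JOIN dim_date    d ON d.date_key    = f.date_key
        GROUP BY d.financial_quarter, p.category
    """
    for row in conn.execute(query):
        print(row)

However wide the schema gets, every question is answered with the same shape of query: one hop from the fact table to each dimension.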

Next you have to gather all of the data you need.

More than likely this will come from internal databases the company already has (e.g. finance systems, HR systems, retail point-of-sale databases), though sometimes it comes from outside (usually from suppliers) and tends to be delivered in "flat files": essentially just large comma-separated or tab-separated text files.

Once you've established what your sources are, you need to get them into the database and, regardless of the source (internal or external), it's very unlikely that the data you start with will resemble the star schema you designed.  This is where you need an ETL tool.  ETL stands for Extract, Transform, and Load.  These tools are purpose-built for importing data into data warehouses, though they can be used for other tasks.

They allow you to take the source data, extract the elements you need for the fact table, maintain any data in the dimension tables, create or look up all the keys you need to join the facts to the dimensions, and then load the result into the data warehouse.
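Commercial ETL tools do all of this graphically, but the underlying steps can be sketched in a few lines of Python against the tables from earlier.  Everything here is assumed for the example: the flat file's name and its column headings are made up:

    import csv

    def product_key_for(conn, name, category):
        """Look up the surrogate key for a product, creating the
        dimension row the first time this product is seen."""
        row = conn.execute(
            "SELECT product_key FROM dim_product WHERE product_name = ?",
            (name,)).fetchone()
        if row:
            return row[0]
        cur = conn.execute(
            "INSERT INTO dim_product (product_name, category) VALUES (?, ?)",
            (name, category))
        return cur.lastrowid              # new surrogate key

    # Extract: read the flat file delivered by the supplier
    with open("sales_extract.csv", newline="") as f:
        for rec in csv.DictReader(f):
            # Transform: swap the business values for surrogate keys
            pkey = product_key_for(conn, rec["product"], rec["category"])
            # Load: insert the finished row into the fact table
            conn.execute(
                "INSERT INTO fact_sales (product_key, date_key, units_sold, revenue)"
                " VALUES (?, ?, ?, ?)",
                (pkey, int(rec["date_key"]), int(rec["units"]), float(rec["revenue"])))
    conn.commit()

The key lookup is the heart of it: each incoming business value gets swapped for a surrogate key, with a new dimension row created the first time a value is seen.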

Once you've sorted out your data, you need a Business Intelligence software suite.  The main purpose of these tools is to allow business users (non-technical people like management, sales, and marketing) to perform complex analysis without having to understand what's going on at the database level.

Usually the person (or team) that designs the warehouse and builds the ETL is also responsible for modeling the data in the Business Intelligence tool.  This is achieved by replicating the star schema, choosing which fields should be visible to the users, and providing appropriate names and descriptions.  This area of the Business Intelligence tool is often referred to as the "Semantic Layer" or "Metadata Layer," as it is where you define what the data means to the users (metadata is just a term for "data about data").  If all of the above tasks (design, ETL, metadata definition) have been performed successfully, then all the users have to do is drag and drop.
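Stripped to its bare bones, a semantic layer is just a mapping from physical columns to the business-friendly names and descriptions the users see.  Commercial tools store far richer metadata than this toy version, but the idea is the same:

    # Physical column -> (name the user sees, description)
    semantic_layer = {
        "fact_sales.units_sold":      ("Units Sold", "Number of units in the sale"),
        "fact_sales.revenue":         ("Revenue",    "Value of the sale"),
        "dim_product.product_name":   ("Product",    "Name of the product sold"),
        "dim_date.financial_quarter": ("Quarter",    "Financial quarter of the sale"),
    }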

There are many vendors selling software for Business Intelligence and ETL, and over the last few years there has been a period of consolidation where the larger companies have been buying up the smaller ones.

Most of the large database vendors are building suites of software including all of the necessary components for Business Intelligence.  The major players in Business Intelligence are BusinessObjects (owned by SAP), Cognos (owned by IBM), PerformancePoint (owned by Microsoft), and Oracle Business Intelligence Enterprise Edition.

The major players in the ETL world are Informatica, Oracle Data Integrator, BusinessObjects Data Integrator and SQL Server Integration Services (Microsoft).

From personal experience, it is most common to run these packages on a Windows server platform.  However, many vendors offer some or all of their suite for UNIX and Linux platforms.  I have implemented Oracle Business Intelligence on Linux myself.

There are a handful of open-source Business Intelligence software suites.  The ones best known to me are Jaspersoft and Pentaho, but they don't have much recognition in the industry and are not commonly used by major corporations or consultancies.

If anyone wants to have a play with an enterprise-level solution, it's well worth registering on the Oracle Technology Network at www.oracle.com/technology where you can download full, unrestricted versions of most of their products - but check the license conditions to make sure they apply to you.

The big questions most people ask (especially 2600 readers, I suspect) are why companies want to analyze their personal data, what they're using it for, and what they do with that information.

The first thing I'd like to point out is that despite the fact that the data is personally identifiable, in most cases the analysis isn't.  Most of the time people aren't analyzing your activity specifically - you are just a statistic.

Companies want to segment their customers into groups to determine what people are doing (e.g. "people buying product X and Y often buy product Z") or just to test uniqueness (e.g. "shoppers buying drink X buy an average of 3.5 cans per week but shoppers buying brand Y buy an average of 5.7 cans").

This sort of information helps them to plan their distribution (so you don't turn up at your local Kwik-E-Mart and find they've not got any stock of your favorite comestible) or the layout of the stores.

People often buy ham sandwiches and chips together, so why not put them next to each other, near the front of the store?  These things do kind of help us out, and at no point are we being personally identified.
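For what it's worth, the essence of that kind of affinity analysis fits in a few lines of Python.  The baskets below are made up, and a real analysis runs across millions of anonymous transactions rather than four:

    from collections import Counter
    from itertools import combinations

    # Each basket is one anonymous transaction - no customer attached
    baskets = [
        {"ham sandwich", "chips", "cola"},
        {"ham sandwich", "chips"},
        {"chips", "cola"},
        {"ham sandwich", "chips", "candy"},
    ]

    # Count how often each pair of products shares a basket
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most common pairs suggest what to shelve together
    print(pair_counts.most_common(3))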

Now I can just about hear the distant sound of keyboards typing out hate mail because I'm advocating data mining.

Remember, all I'm saying is that these acts aren't always evil and it's not always personal, as most quality analysis is performed in aggregate across millions of cases.  It's hard these days not to be captured by "the system," but you can do your best by paying in cash, using local farmers' markets instead of national chains (this also helps your local community), using unregistered pre-pay cell phones, payphones, etc.

One way or another, though, you're bound to end up captured in someone's data warehouse as someone's statistic.

You are a number.  You are not a free man.
