The basic strucure of a cube

This week we’ll take a look at the basic structure of a cube from an end user perspective, as opposed to the architectural underpinnings. This is intended as a high level overview, and for brevity contains some generalizations, and focuses on Microsoft Analysis Services cubes.

An OLAP cube consists of several key elements, the most fundamental of which are dimensions and measures

1) Dimensions.

Dimensions are the business elements by which the data can be queried. They can be thought of as the ‘by’ part of reporting. For example “I want to see sales by region, by product by time”. In this case region, product and time would be three dimensions within the cube, and sales would be a measure, below. A cube based environment allows the user to easily navigate and choose elements or combinations of elements within the dimensional structure.

2) Measures

Measures are the units of numerical interest, the values being reported on. Typical examples would be unit sales, sales value and cost.

Note that there are modelling techniques which develop cubes with only one pseudo measure, typically called ‘value’ or similar, and implement what the user would think of as the measures through a dimension. There are performance and navigational reasons which can make this a good approach, but not one we’ll cover here in our introduction.

The diagram below shows a very simple cube which we’ll use for discussion.

This particular cube is for an exports business, and consists of 3 dimensions, Source, Route, and Time. The two measures are Packages, being the number of packages shipped, and Last, being the last shipped date.

Very few real world cubes will have just three dimensions, but I’ve yet to learn how to draw a 12 dimensional cube! The diagram above is enough to illustrate the fundamental principle, that at every intersection of the different dimensions, are stored the value for each of the measures. In larger real world cubes the principle is the same, just the numbers of intersections is larger.

The diagram highlights a few additional features of dimensions which need to be understood.

• Hierarchies

A dimension can contain one or more hierarchies. Hierarchies are really navigation or drill paths through the dimension. They are structured like a family tree, and use some of the same naming conventions (children / parent / descendant). Hierarchies are what brings much of the power to OLAP reporting, because they allow the user to easily select data at different granularity (day / month / year), and to drill down through data to additional levels of detail.

• Hierarchies consist of different levels. For example a time dimension would typically have a year, a month and a day level. A customer hierarchy may consist of Country, State, City, and Name levels.
• The levels are either implied in the case of dates, or exist as ‘attributes’ in the source data. So for example customer number 12324, John Brown, would have additional information recorded such as his address, broken into house number & street, city, state, country. Each of these is an attribute.
• Hierarchies are really ordered navigation paths through the attributes
• In Analysis Services 2005, the user’s view of a dimension will typically consist of both the defined Hierarchies, and also the Attributes. Attributes are ‘flat’, i.e. contain no ordered drill path.

• Members

A member is any single element within a hierarchy. For example in a standard Time hierarchy, 1st January 2008 would be a member, as would 20th February 2008. However January 2008, or 2008 itself could also be members. The latter two would be aggregations of the days which belong to them. Members can be physical or calculated. Calculated members mean that common business calculations and metrics can be encapsulated into the cube, and are available for easy selection by the user, for example in the simplest case Profit = Sales – Cost

• Aggregation

Aggregation is a key part of the speed of cube based reporting. The reason why a cube can be very fast when for example selecting data for an entire year, is because it has already calculated the answer. Whereas a typical relational database would potentially sum millions of day level records on the fly to get an annual total, Analysis Services cubes calculate these aggregations during the cube build and hence a well designed cube can return the answer quickly.

Sum is the most common aggregation method, but it’s also possible to use average, max etc. For example, if storing dates as measures it makes no sense to sum them.

The cube introduces a number of dimensions, hierarchies and measures, modeling the business of interest, and all of which are available to the end user to quickly and easily select, drill, and slice and dice. With a well designed cube the user benefits from a reporting environment which is highly flexible, contains the pre-calculated business metrics they regularly use, and is fast in terms of data retrieval.

/

So, what’s an OLAP Cube, anyway?

As "An Excel User in a Cubed Kingdom" I’m starting my exploration of this new found land with a simple question: what is an OLAP cube? In plain English, please…

I like this simple, non-technical definition: a cube is a set of predefined answers. It’s up to you to select the right questions.

OK, let’s detail this.

Imagine that you have a very large database with the usual business data: orders, customers, sales representatives… Now you want to know how much a customer category ordered over the last year. You query the database and you get the answer. Then you want monthly sales. Query it again. Dig a little deeper to see what products that category ordered. Query it once more.

What is happening behind the curtain? Each time you enter a new query the system looks at each transaction (or a subset) and performs the necessary calculations to answer your query. You’ll get your answers, but it will be painfully slow: depending on your query and the database size, it may take hours. It is not an option.

But you don’t really need to see each individual order, do you? If you only need to know monthly sales, why should your system go through each transaction? If you pre-aggregate that data, you’ll get your answers much, much, faster, because there aren’t five million records, just 100.000. You’ll be able to actually work, instead of staring at your monitor, waiting for an answer.

This is what a cube does. It provides faster answers by eliminating the unnecessary detail for the task at hand. You shouldn’t look at a cube as an unique, condensed version of the database. While you have a virtually infinite number of questions that the database can answer, a cube focus on providing answers to a small set of questions. That’s why you can have different cubes (marketing, fin, sales), all of them getting data from the same source. They all answer different sets of questions.

When designing a cube (it’s your job, not IT’s), resist the temptation of a one-size-fits-all cube. Clearly define a coherent set of questions for your fundamental business needs and make sure they are answered once the cube becomes available.

If you are exploring your data you will not want to wait one hour each time you make a change. On the other end, a fast cube with no data to explore is useless. There is  a fine balance between maximum flexibility and maximum performance.

An OLAP cube not only ensures that you retrieve the right data from the database but also allows you to explore it efficiently. Two good reasons to add OLAP cubes to your toolbox.

Next time we’ll see how plain English can describe the structure of a basic OLAP cube.

/

An Excel User in a Cubed Kingdom

I’ve been using Excel for my entire professional career, most of the time in large corporations where adding a piece of software to the standard IT structure would be some kind of heresy. When I have a business need that can be solve by an out-of-the-box Excel installation that’s the path I follow (I’m also a power user of the company’s formal BI tool, so I know where to draw the line).

Over time, I’ve developed a framework that helps me to solve problems from a very specific point of view: how to minimize file size, how to minimize calculation time, how to deploy, how to update, and so on. This is the logic that you often must follow, and you tend to believe it is the best one. At that point you must start a conversation (a very fashionable word nowadays..) with someone that doesn’t share that logic.

Take OLAP cubes, for example. I never use them. I use the corporate BI tool or I create a 100% pure Excel application. But then Andreas told me about XLCubed and how you could deploy online your always-updated-file. That was a turning point because for me those are two killer features that I’ve been yearning for a long time.

I started to play with the tool, but still using the same logic. And was plain wrong. Sure I could use all my Excel background, but I needed to adjust it to a different logic. Using an OLAP cube you get a new set of functions that simplifies much of your work and you need to reevaluate some Excel functions because some of them will perform better under this new environment. You’ll leverage your Excel background to create a a new logic at a higher level.

I’m an experienced Excel user, probably like you, and this is just Excel on steroids. I’m leaving my comfort zone, one foot at a time, and I’ll document and share with you my learning curve. So come with me to the Cubed Kingdom and we’ll walk through this together.

First stop, next post: what is a cube, anyway?

/