Tag Archives: OLAP cubes

Solve order shenanigans

Today I’m going to blog about a problem we recently solved in a client’s cube: an error in the MDX script that’s very easy to make if you aren’t careful.

We’ll run a simple example in AdventureWorks (what else?) to demonstrate the issue.

The client had already added a calculation to their cube to show year-on-year growth. The formula is:

Create Member CurrentCube.[Measures].[Delta to PrevYear] as
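// (this year - last year) / last year, shown as a percentage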
(
    ([Measures].[Internet Sales Amount])
    -
    ([Measures].[Internet Sales Amount],
        ParallelPeriod(
            [Date].[Calendar].[Calendar Year],
            1,
            [Date].[Calendar].CurrentMember
        )
    )
)
/
    ([Measures].[Internet Sales Amount],
        ParallelPeriod(
            [Date].[Calendar].[Calendar Year],
            1,
            [Date].[Calendar].CurrentMember
        )
    )
, Format_String = "0.00%";

(some error checking removed for clarity)

This screenshot shows a couple of simple XLCubed Grids: the real value, and below it the percentage change. I’ve added an Excel calculation alongside to show that the results are as expected.

Later in the cube development, the client added a calculated member to their Product dimension: one that gives a total excluding one of the product categories.

To replicate this I’ll add a calculation for “All Ex Bikes”:

Create Member 
CurrentCube.[Product].[Product Model Categories].[All Products].[All Ex Bikes]
as
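// Grand Total minus the Bikes category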
(
    ([Product].[Product Model Categories].[All Products])
    -
    ([Product].[Product Model Categories].[Category].&[1])
);

And if we run the report again we get the following.

Notice the cell I’ve highlighted. The “All Ex Bikes” calculation works fine on the normal measure, but it gives totally the wrong number for the percentage calculation. What’s going on?

The problem is that in the highlighted cell Analysis Services has two calculations to consider when working out the result:

  • Compare this year to last year
  • Get the “Grand Total”, and subtract “Bikes”

As the number returned is 1.85% we can see that Analysis Services has chosen the second option, “Grand Total” – “Bikes”.

What we really want is for the calculation to be done by getting the subtotal, and then doing the percentage change based on that.

Fortunately the fix was a simple one. Analysis Services runs the calculations in the order they are found in the MDX script, with the later definition winning in overlapping cells, so to fix the issue we simply moved the new “All Ex Bikes” definition up above the percentage calculation.
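In script form, the corrected ordering looks like this, with both definitions exactly as before (error checking again removed for clarity):

// Dimension calculation first...
Create Member CurrentCube.[Product].[Product Model Categories].[All Products].[All Ex Bikes] as
(
    ([Product].[Product Model Categories].[All Products])
    -
    ([Product].[Product Model Categories].[Category].&[1])
);

// ...percentage calculation afterwards, so it is evaluated over the subtotal
Create Member CurrentCube.[Measures].[Delta to PrevYear] as
(
    ([Measures].[Internet Sales Amount])
    -
    ([Measures].[Internet Sales Amount],
        ParallelPeriod([Date].[Calendar].[Calendar Year], 1, [Date].[Calendar].CurrentMember))
)
/
([Measures].[Internet Sales Amount],
    ParallelPeriod([Date].[Calendar].[Calendar Year], 1, [Date].[Calendar].CurrentMember))
, Format_String = "0.00%";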

Now the number returned matches our expectations.

Pass and solve order can be a complex topic: by default the calculation defined later in the MDX script takes precedence, and an explicit SOLVE_ORDER property can change the ordering again, so take care whenever you add a new calculation to the script.

In this case the number is totally wrong, so it was easy to spot, but some bugs will be much more subtle, so watch out!

Parent-Child Dimensions in Analysis Services – Performance Walkthrough

Parent-child hierarchies are a good fit for many data structures such as accounts or employees, and while they can speed development in some cases, they can also cause performance problems in large cubes.

We often see customers with this type of performance issue, and thought it worth sharing a simple technique for altering the dimension structure to improve query speed.

The problem

Often parent-child hierarchies are created because that is the structure used in the relational source, so they seem a natural fit for modelling the members. In many cases, though, data exists only at the leaf level of the hierarchy, meaning parent-child isn’t really needed.

Performance problems occur because no aggregates are created for parent-child dimensions, as detailed in the Analysis Services performance guide:

Parent-child hierarchies

Parent-child hierarchies are hierarchies with a variable number of levels, as determined by a recursive relationship between a child attribute and a parent attribute. Parent-child hierarchies are typically used to represent a financial chart of accounts or an organizational chart. In parent-child hierarchies, aggregations are created only for the key attribute and the top attribute, i.e., the All attribute unless it is disabled. As such, refrain from using parent-child hierarchies that contain large numbers of members at intermediate levels of the hierarchy. Additionally, you should limit the number of parent-child hierarchies in your cube.

If you are in a design scenario with a large parent-child hierarchy (greater than 250,000 members), you may want to consider altering the source schema to re-organize part or all of the hierarchy into a user hierarchy with a fixed number of levels. Once the data has been reorganized into the user hierarchy, you can use the Hide Member If property of each level to hide the redundant or missing members.


The performance guide hints at re-organizing the hierarchy to improve performance, but doesn’t say how.

The solution

This article walks through the steps needed to change your parent-child hierarchy structure to use real levels, so that aggregations work and performance is as good as you’d expect from normal hierarchies.

This process is known as flattening or naturalizing the parent-child hierarchy.

Firstly, let’s look at the data in our relational source.

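A minimal parent-child product table, sketched here purely for illustration (the table and column names are assumptions), might look like:

-- Illustrative parent-child table: each row points at its parent
CREATE TABLE dbo.DimProduct
(
    ProductID   int           NOT NULL PRIMARY KEY,
    ParentID    int           NULL,          -- self-referencing parent key; NULL = root
    ProductName nvarchar(50)  NOT NULL
);

-- A few sample rows: groupings above, real products at the leaves
INSERT INTO dbo.DimProduct VALUES (1, NULL, 'All Products');
INSERT INTO dbo.DimProduct VALUES (2, 1,    'Bikes');
INSERT INTO dbo.DimProduct VALUES (3, 2,    'Mountain Bikes');
INSERT INTO dbo.DimProduct VALUES (4, 3,    'Mountain-100 Silver');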

Not a large dimension, but enough to demonstrate the technique. As you can see my real products are all at the leaf level.

The strategy is quite simple:

  • Create a view to separate the members into different levels.
  • Create a new dimension using these real levels.
  • Configure the dimension to appear like the original parent-child dimension, but with the performance of a normal dimension.

Create the view

We want to create a denormalized view of the data. To do this we join the Product table to itself once for each level. This does mean we need to know the maximum depth of the hierarchy, but often this is fixed, and we’ll build in some extra levels for safety.

The tricks here are:

  • Use coalesce() so that below a leaf we always repeat the leaf’s ID rather than getting a NULL. This allows us to join to the fact table at the bottom level of our hierarchy.
  • Leave the Name columns NULL below the leaves; this allows us to stop the hierarchy at the correct leaf level in each branch of the hierarchy.

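A minimal sketch of such a view, built against the illustrative table above and using the two tricks just described, might be:

-- Illustrative flattening view (names assumed). Each LEFT JOIN adds a
-- level; COALESCE repeats the leaf ID below the leaves, while the Name
-- columns stay NULL there so HideMemberIf can hide the padding levels.
CREATE VIEW dbo.vDimProductFlattened
AS
SELECT
    L1.ProductID   AS Level1ID,
    L1.ProductName AS Level1Name,
    COALESCE(L2.ProductID, L1.ProductID) AS Level2ID,
    L2.ProductName AS Level2Name,
    COALESCE(L3.ProductID, L2.ProductID, L1.ProductID) AS Level3ID,
    L3.ProductName AS Level3Name,
    COALESCE(L4.ProductID, L3.ProductID, L2.ProductID, L1.ProductID) AS Level4ID,
    L4.ProductName AS Level4Name,
    COALESCE(L5.ProductID, L4.ProductID, L3.ProductID, L2.ProductID, L1.ProductID) AS Level5ID,
    L5.ProductName AS Level5Name
FROM dbo.DimProduct AS L1
LEFT JOIN dbo.DimProduct AS L2 ON L2.ParentID = L1.ProductID
LEFT JOIN dbo.DimProduct AS L3 ON L3.ParentID = L2.ProductID
LEFT JOIN dbo.DimProduct AS L4 ON L4.ParentID = L3.ProductID
LEFT JOIN dbo.DimProduct AS L5 ON L5.ParentID = L4.ProductID
WHERE L1.ParentID IS NULL;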

Running this we get:

Obviously we can update this view to create more levels as required, but 5 are enough for now.

The Dimension

Next we go to BIDS (Business Intelligence Development Studio), add the view to our Data Source View, and then add a new dimension based on the view.

The key steps to creating the dimension correctly are:

  • Set the key attribute to Level5ID, and the name to Level5Name.
  • Create an attribute for each Level ID, and on each set the Name Column appropriately.
  • Create a hierarchy using these attributes in order.
  • On each attribute set AttributeHierarchyVisible to False.
  • On each level of the hierarchy set HideMemberIf to NoName.
  • Set up the Attribute Relationships between the levels.

You should end up with the following:

Dimension Structure (screenshot)

Attribute Relationships (screenshot)

If you browse the dimension you’ll see that it never goes as deep as level 5, even though the level exists. This is because we set up the member-hiding option and returned NULLs in our view.

Conclusion

And that’s it: you can now join to your fact tables at the lowest level, build your cube as normal, and get the performance benefits of aggregation!

See also

A tool to achieve the same result is available on CodePlex. We’ve not tried it ourselves, but it may well be a time-saver. It works in a similar way to the example above, and it’s often useful to understand how something works even if you choose to automate it.

Data sources for Excel dashboards: avoid spreadsheet hell

This is the first of two companion posts where we’ll discuss the alpha and omega of Excel dashboards: data access and dashboard publication. These are two weak areas in Excel, and they should be approached carefully when planning a new dashboard. Let’s start by reviewing the available data access options.

Copy / Pasting data

Are you, or someone in your organization, populating the spreadsheet manually? Or are you copy/pasting the data into the spreadsheet? This is the simplest method of getting data into Excel, but it can be dangerous, and it should be avoided when better options are available.

When you are dealing with structured data management (as you are when you create a dashboard) you have to plan ahead and make sure that when the data changes it doesn’t break your well-crafted dashboard. Each function and each chart must know where the data is and adjust to those changes when needed.

When you are pasting data there is a high risk of breaking something. The number of rows or columns in the new dataset may change, a time series chart may not recognize the new time periods, and you’ll probably have to update references manually. Again, plan carefully or you’ll end up in maintenance hell.

External table

You can create a link to an external table in Access, Oracle or another database via a standard ODBC connection. This ensures that the data is correctly funneled into the spreadsheet, but with real-world data it is very easy to exceed the Excel 2003 limit of 65,536 rows. You’ll be better off linking not to the raw data itself but to a query/view that aggregates it (one of the basic rules of dashboard design in Excel is to avoid calculations and derivative data; the data should arrive from the source already prepared for display).
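For example, a small aggregating view along these lines (the table and column names are invented for illustration) keeps the row count far below that limit:

-- Illustrative aggregating view; FactSales and its columns are assumptions
CREATE VIEW dbo.vSalesByMonth
AS
SELECT
    Region,
    YEAR(OrderDate)  AS OrderYear,
    MONTH(OrderDate) AS OrderMonth,
    SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY Region, YEAR(OrderDate), MONTH(OrderDate);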

Once the data is in Excel, there is not much difference between this and the previous option. You still need to use lookup functions to retrieve the data for your report tables and charts, and data integrity is something you must stress over constantly. When possible, use database functions like DSUM instead of lookup functions (there will be a post discussing this).
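As a quick illustration, with invented range names (SalesData including its header row, and CriteriaRange holding a field label plus a condition cell), compare:

=VLOOKUP("Europe", SalesData, 3, FALSE)
=DSUM(SalesData, "Sales", CriteriaRange)

The VLOOKUP depends on “Europe” sitting in the first column and the result in the third, so it breaks as soon as columns move; DSUM refers to the “Sales” field by name and picks up its conditions from the criteria range.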

Pivot tables

For an out-of-the-box Excel installation you may want to consider pivot tables. They are an interesting option for smaller datasets, and they have a nicely flat initial learning curve. Note, though, that pivot tables will make your file size much larger, because they store all the data in the spreadsheet, so scalability can become a major issue. They also work best with a strictly hierarchical data structure; if your data doesn’t fit neatly into that structure, this may be a problem. If you have a larger dataset you should consider an OLAP cube instead.

OLAP Cubes

The concept of an OLAP cube can be scary for the average Excel user, but once you start using them you’ll never turn back. Especially if you are using what Charley Kyd calls an “Excel-friendly OLAP cube”.

Unlike the other methods, an Excel-friendly OLAP cube (like XLCubed) does not store the data in the spreadsheet, eliminating the need for the usual data-refresh routine (open the dashboard, refresh, save and close). The cube is updated automatically, and you can query it using formulas similar to GETPIVOTDATA. This has a huge impact on the way you work: you get all the benefits of a regular pivot table plus several life-saving extras, and the dashboard will be simpler, cleaner and easier to maintain.
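For reference, this is the shape of a standard GETPIVOTDATA call (the cell reference and field names here are illustrative); the cube formulas follow a similar field/item pattern:

=GETPIVOTDATA("Sales Amount", $A$3, "Year", 2008, "Region", "Europe")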

Final Thoughts

You have several methods of data management in Excel, and you must decide which is best for each specific dashboard. Scalability is always an issue, so be sure your data doesn’t outgrow your chosen method. An Excel-friendly OLAP cube may require some up-front investment, but it will save you a lot of hassle in the long run.

Data management in Excel is a critical factor, and it will be discussed in detail in future posts.

The next post discusses the other end of a dashboard project: how to make the dashboard available to the users.