DAX Performance tips – lessons from the field

XLCubed has supported a drag/drop interface for creating reports against Tabular Analysis Services since the first release of the new engine. It lets users easily create reports which run DAX queries on the cube, and we’ve often seen it perform very well at customer sites where MDX against Tabular had been the cause of long-running reports.

So when we were approached at SQL Pass in Seattle by some attendees who had an SSAS Tabular performance issue, we were optimistic we’d be able to help.

In this case the business wanted to retrieve thousands of rows from the cube at the transactional level, and the first approach had been to use PivotTables in Excel. To get to the lowest level they cross-joined the lowest levels of all the hierarchies on the rows section, which would give the right result, but performance was terrible: several queries took 20 minutes or more and others did not return at all.

We hoped using an XLCubed table running DAX would be the solution and created the same report in the designer. Sadly, while performance was a little better, it was still far from acceptable; the model was large, and the number of columns combined with their cardinality meant that a lot of work was being done on the server.

XLCubed’s DAX generator was trying to cross-join all the values from each column, which had worked well for our other customers. But when there are a dozen columns including the transaction ID, things do not go so well. DAX in itself is not a magic bullet and SSAS Tabular models can hit performance problems on low-level data – we needed a new approach.

After some investigation we discussed the issue and our thinking with our friends at SQLBI, and determined that we wanted an option to use Summarize() instead of a cross-join. Summarize() only uses the rows that exist in the database, and it can access columns related to the summarized table, which were required for the report.
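To make the difference concrete, here is a minimal sketch in plain Python (not DAX) of why a cross-join does more work than a summarize-style query. The rows and column values are made up for illustration, and the sketch only models the shape of the result, not how the Tabular engine evaluates either function.

```python
from itertools import product

# Hypothetical fact rows: (transaction_id, education, color)
sales = [
    (1, "Bachelors", "Red"),
    (2, "Bachelors", "Red"),
    (3, "High School", "Blue"),
    (4, "Graduate", "Red"),
]

educations = sorted({row[1] for row in sales})
colors = sorted({row[2] for row in sales})

# Cross-join style: every value of one column paired with every value of the other.
crossjoin_rows = list(product(educations, colors))

# Summarize style: only the combinations that actually occur in the fact rows.
summarize_rows = sorted({(row[1], row[2]) for row in sales})

print(len(crossjoin_rows))   # 3 educations * 2 colors = 6 candidate rows
print(len(summarize_rows))   # 3 combinations that really exist
```

With two small columns the gap is trivial, but the cross-join grows with the product of the column cardinalities, so a dozen columns including a transaction ID quickly becomes unmanageable.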

As the customer’s report had the transaction ID in it, the result wasn’t aggregated, even though we were using Summarize(). But we wanted to add true transactional reporting too, using the Related() function.

Finally, SQL Server 2016 adds a couple of new functions, SummarizeColumns() and SelectColumns(), both of which are useful for this type of reporting and offer better performance than the older equivalents.

The end result in XLCubed is a new option for DAX tables to allow users to set the type of report they want to run, and some internal changes so that XLCubed will automatically use the most efficient DAX functions where they are available.

A beta was sent to the business users and the results were fantastic. The report which had run for several minutes now completed in a few seconds, and the one which had taken 20 minutes was down to 15 seconds – we had some very happy users!

The changes will be in the next release of XLCubed so that all our customers can benefit from the improvements. It’s always nice when a customer request helps improve the product for everyone.

A sample of the syntax change is included below:

Before:

EVALUATE
FILTER (
    ADDCOLUMNS (
        KEEPFILTERS (
            CROSSJOIN ( VALUES ( 'Customer'[Education] ), VALUES ( 'Product'[Color] ) )
        ),
        "Internet Total Units", 'Internet Sales'[Internet Total Units],
        "Internet Total Sales", 'Internet Sales'[Internet Total Sales]
    ),
    NOT ISBLANK ( [Internet Total Units] )
)
ORDER BY
    'Customer'[Education],
    'Product'[Color]

After:

EVALUATE
FILTER (
    ADDCOLUMNS (
        KEEPFILTERS (
            SUMMARIZE ( 'Internet Sales', 'Customer'[Education], 'Product'[Color] )
        ),
        "Internet Total Units", 'Internet Sales'[Internet Total Units],
        "Internet Total Sales", 'Internet Sales'[Internet Total Sales]
    ),
    NOT ISBLANK ( [Internet Total Units] ) || NOT ISBLANK ( [Internet Total Sales] )
)
ORDER BY
    'Customer'[Education],
    'Product'[Color]

Streamlining writeback with XL3DoWriteback

We were recently asked by one of our customers to help them improve their forecasting process. They had originally been using a solution developed using XLCubed Excel Edition v6.0 and our XL3LookupRW formula. The system had been working, but because of a combination of the intricacy of the data model and the slowness of the cube server when performing a writeback, the process was taking much longer than necessary.

As an example, one of the workbooks that was being used contained nearly 7,000 XL3LookupRW formulae, and another contained over 1,000. Many of these lookups could actually have been replaced by a simple Excel formula, such as a sum or a product of other values, but built as it was, the customer was having to type these values into the cells individually: a tedious, time-consuming and error-prone task.

The process before XL3DoWriteback

In the screenshot above, the price, percentage and production figures would be typed in, then their product calculated (in the white cells). This would then be individually copied and pasted into the corresponding cell in the revenue row.

What the customer wanted was a couple of changes to streamline the process:
* the ability to use Excel formulae in the workbook to obtain the final values – without the subsequent copying of values,
* the ability to get all the calculations lined up, then submit them all at once – this would make the poor server performance a much less important issue, since the time previously spent waiting to enter the next value could be usefully spent on other tasks.

What we offered was a different writeback method, which has been available in its current form since XLCubed v6.5: the XL3DoWriteback formula.

Unlike XL3LookupRW, XL3DoWriteback is geared towards the kind of batch writeback approach that the customer had envisioned. Once set up, Excel formulae can be used to do the actual work of calculating the numbers, and the XL3DoWriteback formulae remain dormant until all the values are ready, then are activated in one transaction.

If this sounds useful for you, here’s how to set it up.

The XL3DoWriteback Formula

In addition to the member list required by the XL3LookupRW formula, the XL3DoWriteback formula requires two extra parameters:

  • PerformWriteback: this parameter tells the formula whether it should be in active writeback mode, or should remain dormant
  • Value: this parameter gives the new value that should be written back to the tuple

Following these two parameters are the connection number, and the hierarchy-member pairs that will be familiar to you from the XL3Lookup and XL3LookupRW formulae.

The PerformWriteback parameter is a bit special. If it refers to a cell that contains only a boolean value of TRUE, then once the formula has finished sending its value, it sets that cell back to FALSE. This means that periods of writing and non-writing are very easy to define. In order to maximise the power of this, we usually point all the XL3DoWriteback formulae at a single PerformWriteback cell, which we can switch using an XL3Link formula. For example:

A1: =XL3Link(XL3Address($A$1),"Write changes",,XL3Address($B$1),TRUE)
B1: FALSE
C3: 1,000
C4: 0.85
C5: 20,132
C6: =C3*C4*C5
D3: =XL3DoWriteback($B$1,C3,1,"[Account]","[Account].[Production]",
     "[Date]","[Date].[Calendar].[January 2011]")
D4: =XL3DoWriteback($B$1,C4,1,"[Account]","[Account].[Our %age]",
     "[Date]","[Date].[Calendar].[January 2011]")
D5: =XL3DoWriteback($B$1,C5,1,"[Account]","[Account].[Price]",
     "[Date]","[Date].[Calendar].[January 2011]")
D6: =XL3DoWriteback($B$1,C6,1,"[Account]","[Account].[Forecast Revenue]",
     "[Date]","[Date].[Calendar].[January 2011]")

In this example, C3, C4 and C5 are cells containing the raw values. Since we know that the forecast revenue is a product of the production, the percentage and the price per unit, C6 is just the product over those three cells. The four XL3DoWriteback formulae in column D refer to these value cells, but because the value in cell B1 is FALSE, nothing is written back yet.

In cell A1 is an XL3Link formula that, when clicked, will change B1 to TRUE. This immediately signals the XL3DoWriteback formulae that they should gather and write back their values. Once that transaction has been sent to the cube, the XL3DoWritebacks set cell B1 back to FALSE, and the workbook is back to the ready state.
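The flag-driven cycle described above can be sketched in a few lines of Python. This is only a toy model under our own assumptions: the class and method names are hypothetical and are not part of XLCubed, and the "cube" is just a list that records each batch.

```python
class WritebackSheet:
    """Toy stand-in for a sheet whose writeback formulae share one flag cell."""

    def __init__(self):
        self.perform_writeback = False  # plays the role of cell B1
        self.values = {}                # member -> value waiting in the sheet
        self.transactions = []          # batches the pretend cube has received

    def set_value(self, member, value):
        # While the flag is FALSE, editing a value sends nothing to the cube.
        self.values[member] = value

    def click_link(self):
        # The link flips the flag to TRUE; every formula then writes in one batch.
        self.perform_writeback = True
        self.transactions.append(dict(self.values))
        # Once the batch has been sent, the flag goes back to FALSE.
        self.perform_writeback = False


sheet = WritebackSheet()
production, share, price = 1000, 0.85, 20132
sheet.set_value("Production", production)
sheet.set_value("Our %age", share)
sheet.set_value("Price", price)
sheet.set_value("Forecast Revenue", production * share * price)  # like C6

print(len(sheet.transactions))  # 0: nothing sent yet
sheet.click_link()
print(len(sheet.transactions))  # 1: all four values went in a single batch
```

The point of the design is that the slow step happens once per click rather than once per cell, so entering values never waits on the server.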

The Setup

To make it as easy and efficient as possible, we used:

  • one section for values. These were a mix of XL3Lookup formulae, typed-in values and standard Excel formulae
  • one section for XL3DoWriteback formulae. We pared away any excess XL3DoWriteback formulae, leaving only those cells that we were sure we wanted to be writeable
  • a single cell with the boolean value, set to FALSE
  • an XL3Link in a highly visible place, to switch the boolean cell. In this case, the cell containing the boolean value was B1:
=XL3Link(XL3Address($A$1),"Write changes",,XL3Address($B$1),TRUE)

The final workbook looked a little like this (except, of course, much larger!):

A section from the finished workbook

The customer would then enter all the necessary values on the left section, using whatever combination of Excel formulae, cube lookups and typed-in values he needed, without any wait between entries. A single click of the XL3Link then wrote the values back in a single batch, leaving the customer to do other jobs.

The revised model allows the user to update entries quickly and efficiently, without any ‘write’ delay. The numbers to be written back can be calculated using Excel formulae as needed based on the raw input numbers. When the input process is done and checked in Excel, everything can be committed to the cube with one button press. The end result – a happy customer, with more time to plan and analyse the budget, rather than just input it.

Further reading

XL3DoWriteback formula reference

Warning: Excel can get Volatile

There is a revised version of this post here.

Excel is a great tool for dashboard/report delivery and design (it’s why we created our addin in the first place), but there is a hidden performance trap:

Offset, Now, Today, Cell, Indirect, Info and Rand

If you’ve ever used any of these formulae, you may have noticed that whenever you change a cell, or collapse/expand a data grouping, Excel recalculates. That is because these are VOLATILE formulae: as soon as you use one of them, Excel enters a mode where it is always recalculating, and for good reason.

Offset & Now are the formulae we see used most often. Let’s look at each of these in turn and talk about some alternate approaches to avoid this issue.

Offset

This is by far the most common of these danger formulae that we see in use. Here’s the formula definition:

=Offset(reference,rows,cols,height,width)
Returns a reference to a range that is a given number of rows and columns 
from a given reference.

We typically see these as part of a named range definition for driving chart source data – it allows the number of rows/columns driving the chart data to change automatically; a not unusual requirement when it comes to building reports (especially when a report contains some user defined filters or slicers). Here’s an example:

A very simple spreadsheet – we can type the number of months to display in the chart. In reality the number of months to display will probably be driven by the data available for the criteria selected. The screenshot already shows the issue we have – the chart is set up to display a maximum of 12 months, but we only have 3 months of data available.

The most obvious approach is to use the Offset formula to pick the chart area automatically. We could create a named range such as:

Now we just change the chart data source to be the named range:

The chart is now plotting 3 months, but will automatically update to show the required number of months:

BUT we have now used a volatile formula – although this is a simple workbook, we are now in a position where Excel is going to have to recalculate everything all the time. It’s probably a good time to look at why Excel is going to do that. Let’s have a look at a very simple formula to understand how Excel recalculates things.

Consider the formula:

C1    =A1 + B1

We can see that C1 is dependent upon A1 & B1 – so whenever a value in either of these cells changes, C1 will need to be recalculated to show the correct answer. Excel knows about this dependency because it maintains a dependency tree; it knows which cells need to be recalculated whenever any other cell changes. This is a very efficient way of working: if a workbook has thousands of formulae but only one value changes, and only 10 of those formulae depend on it, then only those 10 will be recalculated.
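As a rough illustration of that idea, the sketch below walks a tiny hand-made dependency map in Python. It is not how Excel is implemented; the cell names and the function are our own, and a volatile cell is modelled simply as one that is treated as changed on every recalculation.

```python
# Hypothetical sheet: each formula cell lists the cells it reads.
precedents = {
    "C1": {"A1", "B1"},   # C1 = A1 + B1
    "D1": {"C1"},         # D1 builds on C1
    "E1": {"A2"},         # E1 is unrelated to A1 and B1
}

def cells_to_recalculate(changed, precedents, volatile=frozenset()):
    """Return the formula cells that need recalculating after `changed` is edited."""
    dirty = {changed} | set(volatile)  # volatile cells always count as changed
    result = set(volatile)
    grew = True
    while grew:                        # repeat until no new dependents turn up
        grew = False
        for cell, reads in precedents.items():
            if cell not in result and reads & dirty:
                result.add(cell)
                dirty.add(cell)
                grew = True
    return result

print(sorted(cells_to_recalculate("A1", precedents)))                   # ['C1', 'D1']
print(sorted(cells_to_recalculate("A2", precedents)))                   # ['E1']
print(sorted(cells_to_recalculate("A2", precedents, volatile={"C1"})))  # ['C1', 'D1', 'E1']
```

The last call shows the cost of volatility in this model: an edit that should only touch E1 also drags in C1 and everything built on it.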

If C1 contained:

C1    =Sum(A1:A20)

We know that C1 depends upon any of the cells A1:A20, and so does Excel. But what if C1 was:

C1    =Sum(Offset(A1,0,0,B1,1))

Which cells is C1 dependent upon? At a glance you could say A1 & B1.

But B1 contains the number 20, so C1 is actually dependent upon A1:A20 and B1 (I’ve highlighted the additional cells it depends on):

Just as we can’t see at a glance which cells C1 needs, Excel can’t easily decide that either. Offset is therefore volatile: if it weren’t, there is a danger that Excel would take so long working out whether it needs to be recalculated that it might as well always recalculate it.

There is an easy solution to this: INDEX. Here’s the formula definition (be careful, there are 2 ways to use Index; we want the REFERENCE one):

=Index(reference,row_num,column_num,area_num)
Returns a value or reference of the cell at the intersection of a 
particular row and column, in a given range.

The big difference, compared to Offset, is that Index is going to return a single cell reference, so you need to use it as part of a range selection A1:Index(…). Here’s the same “Offset” Sum redefined as an “Index”:

C1    =SUM(A1:INDEX(A1:A20,B1,0))

The formula is simply saying the range we want starts at A1 and goes down the number of rows set in B1. The crucial difference is that the Index function knows that A1:A20 is the maximum range we are likely to look at, and therefore the dependencies are known just by looking at the formula itself:

We can now update the Named Range to use the Index function instead:

=Sheet1!$C$6:INDEX(Sheet1!$C$6:$C$17,Sheet1!$D$2,0)

Now/Today

The Now and Today functions return the current date (and, for Now, the time) to a cell – this is generally used so that when a report is loaded it will always show the data based on “Today”. Whilst this is not an unreasonable thing to want to do, in reality what most people want is for the report to run for the most recent data, which could actually mean a number of things:

  • Yesterday (if the data is built in a nightly process)
  • The last working day (if the source transactional system is only used during office hours)
  • Current month etc.

The easiest solution is to let the data determine the date to use – if we use an XLCubed Grid or Query Table to retrieve the data, we can simply set up a grid to retrieve the days/months where there is data:

And use the Sort option “Reverse” to display the most recent data first:

With the grid set to “Refresh on Open”  we know that A6 will always have the most recent date available in the cube and can base the rest of the report off that cell.
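The logic behind this approach is simple enough to sketch in Python. The rows below are invented for illustration and stand in for whatever the grid returns; the sketch just shows that the report date comes from the data rather than from the system clock.

```python
from datetime import date

# Hypothetical rows as a grid might return them: (day, sales).
# None means no data has been loaded for that day yet.
rows = [
    (date(2011, 1, 3), 120.0),
    (date(2011, 1, 4), 95.5),
    (date(2011, 1, 5), None),
]

# Keep only the days that have data, most recent first (the "Reverse" sort).
days_with_data = sorted((d for d, v in rows if v is not None), reverse=True)

# The top row plays the role of cell A6 in the report.
report_date = days_with_data[0]
print(report_date)  # 2011-01-04
```

Nothing here needs a volatile function: the date only changes when the data is refreshed.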

Incidentally, V6.2 of XLCubed introduces a new option to Slicers to automatically select the most recent date member when the report is loaded:

Common Analysis Services Performance Issues

A quick blog post from the Services team here at XLCubed on some performance problems with SSAS that we’ve seen again recently. With the processing power and memory available it’s pretty easy to build a fast cube – both for query performance and processing time. It is also easy to be lax in cube design, ignore the warnings and best practice guidelines, and end up with a cube that looks concise, neat and clever but performs terribly for end users.

We’ve come across a couple of examples of this at client sites in the last month, and there are some common issues that always seem to jump out – rectifying these normally has a very positive impact. The three most common culprits we see are:

Parent-Child dimensions – Parent-Child dimensions are nice and easy to build and use. However, as you can’t build aggregations that include a parent-child dimension, it can make for a badly performing cube! Try to flatten dimensions out and evaluate exactly why a parent-child dimension is required and being used. They are not the only option.

Unary operators, custom rollups – we’ve seen cases where these have been included in every dimension in a cube by default. If there isn’t a need for them – leave them out! If you can get around using a custom rollup or unary operator by some simple work in the ETL process, it may be better to do that first.

If your query performance is bad – try removing all unary operators and custom rollups, then re-test the cube. How’s the performance now? It should be significantly faster – evaluate and review the need for the unary operators and custom rollups and see if the same effect can be achieved differently (e.g. in the ETL layer).

Cache vs. Non-Cache Data – Basically, is the cube recalculating and re-querying numbers over and over again, or can it re-use results? Use Profiler to check for cache or non-cache data when your queries are running. So many times we’ve seen queries not using the cache at all because AS hasn’t been given enough available memory, or because volatile functions such as Now() have been used in MDX calcs.

Resolving the issues above had a massive impact – reports taking up to 3 minutes to run were down to a few seconds, and users could begin to use the application properly for the first time. However, fixing the performance may be only part of the task. The cube of course needs to have been designed to meet the business requirements, but that’s another blog.