A Poor Man’s Data Warehouse
T-SQL Tuesday this month is being hosted by the one and only Jes Borland (blog | twitter), and the topic of the month is aggregation. Powerful and fast aggregation tools are one of the biggest reasons to use a database. Without them, there would be a lot less awesomeness to look forward to – kind of like a meal without bacon.
Even almighty bacon has its downsides (fat, sodium, etc.), and as awesome as aggregates are, they have weaknesses too. The issue with aggregates is that, by their nature, all applicable values must be scanned to generate the result. For instance, you can't figure out the average of a set of numbers without scanning the entire set. The same holds true for the sum, min, max, and other statistics. This means that the time it takes to calculate an average grows with the size of the data set. As sets get increasingly large, performance can degrade to the point where it's no longer acceptable, in which case other tools may be necessary to deliver the requested aggregate information in an appropriate amount of time.
One such tool is a data warehouse, which can be implemented using SQL Server Analysis Services (SSAS). SSAS has its advantages and disadvantages like any product, and probably the biggest thing working against it is its learning curve. There's also the issue that it's only available in Standard Edition and above, and you don't get all the real goodies unless you're using Enterprise Edition.
A few times I've been able to get the desired results with a much lower-tech solution than SSAS, which I call the "Poor Man's Data Warehouse". It's a table of pre-aggregated data. (I really wasn't kidding when I said it was low-tech!) Basically, another table is created to store the data in its already-aggregated form. Rather than querying the "primary" tables and performing the costly aggregations on them, these pre-aggregated tables can be used instead, typically with much faster results because fewer records need to be scanned to arrive at the answer.
For some sample data I'll be using the Production.TransactionHistory table from the AdventureWorks2008 database. If you don't already have AdventureWorks2008 loaded somewhere, you can download it from here. Querying the entire table shows there's a little over 113,000 rows. Let's say that we're only interested in the total number of each product sold each month. First let's create a table for storing the aggregates:
CREATE TABLE Production.TransactionHistory_PMDW (
    MonthStart DATE,
    ProductID INT,
    TotalQty INT,
    CONSTRAINT PK_TransactionHistory_PMDW PRIMARY KEY CLUSTERED (MonthStart, ProductID)
);
You'll notice this table only contains the bare minimum for storing this aggregate data: the start of the month (stored as a DATE), the ProductID, and the total quantity of that product sold during that month. Aggregating by another field in addition to month and product would just mean adding another column.
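For example, if you also needed to break the numbers out by sales territory, a hypothetical variation on the table might look something like this (TerritoryID here is purely an illustration, not a column from the example above):
-- Hypothetical variation: aggregate by month, product, AND territory.
-- TerritoryID is illustrative only; the key follows the same pattern as before.
CREATE TABLE Production.TransactionHistory_PMDW_ByTerritory (
    MonthStart DATE,
    ProductID INT,
    TerritoryID INT,
    TotalQty INT,
    CONSTRAINT PK_TransactionHistory_PMDW_ByTerritory
        PRIMARY KEY CLUSTERED (MonthStart, ProductID, TerritoryID)
);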
Now that the table has been created it needs to be filled with aggregate data, which can be accomplished with the following:
INSERT INTO Production.TransactionHistory_PMDW (MonthStart, ProductID, TotalQty)
SELECT DATEADD(dd, -(DAY(TransactionDate)-1), TransactionDate) AS MonthStart, ProductID, SUM(Quantity) AS n
FROM Production.TransactionHistory
GROUP BY DATEADD(dd, -(DAY(TransactionDate)-1), TransactionDate), ProductID;
The above code will aggregate the entire contents of the Production.TransactionHistory table and load it into Production.TransactionHistory_PMDW. Since this is a monthly aggregation, the statement would only need to be run once per month, and with a few modifications it could add only the previous month's data to the TransactionHistory_PMDW table. From here there are all kinds of things you can do, such as manipulating the date grouping to get aggregations of a different granularity.
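As a rough sketch (not a drop-in script), that incremental load might look something like this, assuming it runs shortly after the month ends; the variable names are just for illustration:
-- Rough sketch of an incremental load for the previous month.
-- Assumes this runs sometime during the month after the one being loaded.
DECLARE @MonthStart DATE = DATEADD(mm, DATEDIFF(mm, 0, GETDATE()) - 1, 0); -- first day of last month
DECLARE @MonthEnd   DATE = DATEADD(mm, 1, @MonthStart);                    -- first day of the current month

INSERT INTO Production.TransactionHistory_PMDW (MonthStart, ProductID, TotalQty)
SELECT @MonthStart, ProductID, SUM(Quantity)
FROM Production.TransactionHistory
WHERE TransactionDate >= @MonthStart AND TransactionDate < @MonthEnd
GROUP BY ProductID;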
The advantages of pre-aggregating become apparent rather quickly once you start retrieving data. The following batch will query both the original and aggregate tables for total sales of all products during the month of May, 2004:
-- original table
SELECT DATEADD(dd, -(DAY(TransactionDate)-1), TransactionDate) AS MonthStart, ProductID, SUM(Quantity) AS n
FROM Production.TransactionHistory
WHERE TransactionDate >= '20040501' AND TransactionDate < '20040601'
GROUP BY DATEADD(dd, -(DAY(TransactionDate)-1), TransactionDate), ProductID
ORDER BY MonthStart, ProductID;

-- aggregate table
SELECT MonthStart, ProductID, TotalQty
FROM Production.TransactionHistory_PMDW
WHERE MonthStart = '20040501'
ORDER BY MonthStart, ProductID;
Viewing the execution plans in Management Studio shows that 100% of the query cost goes towards grouping and filtering the original table, whereas scanning the aggregate table is trivial by comparison. Adding the suggested index doesn't level the playing field either – the aggregate is still significantly faster.
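If you'd rather compare actual numbers than estimated plan costs, one quick (if informal) check is to re-run the batch above with I/O and timing statistics turned on; the exact figures will vary with your hardware and indexes, but the gap should be obvious:
-- Informal comparison: enable statistics, re-run the two queries above,
-- then compare logical reads and CPU time in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- (re-run the original-table and aggregate-table queries here)

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;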
So is this incredibly low-tech approach appropriate? The answer, of course, is "it depends." Having used this method a few times, I can say it definitely comes in handy.
Pros
- If the data has already been aggregated you can usually find your answer by scanning fewer rows, which means your results will most likely come back faster and at a lower CPU cost.
- Querying a table that's not "live" (being updated by production processes) translates to fewer locks (hopefully none at all!) on those live tables.
- Since the aggregate table is only loaded at specific times and is separate from any production processes, you can heavily index it with little worry about INSERT or UPDATE operations becoming slower as a result (see the sketch after this list).
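As an example of that last point, if your reports tend to slice the aggregate by product rather than by month, something like this hypothetical index might be worthwhile (the name and column order are illustrative; index to match your actual queries):
-- Hypothetical nonclustered index for product-first lookups on the aggregate table.
-- Purely illustrative; tailor the key/include columns to your workload.
CREATE NONCLUSTERED INDEX IX_TransactionHistory_PMDW_Product
    ON Production.TransactionHistory_PMDW (ProductID, MonthStart)
    INCLUDE (TotalQty);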
Cons
- Since the aggregate table is separate from the "original" (live) table, changes made to the original table after the data has been loaded won't automatically be reflected in the aggregate. The aggregate table would have to be re-loaded for those changes to appear.
Tips
- I've found this works best when you're aggregating by date, since data from the past is (hopefully) unlikely to need updating after-the-fact!