Compression Estimates That Don’t Make Sense

"Statistics means never having to say you're certain" is a well-known quote that would often come up in my AP Statistics class back in high school. The same can be implied for estimation, and while estimates need not be perfect, they should at least be logical. While estimating the effects of compressing data I recently came across a behavior that I don't think Mr. Spock would approve of.

The sp_estimate_data_compression_savings procedure is an excellent tool for estimating how compression will affect an index or partition. It works by sampling 5% of the appropriate pages and applies the desired level of compression on them in TempDB. The results are then extrapolated to give an estimated size for the entire index/partition with that level of compression applied. It's not perfect, but again it's an estimate. I recently found however that depending on the datatypes present in the table, that estimate can get very bad. I noticed this in SQL Server 2008R2, but it also shows up in 2012.

For this demo, you'll need the AdventureWorks database or one of its variants. I used AdventureWorks2008R2 and AdventureWorks2012 in my tests. You'll also need tables a bit larger than AdventureWorks provides, so I use Adam Machanic's bigAdventure script to create the dbo.bigTransactionHistory table we'll be working with.

Once dbo.bigTransactionHistory is set up, we're going to focus on it's nonclustered index which is created in the script along with the table. Let's rebuild this index with a fill factor of 100 so there's as little fragmentation as possible.

1ALTER INDEX IX_ProductID_TransactionDate ON dbo.bigTransactionHistory
2REBUILD WITH (FILLFACTOR=100);

Now check the estimated size of it using all 3 compression settings: NONE, ROW, and PAGE.

1EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;
2
3EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;
4
5EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;
Screenshot1
(click to enlarge)

The index is currently not compressed, and you can see that the first result set's estimated size (with no compression) is close to the current size. It won't be exactly the same and your values will vary, but the main idea here is that using no compression gives an estimate closest to the current value and row and page compression yield smaller values yet.

Now let's change things up by adding another column to bigTransactionHistory and giving it a value:

1ALTER TABLE dbo.bigTransactionHistory ADD vbm VARBINARY(MAX) NULL;
2
3UPDATE dbo.bigTransactionHistory SET vbm = CONVERT(VARBINARY(MAX), '0xAB');

Then check the estimated sizes again just like before:

1EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;
2
3EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;
4
5EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;
Screenshot2
(click to enlarge)

Look at that first result (with no compression) – it's now way smaller than the current size. It's actually even smaller than the estimated sizes for row and page compression, and it all seems to be due to the VARBINARY(MAX) column. If you remove that column and rebuild the table, the estimated size with no compression will be back to where it should be.

Strange, huh? I can only assume this has something to do with VARBINARY being a type that doesn't get compressed, but I don't see how that should affect the estimation.

BOL states that "If the requested compression setting is same as the current compression setting, the stored procedure will return the estimated size with no data fragmentation and using the existing fill factor." I can't see the assumption of no fragmentation being responsible for the estimated size shrinking so much and being smaller than the estimates using compression. Compression can increase the size of an object in certain cases, but the differences here are just too great.

I have submitted this on MS Connect – feel free to vote up if you can reproduce it!