A Dynamic Duo: Data Compression and Backup Compression

When I'm presenting on SQL Server data compression, I frequently get asked if data compression and backup compression can be used together. I keep meaning to put together a blog post with a demo so I have something to direct people towards for further reading. So here we are – let's give it the old college try and find out!

Compression Primer

Methods for compressing data fall into 2 categories: lossy and lossless. Lossy compression reduces the size of data by strategically removing parts of it that are unlikely to be missed. This data is lost forever, hence the term lossy. Prime examples of lossy compression can be found in digital images, videos, and music. It's very easy to remove a little color depth from an image or reduce the bitrate of a sound or video file in a way that shrinks its size without significantly changing its appearance to an observer. Lossy compression is great in these cases because our eyes and ears are far from perfect, but for things like a document or a database it's not a good choice. Nobody would be happy if data randomly started disappearing to save space (well, maybe the DBA would be, but only if the data that's disappearing is those copies of tables with "old" in the name that nobody has looked at in years…).

For types of data where nothing can be lost, there's lossless compression. Lossless compression can't remove anything, so it generally does things like build dictionaries of repeating patterns and create pointers to the dictionary when appropriate. Centralizing duplicate data in a dictionary means multiple copies no longer need to be stored, so the compressed version will be smaller. Both SQL Server data compression and backup compression use lossless compression methods.

"BIFF", "POW", and "ZONK" not shown
Data Compression and Backup Compression, circa 1966.

SQL Server data compression and backup compression are completely separate features; they do very different things in different ways. At a high level, data compression reduces the size of records in memory (and on disk as well), allowing more records to fit on a data page thereby decreasing the space needed to store the data. Backup compression compresses the database as it's being written to a backup file, which means backups will require less storage space. They build off each other well because data compression can only do so much: it can de-duplicate data on a single page, but not across multiple pages. Even with data compression enabled in page mode, its highest setting, it's very likely that similar data will exist on multiple pages. Backup compression can pick up where data compression leaves off and compress things a bit more by eliminating those duplicates. Like Batman and Superman, they combine their powers for good, not evil. If you have the ability to compress both data and backups, it's definitely worth your time experimenting to see if your environment benefits from using them together.
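
If you're wondering whether data compression will pay off for a particular table in your own environment, SQL Server can estimate the savings before you rebuild anything. Here's a minimal sketch using sys.sp_estimate_data_compression_savings (Sales.SalesOrderDetail is just an example table; point it at whatever you're considering compressing):

USE AdventureWorks2012;

-- Estimate how much space PAGE compression would save on every index of a table
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = 'Sales',
    @object_name = 'SalesOrderDetail',
    @index_id = NULL,            -- NULL = all indexes on the table
    @partition_number = NULL,    -- NULL = all partitions
    @data_compression = 'PAGE';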

Demo Time!

Let's take a look at how they complement each other with a demo. I used the AdventureWorks2012 database, which is available for download if you don't already have it.

The AdventureWorks databases are great for most examples, but none of the included tables are very big, and to make things more apparent I wanted to show some larger tables. Fortunately, several tools to increase the size of tables have been written over the past few years. I used the script written by Jonathan Kehayias. This script creates 2 new tables: Sales.SalesOrderDetailEnlarged and Sales.SalesOrderHeaderEnlarged. Running the entire script the first time and then the INSERT block 3 additional times will bring the total size of the AdventureWorks2012 database up to about 4.5 GB.
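
If you want to confirm the database really did grow to roughly that size after the extra INSERT runs, a quick sanity check is:

USE AdventureWorks2012;

-- Report total database size and unallocated space
EXEC sp_spaceused;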

Once our tables have been enlarged, let's rebuild their clustered indexes and set the fill factor to 100. This will ensure that fragmentation is minimized so we know we're comparing apples to apples when looking at sizes later.

ALTER INDEX [PK_SalesOrderDetailEnlarged_SalesOrderID_SalesOrderDetailID]
ON [Sales].[SalesOrderDetailEnlarged]
REBUILD WITH (SORT_IN_TEMPDB=ON,FILLFACTOR=100);

ALTER INDEX [PK_SalesOrderHeaderEnlarged_SalesOrderID]
ON [Sales].[SalesOrderHeaderEnlarged]
REBUILD WITH (SORT_IN_TEMPDB=ON,FILLFACTOR=100);
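
If you'd like to verify that the rebuilds left the clustered indexes defragmented, a quick look at sys.dm_db_index_physical_stats (a sketch, limited to the two enlarged tables) will do it:

USE AdventureWorks2012;

-- Check fragmentation and page counts for the enlarged tables' clustered indexes
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE OBJECT_NAME(ips.object_id) IN ('SalesOrderDetailEnlarged','SalesOrderHeaderEnlarged')
  AND ips.index_id = 1;   -- clustered indexes only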

We'll also want to shrink the transaction log. Enlarging tables and rebuilding indexes grew the log, and the restore portion of this demo will give a more accurate picture, especially in terms of time, if the log is as small as possible.

DBCC SHRINKFILE('AdventureWorks2012_Log',1);
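
To confirm the shrink actually brought the log back down, you can check the file size directly (sizes in sys.database_files are stored in 8 KB pages, hence the math):

USE AdventureWorks2012;

-- Show the current size of the transaction log file in MB
SELECT name, size * 8 / 1024 AS size_mb
FROM sys.database_files
WHERE type_desc = 'LOG';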

Now for the data collection part. We'll take a total of 4 backups of the AdventureWorks2012 database. For each backup we want to monitor:

  • the size of the backup file
  • the time the backup took
  • average CPU utilization while the backup was running

The size of the backup file is easy – we can just look at the file's properties in Windows after the fact to find it.

The elapsed time of the backup is also fairly easy. When you run a backup in SQL Server Management Studio, the "Messages" window will return some information about the backup, including the time it took to run:
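
If you'd rather not copy numbers off the screen, both the size and the duration of each backup are also recorded in msdb. A query like this against msdb.dbo.backupset (sizes are stored in bytes, converted to MB below) returns everything we need for recent AdventureWorks2012 backups:

-- Size, compressed size, and duration of recent AdventureWorks2012 backups
SELECT bs.backup_start_date,
       DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_seconds,
       bs.backup_size / 1048576.0 AS backup_size_mb,
       bs.compressed_backup_size / 1048576.0 AS compressed_size_mb
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = 'AdventureWorks2012'
ORDER BY bs.backup_start_date DESC;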

Average CPU utilization is a little more work because SSMS doesn't tell you about it. I use Perfmon since it's built into Windows. When first opening Perfmon, the counter it shows by default is "% Processor Time". This may seem like the counter you'll want to use, but it isn't. % Processor Time includes all operations on the processor, many of which are kernel processes we can't control. There's another counter called "% User Time" that will only show user processes. This still isn't perfect; there will undoubtedly be other user processes running, but with all applications closed this counter is usually pretty close to zero.

The "Average" box will show the average CPU utilization since the counter was last cleared. For the most accurate value you will want to clear the counter immediately before starting a backup by right-clicking inside the chart and selecting "clear".

Now to take some backups and monitor the results using the above methods. The first backup will use the default settings with backup compression disabled; we'll enable it for the second.

BACKUP DATABASE [AdventureWorks2012]
TO DISK = '_path_\AW_NoDataCompression_NoBackupCompression.bak';

BACKUP DATABASE [AdventureWorks2012]
TO DISK = '_path_\AW_NoDataCompression_YesBackupCompression.bak'
WITH COMPRESSION;
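
One caveat: a BACKUP statement without a compression option follows the instance-wide "backup compression default" setting, so if that's been turned on at the server level the first backup above won't really be uncompressed. You can check the setting, or simply be explicit in the command (a hypothetical variation of the first backup above):

-- Show the instance-wide backup compression default (run_value 0 = off, 1 = on)
EXEC sp_configure 'backup compression default';

-- Or state the intent explicitly so the server default doesn't matter
BACKUP DATABASE [AdventureWorks2012]
TO DISK = '_path_\AW_NoDataCompression_NoBackupCompression.bak'
WITH NO_COMPRESSION;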

Now let's add data compression to the mix by enabling it on the clustered indexes of the 2 enlarged tables. We could enable it on all indexes, but these 2 represent the vast majority of the database's size. I used page compression to get the maximum possible effect, and once again set the fill factor to 100.

ALTER INDEX PK_SalesOrderDetailEnlarged_SalesOrderID_SalesOrderDetailID
ON Sales.SalesOrderDetailEnlarged
REBUILD WITH (DATA_COMPRESSION=PAGE,SORT_IN_TEMPDB=ON,FILLFACTOR=100);

ALTER INDEX PK_SalesOrderHeaderEnlarged_SalesOrderID
ON Sales.SalesOrderHeaderEnlarged
REBUILD WITH (DATA_COMPRESSION=PAGE,SORT_IN_TEMPDB=ON,FILLFACTOR=100);
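
To double-check that page compression is now in place, sys.partitions exposes the compression setting for every partition:

USE AdventureWorks2012;

-- Confirm the enlarged tables' indexes are now PAGE compressed
SELECT OBJECT_NAME(p.object_id) AS table_name,
       i.name AS index_name,
       p.data_compression_desc
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
   AND i.index_id = p.index_id
WHERE OBJECT_NAME(p.object_id) IN ('SalesOrderDetailEnlarged','SalesOrderHeaderEnlarged');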

The rebuild operations necessary for compression will again grow the log, so we will shrink it again for consistency, and then run 2 more backups – with and without backup compression.

DBCC SHRINKFILE('AdventureWorks2012_Log',1);

BACKUP DATABASE [AdventureWorks2012]
TO DISK = '_path_\AW_YesDataCompression_NoBackupCompression.bak';

BACKUP DATABASE [AdventureWorks2012]
TO DISK = '_path_\AW_YesDataCompression_YesBackupCompression.bak'
WITH COMPRESSION;

At this point we have 4 backup files:

  • No data or backup compression
  • Backup compression but not data compression
  • Data compression but not backup compression
  • Both data and backup compression

I also restored each of these backups and captured the restore time and average CPU for each. I won't cover that here, but feel free to try restoring on your own.
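
If you'd like to try the restores yourself, a minimal sketch looks like this (REPLACE overwrites the existing database, and STATS = 10 prints progress every 10 percent; adjust the path and add MOVE clauses if you're restoring somewhere else):

USE master;

-- Restore one of the backups over the existing database
RESTORE DATABASE [AdventureWorks2012]
FROM DISK = '_path_\AW_YesDataCompression_YesBackupCompression.bak'
WITH REPLACE, STATS = 10;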

Now that we've completed our trials, let's look at what we have. Since we're talking about compression, the size of the backup files is what we're most interested in. Here are the results I observed:

As you can see, the backup with no compression at all is the largest. Data compression yields a smaller backup file, backup compression gives us an even smaller file, and combining both data and backup compression is slightly smaller yet. This is great news, as it shows that backup and data compression can work together to produce a result that's smaller than either can do on their own.

As for backup and restore times, I would expect the results to vary based on the hardware being used. I wrote my backups out to an external USB hard disk and restored from there as well.

The backup and restore times correlate pretty closely with the backup sizes above. This is no surprise – the vast majority of time for backup and restore operations is spent waiting on I/O either reading the database from disk or writing the backup to disk. The less data SQL Server has to read or write due to compression, the faster the operations will complete.

Here's the average CPU usage for the duration of the backup and restore operations. This is a bit more interesting, but still makes sense. Backups and restores that don't involve backup compression use very little CPU; these operations are mostly I/O. Processor usage shoots up when backup compression is enabled because there's a lot more work involved with compressing and decompressing the backup. Based on the above results, compressing a backup stream is more CPU-intensive than decompressing it for a restore. It also looks like slightly less CPU is required to compress the backups of databases that use data compression than of those that don't. I imagine this is because the page-level data compression we used already did much of the work of reducing duplicated values, so there's slightly less left to do when it comes time to compress the backup. Keep in mind that backup compression only elevates CPU utilization while backups or restores are running, whereas data compression pays its CPU cost during normal DML operations rather than at backup time.

So what's the catch?

Everything we just did shows us that there's an advantage to using data compression and backup compression together. The differences weren't very significant in this case, only about 100MB for overall backup size, but I tried this with many datasets and the advantage only became more apparent as the sizes increased. It sounds like a total win-win, but the saying goes that there's no free lunch in computing, and that holds true for compression as well.

The first catch is cost. Data compression is only available in SQL Server Enterprise Edition, and it's not cheap.

After that, the biggest thing to look out for is CPU utilization. Both data and backup compression involve extra work for your CPU, and if your CPU utilization is already high, you may find that data and/or backup compression perform poorly.

Final Thoughts

Both backup and data compression can provide tremendous advantages in terms of performance and reduced storage space, but these advantages are highly dependent on the data they are manipulating. Be sure to take some time and test whether your environment comes out ahead by using them either individually or together. Based on what I've seen, it probably will, but it's always better to make sure beforehand rather than turning something on and seeing no benefit or, even worse, a performance hit.