T-SQL Tuesday #27: Why I Love Big Data

Feb 14, 2012 · TSQL Tuesday

Happy T-SQL Tuesday to all! (and happy Valentine's Day too!) This month's T-SQL Tuesday comes to us courtesy of Steve Jones (blog | @way0utwest). Steve is asking us to write about "big data" – specifically the problems we've solved or interesting ways we've found to work with it. In honor of the holiday, I'm going to take a slight tangent from that and talk about why I love working with it instead.

First, let's tackle the term. "Big data" is interesting in that I've yet to see a universally accepted definition for it. I've heard lots of different numbers from lots of different people. To me, a database of roughly 5-10 terabytes or more qualifies as a very large database (VLDB) and thus as big data.

Working with big data presents some very interesting challenges that will let your skills as a DBA shine. Whether they shine like gold or like NASA's super-black material will depend on your skillset. VLDBs are very good at exposing poorly performing queries simply because of their size. In a smaller database, a slow query (perhaps due to a lack of proper indexing) might be tolerated if it still completes in a few minutes. When you're dealing with big data, however, a table scan might take hours, which makes issues such as indexing far more critical to acceptable performance. DBA tasks such as backups and consistency checks also take much longer with large amounts of data, so new techniques might be necessary for them to complete in a reasonable amount of time. SQL Server native backups might not be feasible because of the time or disk space they require, so other methods such as SAN snapshots may be needed. Integrity checks may take a very long time to run, or may be impossible to run at all because of the load they place on a production server, meaning they have to be run on another machine against a restored backup. This is by no means an exhaustive list; there are plenty more ways that increasingly large quantities of data can cause issues.
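To put rough numbers on the "minutes versus hours" point, here's a back-of-the-envelope sketch in Python. The 200 MB/s sequential read rate and the 50 GB "small" table are purely assumptions I picked for illustration, not measurements from any real system, and the function name is made up for this example. It also ignores caching, parallelism, and compression, so treat it as a ballpark only.

```python
def full_scan_seconds(table_size_gb: float, read_mb_per_sec: float) -> float:
    """Estimate the seconds needed to read an entire table sequentially.

    Deliberately naive: no buffer cache, no parallel scans, no compression.
    """
    if read_mb_per_sec <= 0:
        raise ValueError("read_mb_per_sec must be positive")
    # Convert GB to MB, then divide by the sustained read rate.
    return table_size_gb * 1024 / read_mb_per_sec


# Assumed sustained sequential read rate (hypothetical, for illustration only).
ASSUMED_READ_MB_PER_SEC = 200.0

small_table = full_scan_seconds(50, ASSUMED_READ_MB_PER_SEC)         # 50 GB table
large_table = full_scan_seconds(10 * 1024, ASSUMED_READ_MB_PER_SEC)  # 10 TB table

print(f"50 GB table: about {small_table / 60:.1f} minutes")
print(f"10 TB table: about {large_table / 3600:.1f} hours")
```

Under those assumptions the 50 GB table scans in a bit over four minutes, while the 10 TB table takes more than 14 hours. That's the difference between a query people grumble about and one nobody can afford to run.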

The issues above, along with many others, demand solutions, and another reason I love big data is that it has driven a lot of innovation in information technology over the past few years. Non-relational databases have existed for a while, but they have recently grown in popularity and in number of offerings because of the increasing need to handle big data. As reliable and popular as relational databases are, the need for databases that give up some relational guarantees in exchange for better handling of extremely large datasets gave birth to the NoSQL movement, which will undoubtedly continue to reshape the web and the way we work with data at that scale. Some may worry that NoSQL is going to replace their beloved relational databases, but I don't share that belief. The two models address different problems and should be able to coexist, as they already have for years.

To sum it all up, I love a challenge, and since big data has plenty to offer in that department, I love big data too.