Wednesday, July 31, 2013

Compression Becoming More Important in the Age of Big Data

DBAs and database professionals have been aware of the pros and cons of compressing data for years. The traditional argument goes something like this: with compression you can store more data in less space, but at the cost of incurring CPU overhead to compress the data upon insertion (and modification) and to decompress it upon reading. Over time, the benefits of compression have grown as compression algorithms became more robust, hardware-assist chips became available to speed up compression, and the distributed model of computing made transmitting data across networks a critical piece of the business transaction (and transmitting compressed data is more efficient than transmitting uncompressed data).
IBM has significantly improved compression in DB2 for z/OS over the years. In the early days of mainframe DB2, no compression capability came with DB2 out-of-the-box -- the only mechanism for compressing data was via an exit routine (EDITPROC). Many software vendors developed and sold compression routines for DB2. Eventually, IBM began shipping a sample compression routine with DB2. And then in DB2 Version 3 (1993) hardware-assisted compression was introduced. With the hardware assist, the CPU used by DB2 compression is minimal and the cons list gets a little shorter.
Indeed, one piece of advice I give to most shops when I consult for them is that they probably need to look at compressing more of their data. Compressed data can actually improve performance these days because, in many cases, you can fit more rows per page, so scans and sequential processes can read more data with the same number of I/Os. Of course, you should use the DSN1COMP utility to estimate the amount of savings that can accrue via compression before compressing any existing data.
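To give a rough idea of what is involved (the database and table space names here are just placeholders), turning on compression in DB2 for z/OS is a simple DDL change, after which the data must be reorganized to build the compression dictionary:

-- Estimate the potential savings first with the DSN1COMP standalone utility,
-- then enable compression for the table space:
ALTER TABLESPACE MYDB.MYTS COMPRESS YES;

-- Existing rows are not compressed until a compression dictionary is built,
-- typically by running the REORG (or LOAD REPLACE) utility on the table space.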
Eventually, in DB2 9, we even got index compression capability (using, of course, different technology than data compression). At any rate, compressing data on DB2 for z/OS is no longer the “only-if-I-have-to” task that it once was.
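Index compression is enabled in much the same way; a quick sketch, using a made-up index name:

ALTER INDEX MYSCHEMA.MYIX COMPRESS YES;

-- In DB2 9 for z/OS a compressed index must use a buffer pool with a page size
-- larger than 4K (8K, 16K, or 32K), and the index must be rebuilt after the ALTER.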
Then along comes the Big Data phenomenon where increasingly large data sets need to be stored and analyzed. Big Data is typified by data sets that are so large and complex that traditional tools and database systems are ill-suited to process them. Clearly, compressing such data could be advantageous… but is it possible to process and compress such large volumes of data?
New alternatives to traditional systems are being made available that offer efficient resource usage based on principles of compressed sensing and other techniques. One example of this new technology is IBM’s BLU Acceleration, which is included in DB2 10.5 for Linux, Unix, and Windows. One feature of BLU Acceleration is extended compression, which eliminates the need for indexes and aggregates and operates directly on compressed data, thereby avoiding the CPU time that would otherwise be required to decompress it. Advanced encoding maximizes compression while preserving the order of the encoded values, so compressed data can be analyzed quickly without being decompressed. It is an impressive technology, and no changes are required to your existing SQL statements.
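For illustration only (the table, column, and query below are hypothetical), taking advantage of BLU Acceleration boils down to creating column-organized tables; the SQL you run against them stays the same:

-- Create a column-organized (BLU Acceleration) table in DB2 10.5 for LUW.
-- With the registry variable DB2_WORKLOAD=ANALYTICS set, column organization
-- is the default and the ORGANIZE BY COLUMN clause can be omitted.
CREATE TABLE SALES
  (SALE_DATE  DATE,
   PRODUCT_ID INTEGER,
   AMOUNT     DECIMAL(11,2))
  ORGANIZE BY COLUMN;

-- Existing queries run unchanged against the compressed, column-organized data.
SELECT PRODUCT_ID, SUM(AMOUNT)
  FROM SALES
 GROUP BY PRODUCT_ID;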
IBM reports that some clients using DB2 10.5 for LUW with BLU Acceleration have achieved compression rates 10 times greater than with uncompressed tables.
Of course, BLU Acceleration is much more than compression (it combines in-memory, columnar, and compression technologies), but for the purposes of today’s blog entry we won’t delve deeper into the technology. If you are interested in a little bit more on BLU, read my high-level overview in my coverage of this year’s IDUG DB2 Technical Conference.

So compression is becoming cool… who’d have thought that back in the 1980s when compression was something we only did when we absolutely had to?
