Thursday, November 10, 2005

DB2 Compression: z/OS versus LUW

Space compression for non-mainframe DB2 is quite a bit different from what it is for DB2 for z/OS. In mainframe DB2, specifying COMPRESS YES on the CREATE TABLESPACE statement will cause DB2 to implement Ziv-Lempel compression for the table space in question. Data is compressed upon entry to the database and decompressed when it is read.

For DB2 UDB on Linux/Unix/Windows, when creating a table, you can use the optional VALUE COMPRESSION clause to specify that the table uses the space-saving row format at the table level and possibly at the column level. There are two ways in which tables can occupy less space when stored on disk:
  • If the column value is NULL, do not set aside the defined, fixed amount of space.
  • If the column value can be easily known or determined (like default values) and is available to the database manager during record formatting and column extraction, do not store the value itself.
When VALUE COMPRESSION is used, NULLs and zero-length data that has been assigned to defined variable-length data types (VARCHAR, VARGRAPHIC, LONG VARCHAR, LONG VARGRAPHIC, BLOB, CLOB, and DBCLOB) will not be stored on disk. Only overhead values associated with these data types will take up disk space.

If VALUE COMPRESSION is used then the optional COMPRESS SYSTEM DEFAULT parameter can also be specified to further reduce disk space usage. Minimal disk space is used if the inserted or updated value is equal to the system default value for the data type of the column. The default value will not be stored on disk. Data types that support COMPRESS SYSTEM DEFAULT include all numerical type columns, fixed-length character, and fixed-length graphic string data types. This means that zeros and blanks can be compressed.

The two platforms vary dramatically in how they approach "compression." The mainframe actually applies an algorithm to the data to compress it into another format. Every row that is inserted must first be compressed before storing it; every row that is read must be decompressed. On LUW platforms, DB2 compression is simply a way of avoiding the storage of certain types of data that either can be determined easily, or need not be stored.

So, it is highly probable that you will get completely different results on LUW than you do on a mainframe (OS/390, z/OS). Which one is better will depend on the type of data you are storing and on the requirements of your applications.

So, when should you consider using compression? In general, use DB2 for z/OS compression for larger tablespaces where the disk savings can be significant. For very small tables, the amount of space required to store the compression dictionary may exceed the space saved by compressing the data.

What is the compression dictionary? Well, as I mentioned earlier, DB2 for z/OS compression is enabled by specifying COMPRESS YES for the tablespace in your DDL. When compression is specified, DB2 builds a static dictionary to control compression. As a result, from 2 to 17 dictionary pages are stored in the tablespace. These pages are stored after the header and first space map page.
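To make the small-table caveat concrete, here is a rough Python sketch of the break-even arithmetic. It uses only the 2-to-17-page dictionary range mentioned above; the 800-byte row, the 30 percent savings, the nominal 4,096-byte page, and the function names are illustrative assumptions, and real DB2 pages also reserve some bytes for overhead that this sketch ignores.

```python
PAGE_BYTES = 4096  # assumption: nominal 4K page, page overhead ignored


def pages_needed(row_count, row_bytes, page_bytes=PAGE_BYTES):
    """Pages required when only whole rows fit on a page."""
    rows_per_page = page_bytes // row_bytes
    return -(-row_count // rows_per_page)  # integer ceiling division


def net_pages_saved(row_count, row_bytes, percent_saved, dictionary_pages):
    """Pages saved by compression after paying for the dictionary pages.

    A negative result means the compressed table space is larger.
    """
    compressed_bytes = row_bytes * (100 - percent_saved) // 100
    before = pages_needed(row_count, row_bytes)
    after = pages_needed(row_count, compressed_bytes) + dictionary_pages
    return before - after


# Hypothetical 50-row table, worst-case 17 dictionary pages
print(net_pages_saved(50, 800, 30, 17))      # -15: compression costs space
# Hypothetical 10,000-row table, same dictionary size
print(net_pages_saved(10_000, 800, 30, 17))  # 554: compression saves space
```

Under these assumptions, the dictionary dominates for a table of only a few pages and becomes negligible as the table grows.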

For partitioned tablespaces, DB2 will create a separate compression dictionary for each tablespace partition. Multiple dictionaries tend to yield better overall compression ratios. In addition, partition-level compression dictionaries can usually be rebuilt more frequently than the dictionary for a non-partitioned tablespace. Frequent rebuilding of the compression dictionary can lead to a better overall compression ratio.

Avoid compressing table spaces with multiple tables in them because the compression ratio can be impacted by the different types of data in the multiple tables, and DB2 can only have one compression dictionary per table space.

But why compress data at all? Consider an uncompressed table with a large row size, say 800 bytes. Five of this table's rows fit on a 4K page. If the compression routine achieves 30 percent compression, on average, the 800-byte row uses only 560 bytes, because 800 - (800*0.3) = 560. Now, on average, seven rows fit on a 4K page. Because I/O occurs at the page level, the cost of I/O is reduced: fewer pages must be read for tablespace scans, and the data is more likely to be in the bufferpool because more rows fit on a physical page. This can be a significant I/O improvement.

Consider the following scenarios. A 10,000-row table with 800-byte rows requires 2,000 pages. Using a compression routine as outlined previously, the table would require only 1,429 pages. Another table also with 800-byte rows but now having 1 million rows would require 200,000 pages without a compression routine. Using the compression routine, you would reduce the pages to 142,858 - a reduction of more than 57,000 pages.
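The page counts in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes a nominal 4,096-byte page with no page overhead and whole rows per page, and the function names are made up for illustration.

```python
PAGE_BYTES = 4096  # assumption: nominal 4K page, page overhead ignored


def compressed_row_bytes(row_bytes, percent_saved):
    """Average row size after saving the given percentage."""
    return row_bytes * (100 - percent_saved) // 100


def rows_per_page(row_bytes, page_bytes=PAGE_BYTES):
    """Whole rows that fit on one page."""
    return page_bytes // row_bytes


def pages_needed(row_count, row_bytes, page_bytes=PAGE_BYTES):
    """Pages required to hold row_count rows (integer ceiling division)."""
    return -(-row_count // rows_per_page(row_bytes, page_bytes))


small = compressed_row_bytes(800, 30)
print(small)                                   # 560
print(rows_per_page(800), rows_per_page(small))  # 5 7
print(pages_needed(10_000, 800), pages_needed(10_000, small))        # 2000 1429
print(pages_needed(1_000_000, 800), pages_needed(1_000_000, small))  # 200000 142858
```

The difference for the million-row table is 200,000 - 142,858 = 57,142 pages.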

Another question I am commonly asked is about overhead. Yes, there is going to be some overhead involved if you turn on compression... CPU is required to apply the Ziv-Lempel algorithm to compress upon insertion - and to decompress upon access. Of course, this does NOT mean that overall performance will suffer if you turn on compression. Remember the trade-off: additional CPU in exchange for possibly improved I/O efficiency. You see, when more compressed rows fit onto a single page, fewer I/O operations may be needed to satisfy your query processing needs. If you are performing a lot of sequential access (as opposed to random access) you can get improved performance because fewer I/O operations are required to access the same number of rows.

Of course, there is always the other trade-off to consider, too: disk storage savings in exchange for the CPU cost of compressing and decompressing data. Keep in mind, though, that DB2 can use hardware-assisted compression if you have the right type of hardware. Hardware-assisted compression simply speeds up the compression and decompression of data -- it is not a requirement for the inherent data compression features of DB2. So, the overall cost of compression may be minimal with hardware-assisted compression. Indeed, because fewer pages must be read, overall elapsed time for certain I/O-heavy processes may decrease when data is compressed.

You can use the DSN1COMP utility to estimate how much disk space will be saved by compressing a tablespace before deciding whether to turn compression on or not. This utility can be run on full image copy data sets, VSAM data sets that contain DB2 table spaces, or sequential data sets that contain DB2 table spaces (such as DSN1COPY output). DSN1COMP does not estimate savings for data sets that contain LOB table spaces or index spaces. Refer to the IBM Utility Guide and Reference for more information on DSN1COMP.

Of course, before you consider compression be sure to examine all of its details -- and be sure to understand all of the nuances of your particular data and applications. But don't be afraid of investigating its use... compression can be a very handy tool in the DBA's arsenal!

4 comments:

Anonymous said...

I realize that this is a little late but, for what it's worth, (800*0.3)=240 .

The equation should be (800 - (800*0.3)) =560.

Craig S. Mullins said...

Thanks for clarifying that!

Anonymous said...

How do you tell if the compression dictionary has been created?

Tun4Tun said...

DSN1COMP utility can be run to see whether a compression dictionary has been created for a tablespace or not