Thursday, March 22, 2012

Working With Variable Data


One of the key design issues that is addressed in most DB2 applications is how to implement character data that varies in size from row to row. The basic trade-off is “ease-of-use and performance” versus “storage requirements.” It is possible to save storage space by using variable columns instead of placing small amounts of data in a large fixed space. Each variable column carries a 2-byte overhead, however, for storing the length of the data. Additionally, variable columns tend to increase CPU usage and can cause the update process to become inefficient. When a variable column is updated with a larger value, the row becomes larger; if not enough space is available to store the row, it must be moved to another page. This makes the update and any subsequent retrieval slower.

Follow these rules when defining variable character columns:
  • Avoid variable columns if a sufficient disk space is available to store the data using fixed columns. The general rule is: disk is cheap! Trading the cost of disk for more efficient development and subsequent performance is often an easy decision to make.
  • Though this may be obvious, do not define a variable column when the data does not vary from row to row.
  • Do not define a variable columns if its maximum length is less than 30 bytes. Furthermore, do not define a variable column if its maximum length is within 10 bytes of the average length of the column. Adhering to these two rules will cause you to choose VARCHAR data types only when they can potentially provide enough DASD savings to offset other costs.
  • Consider redefining variable columns by placing multiple rows of fixed length columns in another table or by shortening the columns and placing the overflow in another table.

If, after following these guidelines, VARCHAR columns need to be implemented, go ahead and do so. However, it is wise to continuously re-assess the decision to use variable character data.

So how can you monitor the effectiveness of variable columns? Well, it is possible to query the DB2 Catalog to determine the effectiveness of using VARCHAR for a column instead of CHAR. Consider, for example, the PROJNAME column of the DSN81010.PROJ table. It is defined as VARCHAR(24).

To gauge whether VARCHAR is appropriate for this particular column, consider issuing the following SQL query:

     SELECT    COL_LGTH, COUNT(*)
     FROM      (SELECT LENGTH(PROJNAME) AS COL_LENGTH
                FROM   DSN81010.PROJ
               )
     GROUP BY  COL_LGTH
     ORDER BY  COL_LGTH;


This query will produce a report listing the lengths (in this case, from 1 to 24, excluding those lengths which do not occur) and the number of times that each length occurs in the table.  These results can be analyzed to determine the range of lengths stored within the variable column. 

If you are not concerned about this level of detail, the following query can be used instead to summarize the space characteristics of the variable column in question:

    SELECT  24*COUNT(*),
            SUM(2+LENGTH(PROJNAME)),
            24*COUNT(*)-SUM(2+LENGTH(PROJNAME)),
            24,
            AVG(2+LENGTH(PROJNAME)),
            24-AVG(2+LENGTH(PROJNAME))
    FROM    DSN81010.PROJ;



This query produces a report similar to the one shown below:

SPACE      SPACE        TOTAL   AVG.       AVG.         AVG.
USED AS    USED AS      SPACE   SPACE AS   SPACE AS     SPACE
CHAR(24)   VARCHAR(24)  SAVED   CHAR(24)   VARCHAR(24)  SAVED
--------   -----------  ------  ---------  -----------  -----
158058        96515     61543     24          16          8



The following list itemizes the definition for each of the individual columns calculated by this query:

Definition                               
Calculation                                                 
Space Used As CHAR(24)
24*COUNT(*)
Space Used As VARCHAR(24)
SUM(2+LENGTH(PROJNAME))
Total Space Saved Using VARCHAR
24*COUNT(*)-SUM(2+LENGTH(PROJNAME))
Avg. Space Used As CHAR(24)
24
Avg. Space Used As VARCHAR(24)
AVG(2+LENGTH(PROJNAME))
Avg. Space Saved Using VARCHAR
24-AVG(2+LENGTH(PROJNAME))

The query can be modified to be used for any VARCHAR-defined column. The constant 24 can be changed to indicate the maximum length of the variable column as defined in the DDL. Using these tools, you can better judge the actual disk space savings accruing as a result of VARCHAR usage.

Synopsis

There are sound reasons for using variable data types within DB2 databases. Yet, it is important to remember that business conditions change and what may have been a sound reason for using VARCHAR data in the past, may no longer be sound. Use the queries in this article to determine if it is still appropriate to use VARCHAR data. In general, do not use VARCHAR for small columns or for columns whose length does not vary considerably.

2 comments:

Abel Martins said...

Hi, Craig,
I agree with your last sentence but I have defended intensive VARCHAR usage because I do not care about any argument against its use, such as page changing, cheap disk, etc.. because my war is against I/O operations.
My motto is "less disk space usage less I/O operations" and I will have a greater chance to maintain data in memory.
I have the following rule to determine datatypes:
- SMALLINT - Integer from 1 to 4 digits (2 bytes)
- DECIMAL - Integer with 5 digits (3 bytes)
- INTEGER - Integer from 6 to 9 digits (4 bytes)
- BIGINT - Integer from 10 to 18 digits (8 bytes)
- DECIMAL - Integer from 19 to 31 digits (length/2+1 bytes)
- DECIMAL - Decimal from 1 to 31 digits (length/2+1 bytes)
- CHAR - Alphanumeric from 1 to 12 characters (length bytes)
- CHAR - Alphanumeric from 13 to 255 characters with fixed content size (length bytes)
- VARCHAR - Alphanumeric greater than 12 characters (2+content length)
- GRAPHIC and VARGRAPHIC - the same concept of CHAR and VARCHAR for UNICODE characters.
This problem gets worse when it comes to UNICODE characters using two bytes for each character. I want data in memory, I want that data sleeps in memory, this is my main battle and, logically, I also configure large memory areas for bufferpools especially for temporary tablespaces and reference data, EDM, SORT and RID pools, Log buffer and Dataset control.
Best regards.
abel.martins@uol.com.br

Craig S. Mullins said...

Hi Abel:

Good to hear from you. I understand your concerns here and agree that, in general, less disk space used means less data to read and therefore less I/O. However, I would add that VARCHAR adds programming burden to developers who have to use and keep track of the length of the columns programatically. I sometimes advise people to look into compression instead of VARCHAR. That way, you can use CHAR which does not require programmers to code lengths, and still reduce disk usage by compressing the entire row instead of just the one column. Of course, like everything, this is not an always/never type of advice...

Cheers!