Sunday, August 20, 2006

Advice on Using Variable Character Columns in DB2

One of the long-standing, troubling questions in DB2-land is when to use VARCHAR versus CHAR. The high-level advice for when to use VARCHAR instead of CHAR is for larger columns whose length varies considerably from row-to-row. Basically, VARCHAR should be used to save space in the database when your values are truly variable.

In other words, if you have a 10-byte column, it is probably not a good idea to make it variable... unless, of course, 90% of the values are only one or two bytes, then it might make some sense. Have you gotten the idea here that I'm not going to give any hard and fast rules? Hope so, cause I won't - just high-level guidance.

Another situation: say you have an 80 byte column where values range from 10 bytes to the full 80 bytes... and more than 50% of them are less than 60 bytes. Well, that sounds like a possible candidate for VARCHAR to me.

Of course, there are other considerations. Java programmers tend to prefer variable character columns because Java does not have a native fixed length character data type.

For traditional programming languages though, CHAR is preferred because VARCHAR requires additional programmatic handling (to set the length of each column when inserting or modifying the data).

OK, so what if you are trying to determine whether or not the appropriate decision was made when for VARCHAR columns instead of CHAR? You can use information from the DB2 Catalog to get a handle on the actual sizes of each VARCHAR column.

Using views and SQL it is possible to develop a report showing the lengths of the variable column values. First, determine which VARCHAR column you need information about. For the purposes of this example, let's examine the NAME column of SYSIBM.SYSTABLES. This column is defined as VARCHAR(18). Create a view that returns the length of the NAME column for every row, for example:

CREATE VIEW LENGTH_INFO
(COL_LGTH)
AS
SELECT LENGTH(NAME)
FROM SYSIBM.SYSTABLES;

Then, issue the following query using SPUFI to produce a report detailing the LENGTH and number of occurrences for that length:

SELECT COL_LGTH, COUNT(*)
FROM LENGTH_INFO
GROUP BY COL_LGTH
ORDER BY COL_LGTH;

This query will produce a report listing the lengths (in this case, from 1 to 18, excluding those lengths which do not occur) and the number of times that each length occurs in the table. These results can be analyzed to determine the range of lengths stored within the variable column. If you are not concerned about this level of detail, the following query can be used instead to summarize the space characteristics of the variable column in question:

SELECT 18*COUNT(*),
SUM(2+LENGTH(NAME)),
18*COUNT(*)-SUM(2+LENGTH(NAME)),
18,
AVG(2+LENGTH(NAME)),
18-AVG(2+LENGTH(NAME))
FROM SYSIBM.SYSTABLES;

The constant 18 will need to be changed in the query to indicate the maximum length of the variable column as defined in the DDL. This query will produce a report such as the one shown below:

SPACE SPACE TOTAL AVERAGE AVERAGE AVERAGE
USED AS USED AS SPACE SPACE AS SPACE AS SPACE
CHAR(18) VARCHAR(18) SAVED CHAR(18) VARCHAR(18) SAVED
--------- ----------- ------ -------- ----------- -------
158058 96515 61543 18 10 8



This information can then be analyzed to determine if the appropriate decision was made when VARCHAR was chosen. (Of course, the values returned will differ based on your environment and the column(s) that you choose to analyze.) Also, keep in mind that this report will not include the 2 byte prefix stored by DB2 for variable length columns.

I hope this high-level overview with advice on when to use VARCHAR versus CHAR has been helpful. If you have your own guidelines or queries that you use please feel free to post a comment to this blog and share them with everyone.



NOTE: You could skip the creation of the VIEW in the above query and just use a nested table expression (aka in-line view) instead.

1 comment:

Anonymous said...

You have to look at this from a join or a lookup perspective, then you need to do a ltrim and rtrim . Is it worth the effort the developer is going to put and and the needless processing power spent per row per column to do the trimming ???

Who is the end user, how is it going to be used ?