Thursday, August 24, 2006

VARCHAR versus Compression

A couple of days ago I posted a blurb giving advice on using variable character columns in DB2. After thinking about the topic a little bit more, I decided to post a follow-on topic: namely, comparing the use of VARCHAR to DB2 compression.

Even though these are two entirely different "things," they are each probably done for similar reasons - to save disk storage. VARCHAR does this by adjusting the size of the column to fit the actual length of text being stored; compression does this by sending rows of data through an algorithm to minimize their length. For those interested in the details of compression I refer you to Willie Favero's excellent blog, where he has written a multi-part series on compression -- here are the links to it: part one, part two, and part three.

So, what advice can I give on comparing the two? Well, you might want to consider forgoing the use of variable columns and turning on compression instead. With variable columns you always add overhead: there is a two-byte prefix for every VARCHAR column to store the length of the VARCHAR. If instead you use CHAR and turn on compression, you no longer need the extra two bytes per row per variable column.

Also, keep in mind that compression reduces the size of the entire row. So not only will you be compressing the CHAR column (that used to be VARCHAR), but you will also give DB2 the opportunity to compress every other column in that row.

All in all, that means that compression can return better disk storage savings than variable columns, and all without the programmatic overhead of having to calculate and store the two-byte prefix for each previously variable column.
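
As a minimal sketch of what turning on compression looks like in DB2 for z/OS, the statement below alters an existing table space. The names DBEXMPL (database) and TSEXMPL (table space) are hypothetical and used only for illustration; substitute your own.

```sql
-- Hypothetical names: DBEXMPL is the database, TSEXMPL the table space.
ALTER TABLESPACE DBEXMPL.TSEXMPL
  COMPRESS YES;

-- The ALTER alone does not compress rows that are already stored.
-- The compression dictionary is built, and existing rows are compressed,
-- when the REORG or LOAD utility is next run against the table space.
```

Because compression is set at the table space level, it applies to whole rows rather than to individual columns.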

Of course, I don't want to give the impression that this should always be done... (remember the DBA's maxim: Almost never say "always or never.") And there are additional things to consider, such as:
  • Compression adds a compression dictionary to the table space so a compressed table space can actually be larger than a non-compressed table space (if it is very small to begin with).
  • Compression requires additional CPU cycles to compress and decompress the data as it is inserted, modified, and read (of course, I/O can decrease because more of the smaller rows will fit on each page, so degraded CPU performance can be offset by improved I/O).

This is just an additional "thing to consider" when you are building your DB2 databases and trying to decide whether you should use VARCHAR or CHAR...

Monday, August 21, 2006

IBM Mainframes - Not Just for Big Shops Any More

Just a quick blog today to point you to an interesting article in the latest issue of IBM Systems Magazine - Mainframe Edition. The article, titled A New System for a New Market, points out that the System z9 Business Class (z9 BC) platform, the latest mainframe in IBM's product line announced in April 2006, is suitable for the small and medium business (SMB) space.

This offering brings high performance and scalability to the SMB market at a very reasonable cost (around $100k). With specialty engines that can be added (IFL, zIIP and zAAP), again at a reasonable cost, it looks like the IBM mainframe will not only remain viable for large shops, but it could expand out into smaller ones, too.

So, as most mainframe aficionados know, the mainframe is not dead. But, it may actually be able to grow with the new features and affordability being built into IBM's new mainframes.

Sunday, August 20, 2006

Advice on Using Variable Character Columns in DB2

One of the long-standing, troubling questions in DB2-land is when to use VARCHAR versus CHAR. The high-level advice is to use VARCHAR instead of CHAR for larger columns whose length varies considerably from row to row. Basically, VARCHAR should be used to save space in the database when your values are truly variable.

In other words, if you have a 10-byte column, it is probably not a good idea to make it variable... unless, of course, 90% of the values are only one or two bytes; then it might make some sense. Have you gotten the idea here that I'm not going to give any hard and fast rules? Hope so, 'cause I won't - just high-level guidance.

Another situation: say you have an 80-byte column where values range from 10 bytes to the full 80 bytes... and more than 50% of them are less than 60 bytes. Well, that sounds like a possible candidate for VARCHAR to me.
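
If the column in question is currently defined as CHAR, a query along the following lines may help check whether it fits that profile. This is only a sketch: MYSCHEMA.CUSTOMER and its CHAR(80) column NOTES are hypothetical names, and the 60-byte threshold is simply the one from the example above. RTRIM is used because LENGTH on a fixed-length column returns the defined length, trailing blanks included.

```sql
-- Hypothetical table MYSCHEMA.CUSTOMER with a CHAR(80) column named NOTES.
SELECT COUNT(*)                                    AS TOTAL_ROWS,
       SUM(CASE WHEN LENGTH(RTRIM(NOTES)) < 60
                THEN 1 ELSE 0 END)                 AS ROWS_UNDER_60,
       MIN(LENGTH(RTRIM(NOTES)))                   AS MIN_LGTH,
       MAX(LENGTH(RTRIM(NOTES)))                   AS MAX_LGTH,
       AVG(LENGTH(RTRIM(NOTES)))                   AS AVG_LGTH
FROM MYSCHEMA.CUSTOMER;
```

Comparing ROWS_UNDER_60 to TOTAL_ROWS gives the percentage of short values, and AVG_LGTH plus the two-byte prefix can be compared against the fixed 80 bytes.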

Of course, there are other considerations. Java programmers tend to prefer variable character columns because Java does not have a native fixed length character data type.

For traditional programming languages though, CHAR is preferred because VARCHAR requires additional programmatic handling (to set the length of each column when inserting or modifying the data).

OK, so what if you are trying to determine whether or not the appropriate decision was made when VARCHAR columns were chosen instead of CHAR? You can use information from the DB2 Catalog to get a handle on the actual sizes of each VARCHAR column.

Using views and SQL it is possible to develop a report showing the lengths of the variable column values. First, determine which VARCHAR column you need information about. For the purposes of this example, let's examine the NAME column of SYSIBM.SYSTABLES. This column is defined as VARCHAR(18). Create a view that returns the length of the NAME column for every row, for example:

CREATE VIEW LENGTH_INFO
(COL_LGTH)
AS
SELECT LENGTH(NAME)
FROM SYSIBM.SYSTABLES;

Then, issue the following query using SPUFI to produce a report detailing the LENGTH and number of occurrences for that length:

SELECT COL_LGTH, COUNT(*)
FROM LENGTH_INFO
GROUP BY COL_LGTH
ORDER BY COL_LGTH;

This query will produce a report listing the lengths (in this case, from 1 to 18, excluding those lengths which do not occur) and the number of times that each length occurs in the table. These results can be analyzed to determine the range of lengths stored within the variable column. If you are not concerned about this level of detail, the following query can be used instead to summarize the space characteristics of the variable column in question:

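SELECT 18*COUNT(*),                      -- space used as CHAR(18)
       SUM(2+LENGTH(NAME)),              -- space used as VARCHAR(18), with 2-byte prefix
       18*COUNT(*)-SUM(2+LENGTH(NAME)),  -- total space saved
       18,                               -- average space as CHAR(18)
       AVG(2+LENGTH(NAME)),              -- average space as VARCHAR(18)
       18-AVG(2+LENGTH(NAME))            -- average space saved
FROM SYSIBM.SYSTABLES;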

The constant 18 will need to be changed in the query to indicate the maximum length of the variable column as defined in the DDL. This query will produce a report such as the one shown below:

SPACE      SPACE        TOTAL   AVERAGE   AVERAGE      AVERAGE
USED AS    USED AS      SPACE   SPACE AS  SPACE AS     SPACE
CHAR(18)   VARCHAR(18)  SAVED   CHAR(18)  VARCHAR(18)  SAVED
---------  -----------  ------  --------  -----------  -------
   158058        96515   61543        18           10        8



This information can then be analyzed to determine if the appropriate decision was made when VARCHAR was chosen. (Of course, the values returned will differ based on your environment and the column(s) that you choose to analyze.) Also, keep in mind that the VARCHAR figures in this report include the 2-byte prefix stored by DB2 for variable-length columns; that is what the 2+ in the query accounts for.

I hope this high-level overview with advice on when to use VARCHAR versus CHAR has been helpful. If you have your own guidelines or queries that you use please feel free to post a comment to this blog and share them with everyone.



NOTE: You could skip the creation of the VIEW in the above query and just use a nested table expression (aka in-line view) instead.
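
The nested table expression mentioned in the note might look like the following sketch, which should produce the same length report without creating a view. It assumes the same SYSIBM.SYSTABLES example; LENGTH_INFO is kept as the correlation name only for readability.

```sql
-- Same report as the view-based query, using a nested table expression.
SELECT COL_LGTH, COUNT(*)
FROM (SELECT LENGTH(NAME) AS COL_LGTH
      FROM SYSIBM.SYSTABLES) AS LENGTH_INFO
GROUP BY COL_LGTH
ORDER BY COL_LGTH;
```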

Thursday, August 17, 2006

Greatest Software Ever?

I just stumbled across a very interesting article this afternoon and thought I'd share it with everybody through my blog. The article, published in Information Week, is titled What's The Greatest Software Ever Written? And isn't that an intriguing question?

Well, I read through the article and other than a few quibbles here and there I'd have to say that the author did a good job of assembling his list. He spends quite a bit of time talking about the IBM 360 project - and well he should. This was one of the first truly huge software projects and it set the bar for what is expected of an operating system. It also was the catalyst for one of the best books ever written on software development - The Mythical Man-Month. Written by Fred Brooks, the manager in charge of the IBM 360 project, this book outlines many of the truisms about software development that we acknowledge even today - more than 40 years later. If you work in IT and you haven't read The Mythical Man-Month you really should buy a copy and read it immediately. Anyway, this blog isn't about that book, so let's move on.

I won't spoil it here and publish the list of greatest software - you will have to click on the link for the article and read it yourself (the actual list doesn't start until page 3 of the article, but don't just click right over to that page, read the whole thing).

Suffice it to say, several IBM projects make the list (I'm kinda partial to what came in at #2 -- it would've been my #1 actually). And I think perhaps that VisiCalc belongs on the list instead of the spreadsheet software that is listed - I mean, Dan Bricklin invented the entire spreadsheet category of software when Software Arts published VisiCalc back in the late 1970s.

But the article is good anyway and I'm sure it is almost impossible to publish a list like this without causing disagreement - and perhaps that is its intent anyway. So take a moment and click over to the article and give it a read. And feel free to share your thoughts on it here by posting a comment or two.

Thursday, August 10, 2006

SHARE Travelers Take Heed

With the upcoming SHARE conference in Baltimore next week, there are sure to be many of you out there who will be traveling to the nation's capital region over the weekend. As you prepare to travel, be sure to factor in additional time at the airport due to the latest TSA warning.

Basically, in response to a recently thwarted terrorist plot in the UK, the threat level has been raised to High (or Orange) for all commercial aviation operating in or destined for the United States. That means the lines will be longer and the searches more thorough going through security at the airport.

Additionally, please read the TSA announcement and heed what it is saying. I am referring specifically to this: "Due to the nature of the threat revealed by this investigation, we are prohibiting any liquids, including beverages, hair gels, and lotions from being carried on the airplane." Please, for everyone's sake, leave your liquids at home:
  • You can get a drink after you pass through security.
  • Every hotel provides shampoo, conditioner, and lotion for free, so you don't need to bring them.
  • If you absolutely have to have your favorite brand, or some gel or spray, pack it in your checked bags.
And yes, please check your dang luggage! Although I am sometimes amused by idiots trying to jam a huge bag into the overhead bin, it becomes less amusing after a two-hour amble through security. If you have a large bag, check it!

And I'll see you all in Baltimore.