Thursday, August 31, 2006

Accessing Partitioned Data

One area that tends to confuse some DB2 developers until they gain experience is how DB2 partitioning works. A common question I get goes something like this: “If a table is in a partitioned table space, say four partitions, for example, then to process the table in batch can I run four instances of the batch program in parallel, one against each partition. What do I code to do this?”

Well, the short and sweet answer to this question is “Yes, you can run four instances of a batch program in parallel if you so desire.” But there is a nuance to this misconception that might be missed here. The question lurking beneath the question is this: “How can I make sure I am accessing only data in one partition in each of the batch programs?”

To do this requires some programming work. The program will need to have a means of identifying which partition it should run against. So, you might code the program to accept an input parameter of 1, 2, 3, or 4. The program would read the parameter and translate it into the key range of values that should be read by the program. This is the LIMITKEY value for the particular partition as found in the partitioning index. You can retrieve this value from the DB2 catalog using the following query:

SELECT PARTITION, LIMITKEY
FROM SYSIBM.SYSINDEXPART
WHERE IXNAME = ?
AND IXCREATOR = ?
ORDER BY PARTITION;

Supply the index name and creator and this query will return the partition number and LIMITKEY for that partition. (If you include this query in the program you probably will want to include the PARTITION column in the WHERE clause and return only a single row.) The LIMITKEY is the high key value for that partition. Using this information you will need to write the queries in the program such that only values from the partition being processes will be retrieved. As long as the program adheres to that key range you should only process data from the one partition that holds that data.

Of course, none of this is necessary to get DB2 to operate in parallel. The best approach uses DB2 query parallelism because it minimizes the amount of programming work and has the potential to maximize performance. To signal DB2 to turn on parallelism you will need to BIND your program specifying the DEGREE(ANY) parameter. Then DB2 will choose the degree of parallelism for the batch program. You will only need to run the program once (instead of 4 times as in our example); DB2 will figure out how many tasks it can run in parallel for each query in the program. And there is no need to modify the program at all! This is far simpler than any alternate approach because DB2 handles chunking up the work into parallel tasks for you.

Tuesday, August 29, 2006

How are Indexes Being Used?

In keeping with my promise to periodically post blog entries based on questions I have received, here we have another question I have been asked:

If I have five indexes on a table, what is the best way to determine if all, some or none of the indexes are being used?

Here is the answer I sent:

The best approach would be to make sure you have run EXPLAIN on all of your production plans and packages. Then examine the PLAN_TABLE output for those indexes. For example:

SELECT *
FROM my.PLAN_TABLE
WHERE ACCESSNAME IN (put your list of indexes here);

Of course, this will not show the dynamic SQL that uses any of these indexes. There are tools on the market that can help you to gather SQL usage statistics and access path information for both static and dynamic SQL statements.

You might consider acquiring one of these tools if you use a lot of dynamic SQL. One such offering is NEON Enterprise Software’s SQL Performance Expert solution.

Thursday, August 24, 2006

VARCHAR versus Compression

A couple of days ago I posted a blurb giving advice on using variable character columns in DB2. After thinking about the topic a little bit more, I decided to post a follow-on topic: namely, comparing the use of VARCHAR to DB2 compression.

Even though these are two entirely different "things," they are each probably done for similar reasons - to save disk storage. VARCHAR does this by adjusting the size of the column to fit the actual length of text being stored; compression does this by sending rows of data through an algorithm to minimize its length. For those interested in the details of compression I refer you to Willie Favero's excellent blog where he has written a several-part series on compression -- here are the links to it: part one, part two, and part three.

So, what advice can I give on comparing the two? Well, you might want to consider forgoing the use of variable columns and instead turn on compression. With variable columns you always add overhead: there is a two-byte prefix for every VARCHAR column to store the length of the VARCHAR. If instead you use CHAR and turn on compression you no longer need the extra two bytes per row per variable column.

Also, keep in mind that compression reduces the size of the entire row. So not only will you be compressing the CHAR column (that used to be VARCHAR), but you will also give DB2 the opportunity to compress every other column in that row.

All in all, that means that comrpession can return better disk storage savings than variable columns, and all without the programmatic overhead of having to calculate and store the two-byte prefix for each previously variable column.

Of course, I don't want to give the impression that this should always be done... (remember the DBA's maxim: Almost never say "always or never.") And there are additional things to consider, such as:
  • Compression adds a compression dictionary to the table space so a compressed table space can actually be larger than a non-compressed table space (if it is very small to begin with).
  • Compression requires additional CPU cycles to compress and de-compress the data as it is inserted, modified, and read (of course, I/O can decrease because smaller rows will fit more on each page, so degraded CPU performance can be offset by improved I/O)

This is just an additional "thing to consider" when you are building your DB2 databases and trying to decide whether you should use VARCHAR or CHAR...

Monday, August 21, 2006

IBM Mainframes - Not Just for Big Shops Any More

Just a quick blog today to point you to an interesting article in the latest issue of IBM Systems Magazine - Mainframe Edition. The article, titled A New System for a New Market, points out that the System z9 Business Class (z9 BC) platform, the latest mainframe in IBM's product line announced in April 2006, is suitable for the small and medium business (SMB) space.

This offering brings high performance and scalability to the SMB market at a very reasonable cost (around $100k). With specialty engines that can be added (IFL, zIIP and zAAP), again at a reasonable cost, it looks like the IBM mainframe will not only remain viable for large shops, but it could expand out into smaller ones, too.

So, as most mainframe afficianados know, the mainframe is not dead. But, it may actually be able to grow with the new features and affordability being built into IBM's new mainframes.

Sunday, August 20, 2006

Advice on Using Variable Character Columns in DB2

One of the long-standing, troubling questions in DB2-land is when to use VARCHAR versus CHAR. The high-level advice for when to use VARCHAR instead of CHAR is for larger columns whose length varies considerably from row-to-row. Basically, VARCHAR should be used to save space in the database when your values are truly variable.

In other words, if you have a 10-byte column, it is probably not a good idea to make it variable... unless, of course, 90% of the values are only one or two bytes, then it might make some sense. Have you gotten the idea here that I'm not going to give any hard and fast rules? Hope so, cause I won't - just high-level guidance.

Another situation: say you have an 80 byte column where values range from 10 bytes to the full 80 bytes... and more than 50% of them are less than 60 bytes. Well, that sounds like a possible candidate for VARCHAR to me.

Of course, there are other considerations. Java programmers tend to prefer variable character columns because Java does not have a native fixed length character data type.

For traditional programming languages though, CHAR is preferred because VARCHAR requires additional programmatic handling (to set the length of each column when inserting or modifying the data).

OK, so what if you are trying to determine whether or not the appropriate decision was made when for VARCHAR columns instead of CHAR? You can use information from the DB2 Catalog to get a handle on the actual sizes of each VARCHAR column.

Using views and SQL it is possible to develop a report showing the lengths of the variable column values. First, determine which VARCHAR column you need information about. For the purposes of this example, let's examine the NAME column of SYSIBM.SYSTABLES. This column is defined as VARCHAR(18). Create a view that returns the length of the NAME column for every row, for example:

CREATE VIEW LENGTH_INFO
(COL_LGTH)
AS
SELECT LENGTH(NAME)
FROM SYSIBM.SYSTABLES;

Then, issue the following query using SPUFI to produce a report detailing the LENGTH and number of occurrences for that length:

SELECT COL_LGTH, COUNT(*)
FROM LENGTH_INFO
GROUP BY COL_LGTH
ORDER BY COL_LGTH;

This query will produce a report listing the lengths (in this case, from 1 to 18, excluding those lengths which do not occur) and the number of times that each length occurs in the table. These results can be analyzed to determine the range of lengths stored within the variable column. If you are not concerned about this level of detail, the following query can be used instead to summarize the space characteristics of the variable column in question:

SELECT 18*COUNT(*),
SUM(2+LENGTH(NAME)),
18*COUNT(*)-SUM(2+LENGTH(NAME)),
18,
AVG(2+LENGTH(NAME)),
18-AVG(2+LENGTH(NAME))
FROM SYSIBM.SYSTABLES;

The constant 18 will need to be changed in the query to indicate the maximum length of the variable column as defined in the DDL. This query will produce a report such as the one shown below:

SPACE SPACE TOTAL AVERAGE AVERAGE AVERAGE
USED AS USED AS SPACE SPACE AS SPACE AS SPACE
CHAR(18) VARCHAR(18) SAVED CHAR(18) VARCHAR(18) SAVED
--------- ----------- ------ -------- ----------- -------
158058 96515 61543 18 10 8



This information can then be analyzed to determine if the appropriate decision was made when VARCHAR was chosen. (Of course, the values returned will differ based on your environment and the column(s) that you choose to analyze.) Also, keep in mind that this report will not include the 2 byte prefix stored by DB2 for variable length columns.

I hope this high-level overview with advice on when to use VARCHAR versus CHAR has been helpful. If you have your own guidelines or queries that you use please feel free to post a comment to this blog and share them with everyone.



NOTE: You could skip the creation of the VIEW in the above query and just use a nested table expression (aka in-line view) instead.