Monday, October 12, 2009

On The Importance of Choosing the Correct Data Type

Most DBAs have grappled with the pros and cons of choosing one data type versus another. Sometimes the decision is easy, whereas sometimes the decision is a little bit more difficult. In today's blog entry, we'll discuss both situations.

Data type and length are the most fundamental integrity constraints applied to data in a database. Simply by specifying the data type for each column when a table is created, DB2 automatically ensures that only the correct type of data is stored in that column. Processes that attempt to insert or update the data to a non-conforming value will be rejected. Furthermore, a maximum length is assigned to the column to prohibit larger values from being stored in the table.
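
For example, here is a minimal sketch of a table definition (the PRODUCT table and its column names are hypothetical, used only for illustration). Simply by virtue of the data types and lengths chosen, DB2 will reject an attempt to store 'ABC' in PROD_CODE or a 50-character name in PROD_NAME:

CREATE TABLE PRODUCT
 (PROD_CODE    SMALLINT      NOT NULL,
  PROD_NAME    VARCHAR(40)   NOT NULL,
  UNIT_PRICE   DECIMAL(9,2)  NOT NULL,
  LAST_UPDATE  TIMESTAMP     NOT NULL);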

The DBA must choose the data type and length of each column wisely. The general rule of thumb for choosing a data type for the columns in your tables is to choose the data type that most closely matches the domain of correct values for the column. That means you should try to adhere to the following rules:
  • If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL data types. DECFLOAT and FLOAT are also options for very large numbers.
  • If the data is character, use CHAR or VARCHAR data types.
  • If the data is date and time, use DATE, TIME, and TIMESTAMP data types.
  • If the data is multimedia or other large object data, use BLOB, CLOB, or DBCLOB data types; for double-byte character data, use GRAPHIC or VARGRAPHIC data types.
But, of course, there are always exceptions. For example, what about Social Security number? That is a numeric value, but you won't be performing calculations on it. And formatting it for display requires hyphens, which cannot be stored in a numeric data type. So, perhaps, SSN is an exception.
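
If you do go the character route for something like SSN, a quick sketch (the table and column names here are hypothetical) might store the formatted value, hyphens and all:

CREATE TABLE EMPLOYEE_INFO
 (EMPNO  CHAR(6)   NOT NULL,
  SSN    CHAR(11)  NOT NULL);  -- holds values like '123-45-6789'

Of course, storing the hyphens costs two extra bytes per row; storing just the nine digits in a CHAR(9) column and formatting at display time is the thriftier choice.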

Leading Zeroes

Let's consider another typical situation: the desire to display leading zeroes. Maybe the application requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, the users want the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks of doing this?

Well, there are drawbacks! Without proper edit checks, inserts and updates could place invalid alphabetic characters into the product code. This can be a very valid concern if ad hoc data modifications are permitted. This is rare in production databases, but data problems can still occur if the proper edit checks are not coded into every program that can modify the data. But let's assume that proper edit checks are coded and will never be bypassed. This removes the data integrity question (yeah, right!).

There is another problem that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 37^4, or 1,874,161, possible values.

A SMALLINT column can range from -32,768 to 32,767, producing 65,536 possible small integer values. The drawback here is that negative or five-digit product codes could be entered. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well. And we can always code a simple CHECK constraint to ensure that the code is always greater than zero.
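
A sketch of such a constraint, reusing the hypothetical PRODUCT table and PROD_CODE column from the earlier example, might look like this:

ALTER TABLE PRODUCT
  ADD CONSTRAINT CODE_RANGE CHECK (PROD_CODE BETWEEN 1 AND 9999);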

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than it is for numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type, which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT definition over the CHAR(4) definition.

The leading zeroes problem can usually be solved using other methods. If you are using a report writer, most of them have quick methods of displaying leading zeroes. When using QMF, you can ensure that leading zeroes are shown by using the "J" edit code. Report programs can display leading zeroes easily enough by moving the host variables to appropriately edited display fields. And most other ad hoc reporting tools provide a parameter or option that enables leading zeroes to be displayed.
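
Even in straight SQL the leading zeroes can be manufactured at retrieval time. A hedged sketch, reusing the hypothetical SMALLINT PROD_CODE column: DIGITS converts the value to a zero-padded character string (CHAR(5) for a SMALLINT), and SUBSTR trims it back to four characters because the codes never exceed 9999:

SELECT SUBSTR(DIGITS(PROD_CODE), 2, 4) AS PROD_CODE_DISPLAY
FROM   PRODUCT;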

In general, it is wise to choose the data type that is closest to the domain of the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. In addition, always be sure to code appropriate edit checks to ensure data integrity.

A DATE with DB2

What about dates? In my opinion, you should always use a DATE data type for date data. Why anyone would ever store a date or time in a DB2 table in any format other than DATE, TIME, or TIMESTAMP is beyond me. But, oh, they do it all the time. And it causes all sorts of headaches. The benefits of using the proper DB2 data type are many, including:
  • Ensuring data integrity because DB2 will ensure that only valid date and time values are stored
  • The ability to use date and time arithmetic
  • A vast array of built-in functions to operate on and transform date and time values
  • Multiple formatting choices
Additionally, you have multiple built-in functions at your disposal for manipulating date and time data, but only when the data is stored as DATE, TIME, and TIMESTAMP data types. The date- and time-related DB2 functions include CHAR, DATE, DAY, DAYOFMONTH, DAYOFWEEK, DAYOFYEAR, DAYS, HOUR, JULIAN_DAY, MICROSECOND, MIDNIGHT_SECONDS, MINUTE, MONTH, QUARTER, SECOND, TIME, TIMESTAMP, WEEK, WEEK_ISO, and YEAR.
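
To give a flavor of what that buys you, here is a sketch against the sample employee table (assuming its HIREDATE column): date arithmetic and the built-in functions do the work that hand-rolled character manipulation would otherwise have to do.

SELECT EMPNO,
       HIREDATE,
       YEAR(CURRENT DATE - HIREDATE) AS YEARS_OF_SERVICE,
       DAYOFWEEK(HIREDATE)           AS HIRE_DAY_OF_WEEK
FROM   DSN8710.EMP
WHERE  HIREDATE + 25 YEARS <= CURRENT DATE;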

I get questions all the time from folks who've stored dates using character data types; they ask all kinds of integrity and manipulation questions and I always, always, always start my reply with "You shouldn't have done that!"... and then I try to help them out, if I can. Don't fall into the trap of storing dates using anything but a DATE data type... you'll thank me later.

Summary

There is a lot more that can be said about choosing appropriate data types, but I think we've covered enough for today. The bottom line on data types is to use them to protect the integrity of your DB2 data and to simplify your job by taking advantage of DB2’s built-in capabilities. By choosing the data type that most closely matches your data, you will be doing yourself, your systems, and your users a big favor.

Friday, October 02, 2009

IDUG Europe is Right Around the Corner

Just a quick post today to remind everybody that the annual European IDUG conference will be held next week (the week of October 5, 2009) in Rome, Italy. And it is not too late to ensure that you will be there to hear the latest and greatest news, tips, tricks, and guidelines on our favorite DBMS - IBM's DB2!

For those of you not lucky enough to be there, keep an eye on my DB2portal blog here, where I will attempt to summarize the key events of the week.

And if you are a Twitter aficionado, be sure to follow me on Twitter as I will try to make regular Tweets about the event (as long as my Blackberry works in Rome).

Thursday, September 24, 2009

Limiting the Number of Rows Fetched

Application developers frequently need to retrieve a limited number of qualifying rows from a table. For example, maybe you need to list the top ten best selling items from inventory, or a list of the top five most expensive products (i.e., highest price tag). There are several ways to accomplish this prior to DB2 V7 using SQL, but they are not necessarily efficient.

The first reaction is to simply use the WHERE clause to eliminate non-qualifying rows. But this is simplistic, and often is not sufficient to produce the results desired in an optimal manner. What if the program only requires that the top ten results be returned? This can be a somewhat difficult request to formulate using SQL alone.

Consider, for example, an application that needs to retrieve only the top ten most highly paid employees from the EMP sample table. You could simply issue a SQL request that retrieves all of the employees in order by salary, but only use the first ten retrieved. That is easy; for example:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM   DSN8710.EMP
ORDER BY SALARY DESC;

You must specify the ORDER BY clause with the DESC key word. This sorts the results into descending order, instead of the default, which is ascending. Without the DESC key word, the "top ten" would be at the very end of the results set, not at the beginning.

But that does not really satisfy the requirement - retrieving only the top ten. It merely sorts the results into descending sequence. So the results would still be all employees in the table, but in the correct order so you can view the "top ten" salaries very easily. The ideal solution should return only the top ten employees with the highest salary and not merely a sorted list of all employees.

You can code some "tricky" SQL to support this request for all versions of DB2, such as the following:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM   DSN8710.EMP A
WHERE  10 > (SELECT COUNT(*)
             FROM   DSN8710.EMP B
             WHERE  A.SALARY < B.SALARY)
  AND  SALARY IS NOT NULL
ORDER BY SALARY DESC;

This SQL is portable from DB2 to other DBMSs, such as Oracle or SQL Server. And, of course, you can change the constant 10 to any number you wish, thereby retrieving the top 20, or top 5, as deemed necessary by the needs of your application.

Since the SALARY column is nullable in the EMP table, you must remove the nulls from the results set. And the ORDER BY is required to sort the results in the right order. If it is removed from the query, the results will still contain the top ten, but they will be in no particular order.

But DB2, as of V7, provides an easier and less complicated way to limit the results of a SELECT statement: the FETCH FIRST clause. You can code FETCH FIRST n ROWS ONLY, which will limit the number of rows that are fetched and returned by a SELECT statement.
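
Using it, the top-ten request becomes a sketch as simple as this:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM   DSN8710.EMP
WHERE  SALARY IS NOT NULL
ORDER BY SALARY DESC
FETCH FIRST 10 ROWS ONLY;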

Additionally, you can specify a new clause, FETCH FIRST 1 ROW ONLY, on SELECT INTO statements when the query can return more than one row in the answer set. Doing so informs DB2 to ignore the other rows.
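
For example, a singleton SELECT along these lines (the host variable names are made up for illustration) will return one qualifying row instead of failing with an error when several rows qualify:

SELECT SALARY
INTO   :HV-SALARY
FROM   DSN8710.EMP
WHERE  WORKDEPT = :HV-WORKDEPT
FETCH FIRST 1 ROW ONLY;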

There is one difference between the new V7 formulation and the other SELECT statement we reviewed, and that is the way "ties" are handled. A tie occurs when more than one row contains the same value. The previous query we examined may return more than 10 rows if there are multiple rows with the same value for SALARY within the top ten.

Using FETCH FIRST, DB2 will limit the number of rows returned to ten, even if there are other rows with the same value for SALARY as the tenth row in the results set. The needs of your application will dictate whether ties are to be ignored or included in the result set. If all "ties" need to be included in the results set, which would mean that more than 10 rows could be returned, the new V7 feature may not prove to be helpful.

And it is also important to note that as of DB2 9, you can include the FETCH FIRST clause in a subselect. ORDER BY is allowed in a subselect, too. The subselect MUST be enclosed in parentheses and the FETCH FIRST (or ORDER BY) cannot be in the outermost fullselect of a view, or in a materialized query table.
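
As a sketch of what that enables, the following computes the average of the ten highest salaries in a single statement by placing the ORDER BY and FETCH FIRST inside a parenthesized nested table expression:

SELECT AVG(T.SALARY) AS AVG_TOP10_SALARY
FROM  (SELECT SALARY
       FROM   DSN8710.EMP
       ORDER BY SALARY DESC
       FETCH FIRST 10 ROWS ONLY) AS T;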

Friday, September 04, 2009

Dynamic SQL Causing Lock Escalation?

Lock escalation is the promotion of a lock from a row, page, or LOB lock to a table space lock because the number of row, page, or LOB locks concurrently held on a given resource exceeds a preset limit.

But lock escalation can cause problems. Yes, when fewer locks are taken, CPU cost and memory usage can be reduced. On the other hand, escalating the size of a lock causes more resources to be locked... and that impacts concurrency with the most likely result being applications experiencing lock timeouts or deadlocks.
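
The escalation threshold is controlled at the table space level (and by the NUMLKTS DSNZPARM when LOCKMAX SYSTEM is in effect). As a hedged sketch, using the sample employee table space name, the following tells DB2 to escalate once roughly a thousand page locks are held; specifying LOCKMAX 0 would turn escalation off for that table space entirely:

ALTER TABLESPACE DSN8D71A.DSN8S71E
  LOCKSIZE PAGE
  LOCKMAX  1000;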

When lock escalation occurs, DB2 writes messages to the system log. So when complaints inevitably arrive about failing transactions you can look for the appropriate lock escalation message indicating that it could be the culprit.

But such simple steps are typically not sufficient to track down the root cause and take corrective actions. You need to be able to determine what SQL statement in what program and under what circumstances is causing the escalation. Doing so involves tracing and using performance monitoring tools.

For those interested in this topic (indeed, this is the primary purpose of this blog post), you should seek out the IBM TechDoc titled "Identifying the dynamic SQL statement which is causing a lock escalation in DB2 for z/OS." For those who don't know what a TechDoc is, don't worry, it isn't that complicated. It is basically a published piece of technical documentation that isn't part of a larger manual or IBM Redbook. TechDocs can be short papers, product flashes, presentations, etc.

This particular TechDoc details a methodology for identifying dynamic SQL statements involved in a lock escalation. It describes which traces to start, and which jobs to run to analyze the collected traces. It also explains how to analyze the trace report to find the problematic SQL statement.

You can search all of IBM's TechDocs by following this link.

Happy reading!

Friday, August 21, 2009

Approaches to Access Path Management

BIND and REBIND are important components in assuring efficient DB2 applications. Because the BIND/REBIND process determines exactly how your DB2 data is accessed, it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as your data and systems change. This approach involves some form of regular maintenance that keeps DB2 statistics up to date and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

The final approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world, we would be able to review the access path changes beforehand to determine whether they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third-party tools that can help you achieve this, though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the three Rs. To implement this approach you:
  1. Regularly reorganize the data to ensure that it is optimally structured.
  2. Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
  3. And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS.
At any rate, your goal should be to keep your access paths up-to-date with the current state of your data. Failing to do this means that DB2 is accessing data based upon false assumptions. DB2 is unlikely to make the same access path choice as your data grows – and as patterns within the data change.
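
Under the object names of the DB2 sample database (substitute your own table spaces; the collection name below is made up), the three Rs boil down to something like this sketch, with the first two steps run as utility control statements and the last as a DSN subcommand:

REORG TABLESPACE DSN8D71A.DSN8S71E

RUNSTATS TABLESPACE DSN8D71A.DSN8S71E TABLE(ALL) INDEX(ALL)

REBIND PACKAGE(MYCOLL.*)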

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or Real-Time Statistics (RTS). So, we need to add another R – in other words:
  1. RUNSTATS or better yet, RTS
  2. REORG
  3. RUNSTATS
  4. REBIND
Now it is true that some folks don’t rely on statistics to schedule a REORG. Instead, they just build the JCL to REORG their database objects when they create the object. So they create a table space then build the REORG job and schedule it to run monthly, or quarterly, or on some regular basis. This is better than no REORG at all, but it is probably not the best approach because you are most likely either reorganizing too soon (in which case you waste the CPU cycles to do the REORG) or you are reorganizing too late (in which case performance is suffering for a period of time before the REORG runs). Better to base your REORGs off of statistics and thresholds using either RUNSTATS or RTS.

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer doesn’t have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive.

There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next, we need to decide when to REBIND. The best answer for this is when statistics have changed significantly enough to change access paths. When we know that data has significantly changed, it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or, better yet, access paths), coming up with an answer manually can be time-consuming and error-prone – especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four R’s approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four R’s becomes the Five R’s as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:

  1. Start with RTS (or a RUNSTATS) to determine when to REORG.
  2. Reorganize the table spaces (and/or indexes).
  3. After reorganizing, run RUNSTATS again.
  4. Follow that with the REBINDs.
  5. Then we need that fifth R – which is to Review the access paths generated by the REBIND.
The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only ring you up when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results, comparing the before-and-after impact of the optimizer’s choices.
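
One low-tech way to start such a review, assuming your programs are bound with EXPLAIN(YES), is to query the PLAN_TABLE and compare the rows from before and after the REBIND (the program name below is just a placeholder):

SELECT PROGNAME, QUERYNO, QBLOCKNO, PLANNO,
       ACCESSTYPE, ACCESSNAME, MATCHCOLS, INDEXONLY, BIND_TIME
FROM   PLAN_TABLE
WHERE  PROGNAME = 'PROGRAM1'
ORDER BY QUERYNO, QBLOCKNO, PLANNO, BIND_TIME;

Third-party tools automate this comparison, but even a manual check of the access types and index choices will surface most rogue access paths before the users do.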