The Db2 Portal Blog

Tuesday, November 17, 2009

Replacing UNION with CASE

When a UNION is required to put together data from multiple queries, you might be able to use a CASE statement instead. This is very useful, particularly when the data for each of the queries in the UNION come from the same table. The CASE statement can potentially enhance performance by minimizing the number of times the data is read.

Let’s look at an example to clarify why:

SELECT CREATOR, NAME, 'TABLE'
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'T'
UNION
SELECT CREATOR, NAME, 'VIEW '
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'V'
UNION
SELECT CREATOR, NAME, 'ALIAS'
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'A'
ORDER BY NAME;

This simple SQL statement uses UNION to put together the results of three queries against the SYSTABLES table. The report shows all of the DB2 table-like objects that exist in the DB2 subsystem: tables, views, and synonyms.

To do this, DB2 must scan through the table three times – once for each query (as there is no index on the TYPE column). But, you can use CASE and code an equivalent, but more efficient query, as follows:

SELECT CREATOR, NAME,
CASE TYPE
WHEN 'T' THEN 'TABLE'
WHEN 'V' THEN 'VIEW '
WHEN 'A' THEN 'ALIAS'
END
FROM SYSIBM.SYSTABLES
ORDER BY NAME;

This new query will need to scan SYSTABLES only once. The CASE statement will translate the code in the TYPE column into the text that we desire.

CASE statements are very powerful and you should use them when you can to create elegant and optimal SQL in your DB2 applications.

Tuesday, November 03, 2009

Deprecated Features Planned for DB2 X for z/OS

Everyone is always interested in the latest and greatest features of their favorite DBMS, in this case DB2. But sometimes features get removed from the DBMS when a new version is released. According to the IBM teleconference on DB2 X for z/OS today, there are several features planned to be deprecated (i.e. removed). Let's briefly take a look at them.

The first feature that will be removed is private protocol DRDA. This should come as no surprise to anybody since IBM has been indicating that private protocal distribution was on its way out for a number of releases now. And it really is not that difficult to convert to DRDA (and there will be some new help in DSNTP2DP).

Perhaps more troublesome to some shops will be the removal of support for plans containing DBRMs. You will have to convert the DBRMs to packages when you get to DB2 X. You would think that since packages have been available as far back as V2.3 that most shops would have converted to them already. But I do know that there are some shops out there" who still have DBRMs bound directly into their plans. So start planning to convert ASAP! IBM will offer some support in DB2 X to track down affected plans.

While on the topic of plans and packages, you will have to REBIND any that have not been rebound since V5 or before. But it is a good rule to REBIND everything when you go to a new version of DB2 anyway, and most (but not all) shops do that.

From a documentation perspective, BookManager will no longer be supported. Instead we get the PDF versions and the Info Center online.

And the DB2 Management Clients (DB2 Administration Server, Control Center, and Development Center) are deprecated. IBM's new management client direction for DB2 is Data Studio.

Other deprecated items include:

ACQUIRE(ALLOCATE) [change to ACQUIRE(USE)]
Workload capture through profile monitor
XML Extender [change to use pureXML data type]
DB2 MQ XML user-defined functions and stored procedures [change to XML functions]
msys for Setup DB2 Customization Center [change to install panels]

At this point, DB2 X is on track to become generally available at the end of next year, 2010. At this point IBM has announced that migration will be supported from DB2 9 NFM (and has hinted that they are considering supporting migration to DB2 X directly from DB2 V8 NFM).

Monday, November 02, 2009

New DB2 Twitter List

Just a very quick post this morning to let all you DB2 Twitter folks out there know that I created a list of the DB2 tweeters I know about at http://twitter.com/craigmullins/db2-folks.

If you are a DB2 professional and I left you off the list please leave a comment here or drop me an e-mail and I'll be happy to add you.

Friday, October 30, 2009

IOD2009 Day Three – Malcolm Gladwell

Today’s blog entry is a little late seeing as how this is Friday and IOD is over, but I’m writing about Wednesday morning’s keynote session highlighted by Malcolm Gladwell.

For those who do not know him, Malcolm Gladwell is a Canadian journalist and author best known as the author of the books The Tipping Point (2000), Blink (2005), and Outliers (2008). I’ve read all three of them and I highly recommend that you do, too. He also has a new book, What The Dog Saw, that I bought at the airport on the way home from IOD. I hope it is as good as the other three!

Gladwell’s books deal with the unexpected implications of research in the social sciences. He spoke about some of these during his keynote session in a very entertaining and informative way. It was especially rewarding to hear him tie his messaging into the conference them of information-led transformation… and to hear him call all of the attendees mavens (read his books to understand that term).

Gladwell began his talk by noting the irony of hosting a conference dealing with information analytics in Las Vegas, of all places. You would think that the casinos might have an interest in that topic!

The major idea conveyed by Gladwell during his talk focused on change, and how it never occurs the way you think it will. He explained how radical changes happen much more quickly than we imagine. And he used a story about how the broadcast of a major prize fight transformed radio from a niche product to a transformative one.

He also talked about reframing as being necessary to elicit major changes. Prior to broadcasting the boxing match, radio was used to deliver news and classical music. But reframing it as a product for the delivery of real-time sports coverage – reframing its use – caused major transformation. Gladwell also highlighted Apple and the iPod. The iPod was not the first portable MP3 product, but it is the most successful. Because Apple simplified the interface and promoted it simply… that is, they reframed the issue!

I also enjoyed Gladwell’s story about how he purchased a laptop at CompUSA. Upon entering the store he was confronted with tables and tables of similar looking laptops and had no way to differentiate them. He tried the sales people but as anyone who ever went to CompUSA knows, they either couldn’t be found when you needed them or knew nothing about technology. So he called up his brother, who works in IT. He is not a well-known expert, but he is Malcolm’s maven – the guy he turns to for help. And he told him which laptop to purchase. Gladwell then went back to the CompUSA and he said he stood there pointing to the laptop his brother told him about. And about a half hour later the crack CompUSA salesman came to help him.

This story highlighted the concept of the maven. He also mentioned that if the executives were watching how he confidently pointed to the machine he wanted that it might have led them to erroneously believe that CompUSA needed less experienced… and, indeed, fewer sales people. Which might have led to their demise.

All in all, Gladwell was entertaining and informative – a very powerful combination.

The other highlight of the keynote session, for me at least, was that I was mentioned by name as the most prolific Twitter-er at the IOD conference. Now does that mean I was the most helpful or the most annoying… I’ll leave that to you all to decide.

Wednesday, October 28, 2009

IOD2009 Day Two

Day two of the IBM Information on Demand conference was just as informative and exciting as day one. The day kicked off with a general session titled "A New Kind of Intelligence for a Smarter Planet." The idea presented is that the world is changing. It is becoming more instrumented, interconnected, and intelligent. Basically, as Steve Mills of IBM clarified, the ability to embed intelligence into millions of things will lead the transformation to an information led smarter planet. And that this information-led transformation will create opportunities for organizations to strategically gain control of information and create a new kind of intelligence.

Expanded intelligence begins with sensors and metering, which is doable today because price points have become reasonable now. There are 1 billion transistors per human, today. And an estimated 2 billion people will soon be on the Internet. At the same time, we are moving toward one trillion connected objects (video cameras, GPS devices, healthcare instruments, and even livestock). And there are hundreds of satellites orbiting the Earth, generating terabytes of data each and every day.

As we begin to intelligently interconnect these devices and analyze their streaming data, a transformative opportunity can results...

IBM brought up three customers to talk about how their organization were helping to transform to a smarter planet. The customers came from Cardinal Health, Statoil Hydro ASA (Norway), and the Food and Drug Administration. Highlights of the panel discussion:

Security in the supply chain for food is absolutely critical and food safety is one of the most complex systems to deal with.
The US imports 50pct of its food - from over 150 different countries.
Every food supplier to US (domestic or foreign) must be registered with the FDA.
And each supplier must complete forms as to the safety of its food. This can be difficult since suppliers range from large agri-businesses to small farms many of whom do not have a computer at all. So some records are probably kept on paper in shoe boxes.
Supply chain security is very important in the health care (drugs) and oil industries too! In fact, it was discussed how each of these seemingly disparate industries face many similar challenges.
Preventing problems requires understanding risk. And complexity requires collaboration in order to succeed becase nobody has enough resources to do it all alone.
In some areas (such as remote parts of Norway) instrumentation is essential to even get the data because nobody wants to go there.
Legacy systems often are rich sources of information (hey, that probably means mainframes!)
Analytics are required for prevention of problems, but also aid in reaction to problems that actually occur too. No one prevents 100 percent of their problems.

After the panel IBM came back and wrapped it up. They mentioned how IBM was awarded the National Medal of Honor for their Blue Gene supercomputer for DNA sequencing. It was a very informative and entertaining general session.

I then attended the DB2 9.7 versus Oracle 11g Smackdown presentation. It was chock full of statistics on why IBM's DB2 is superior to Oracle in terms of cost. The presenter explained how DB2 outperforms Oracle on TPC-C benchmarks for the same test, on the same machine, at the same point in time. He cautioned folks to to read the small print details on all benchmark results... for example, if you are examining the cost of ownership double check to see whether the benchmark uses a term or full purchase license. Also, the cost of the database license depends a great deal on your ability to negotiate a discount (if the vendor will discount). And you also need to be aware of how the products are licensed. Some features are separately licensed for both DB2 and Oracle. The bottom line is that licensing can cause a more than 30 percentt swing in price performance results

But do people even believe these benchmarks any more? I don't think very many people put much stock in benchmark tests today.

The author frequently cited an independent study by ITG that compares the value proposition of DB2 9.7 versus Oracle Database 11g. You can read the study yourself at this link.

(Note to my regular z/OS readers: the previous discussion was all about DB2 LUW and not DB2 for z/OS).

I also got to attend a special briefing for bloggers on IBM's new stream computing solution, which I blogged about on my Data Management Today blog if you are interested.

And finally, I'll plug my Twitter feed again. You can follow my tweets at IOD this week (well, thru Wednesday) by following me on Twitter at http://www.twitter.com/craigmullins.

Monday, October 26, 2009

IBM IOD2009 Day One

It is Monday, October 26, 2009, and the annual IBM Information on Demand (IOD2009) conference is officially underway. Well, actually it kicked off with a bang on Sunday. The exhibition hall opened at 6:00 pm and the early goers traipsed through the vendor hall sharing stories, checking out the vnedor's wares, and looking for the latest tschotskes (the favorites seem to be a mini-book light being given out by SPSS and the nifty DB2 t-shirts being given out by SEGUS, Inc.).

But the event really does not get started (at least as far as I'm concerned) until the big Monday morning kickoff, which this year was emceed by Terry Fator (the ventriloquist impersonator who won America's Got Talent) last year. He did a very nice job, and his mimicry and ventriloquism skills are great... but I think ventriloquism works better in a more intimate setting.

Before Terry came on an IBMer shared several tidbits of data from research and from polls they have been taking at IOD. For example, did you know that 15 petabytes of data are created every day? Regarding the even, they shared that there are more overall attendees here than there were at last year's event (6,000+) and that the average distance folks traveled to attend #IOD2009 is 2500 miles. You can actually participate in the daily IOD polls at http://iodpoll.com.

In between the entertainment respites, we got to hear from several IBM folks, including Ambuj Goyal, Arvind Krishna, and Frank Kern. According to Mr. Goyal, IBM is building an information on demand software stack to deliver trusted information. This can solve a very real problem that organizations are experiencing today. You see, it seems that one in three business leaders frequently make critical decisions based on inaccurate or non-existing information. Business executives don't just need data, but trusted information that can be relied upon for analysis to make accurate business decisions.

Again, according to Goyal, he sees IBM's Information Agenda doing for trusted information what ERP did for enterprise resource processes. IBM will help move its customers from information projects to information-based architecture and operations, moving organizations from information on demand to information transformation.

Then Frank Kern hosted three IBM customers as he asked each of them to explain how they benefited from IBM's Information Management solutions. The representative from BlueCross BlueShield talked about their data warehouse with 54 millions records, which happens to be the largest healthcare data warehouse in the world. Their analytical data is made available thru a portal. Offering integrated healthcare details helps improve healthcare. And the Chevron representative spoke about how IBM has helped Chevron improve its supply chain operations using integrated information.

And Arvind Krishna, GM of the Information Management group at IBM, told about IBM's on-going investments in research and development for IOD. So far, IBM has invested over $12 billion in R+D and acquisitions.

IBM also unveiled some announcements today at IOD including several new offerings to provide organizations with analytics capabilities to make better use of their archived information and improve business processes. These offerings are part of IBM's unified archiving strategy called IBM Smart Archive. Most intriguing to me is the preview of a SaaS offering called IBM Information Archive Cloud Services. The entire press release can be read here, if you're interested in the topic.

I also attended a presentation on DB2 X which was quite interesting, too. Delivered by Jeff Josten, there was much good news for mainframe DB2 folks including improved performance, temporal support, improved utility operations, hashed data and access paths, and more. I tweeted about this quite extensively on Twitter.

In fact, if you want to follow my tweets all week (well, thru Wednesday) be sure to follow me on Twitter at http://www.twitter.com/craigmullins... even when I'm not at a conference like IOD, I regularly tweet on DB2, database, and information management topics.

Monday, October 19, 2009

Hear Ye, Hear Ye, Learn All About DB2 X for z/OS

IBM is hosting a free webinar on November 3, 2009, offering a technical preview of the next version of DB2 for z/OS, currently known as DB2 X (but we all know, err, ah, think it will be called DB2 10).

If you work with DB2 on the mainframe you will want to set aside some time to attend this informative DB2 X webinar. It will have information for DBAs, as well as for Managers, Application Architects, Developers, System Administrators, System Programmers, and System Architects... and that is just about anyone who works with DB2.

The speaker will be IBM Distinguished Engineer, Jeff Josten.

How do you participate, you may be asking? Well, that is easy. Simply click on over to DB2 X for z/OS Technical Preview and register yourself.

The webinar will be held on November 3, 2009 at 11:00 a.m. Eastern Standard Time, 4:00 p.m. UTC.

Monday, October 12, 2009

On The Importance of Choosing the Correct Data Type

Most DBAs have grappled with the pros and cons of choosing one data type versus another. Sometimes the decision is easy, whereas sometimes the decision is a little bit more difficult. In today's blog entry, we'll discuss both situations.

Data type and length are the most fundamental integrity constraints applied to data in a database. Simply by specifying the data type for each column when a table is created, DB2 automatically ensures that only the correct type of data is stored in that column. Processes that attempt to insert or update the data to a non-conforming value will be rejected. Furthermore, a maximum length is assigned to the column to prohibit larger values from being stored in the table.

The DBA must choose the data type and length of each column wisely. The general rule of thumb for choosing a data type for the columns in your tables is to choose the data type that most closely matches the domain of correct values for the column. That means you should try to adhere to the following rules:

If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL data types. DECFLOAT and FLOAT are also options for very large numbers.
If the data is character, use CHAR or VARCHAR data types.
If the data is date and time, use DATE, TIME, and TIMESTAMP data types.
If the data is multimedia, use GRAPHIC, VARGRAPHIC, BLOB, CLOB, or DBCLOB data types.

But, of course, there are always exceptions. For example, what about social security number? That is a numeric column, but you won't be performing calculations using it. And, to format it correctly requires hyphens, which cannot be stored in a numeric data type. So, perhaps, SSN is an exception.

Leading Zeroes

Let's consider another typical situation: the desire to display leading zeroes. Maybe the application requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, the users want the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks of doing this?

Well, there are drawbacks! Without proper edit checks, inserts and updates could place invalid alphabetic characters into the product code. This can be a very valid concern if ad hoc data modifications are permitted. This is rare in production databases, but data problems can still occur if the proper edit checks are not coded into every program that can modify the data. But let's assume that proper edit checks are coded and will never be bypassed. This removes the data integrity question (yeah, right!).

There is another problem that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 374 or 1,874,161 possible values.

A SMALLINT column can range from -32,768 to 32,767 producing 65,536 possible small integer values. The drawback here is that negative or 5 digit product codes could be entered. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well. And we can always code a simple CHECK constraint to ensure that the code is always greater than zero.

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT over the CHAR(4) definition.

The leading zeroes problem might be able to be solved using other methods. If you are using a report writer, most of them have quick methods of displaying leading zeroes. When using QMF, you can ensure that leading zeroes are shown by using the "J" edit code. Report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields. Ad hoc access through other reporting tools typically provide a parameter that can enable leading zeroes to be displayed.

In general, it is wise to choose a data type which is closest to the domain for the column. IF the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. In addition, always be sure to code appropriate edit checks to ensure data integrity.

A DATE with DB2

What about dates? In my opinion, you should always use a DATE data type for date data. Why anyone would ever store a date or time in a DB2 table any other format than DATE, TIME, or TIMESTAMP is beyond me. But, oh, they do it all the time. And it causes all sorts of headaches. The benefits of using the proper DB2 data type are many, including:

Ensuring data integrity because DB2 will ensure that only valid date and time values are stored
The ability to use date and time arithmetic
A vast array of built-in functions to operate on and transform date and time values
Multiple formatting choices

Additionally, you have multiple built-in functions at your disposal for manipulating date and time data, but only when the data is stored as DATE, TIME, and TIMESTAMP data types. The time-related DB2 functions include CHAR, DATE, DAY, DAYOFMONTH, DAYOFWEEK, DAYOFYEAR, DAYS, HOUR, JULIAN_DAY, MICROSECOND, MIDNIGHT_SECONDS, MINUTE, MONTH, QUARTER, SECOND, TIME, TIMESTAMP, WEEK, WEEK_ISO, and YEAR.

I get questions all the time from folks who've stored dates using character data types; they ask all kinds of integrity and manipulation questions and I always, always, always start my reply with "You shouldn't have done that!"... and then I try to help them out, if I can. Don't fall into the trap of storing dates using anything but a DATE data type... you'll thank me later.

Summary

There is a lot more that can be said about choosing appropriate data types, but I think we've covered enough for today. The bottom line on data types is to use them to protect the integrity of your DB2 data and to simplify your job by taking advantage of DB2’s built-in capabilities. By choosing that data type that most closely matches your data you will be doing yourself, your systems, and your users a big favor.

Friday, October 02, 2009

IDUG Europe is Right Around the Corner

Just a quick post today to remind everybody that the annual European IDUG conference will be held next week (the week of October 5, 2009) in Rome, Italy. And it is not too late to ensure that you will be there to hear the latest and greatest news, tips, tricks, and guidelines on our favorite DBMS - IBM's DB2!

For those of you not lucky enough to be there keep an eye on my DB2portal blog here where I will attempt to summarize the key events of the week.

And if you are a Twitter aficionado, be sure to follow me on Twitter as I will try to make regular Tweets about the event (as long as my Blackberry works in Rome).

Thursday, September 24, 2009

Limiting the Number of Rows Fetched

Application developers frequently need to retrieve a limited number of qualifying rows from a table. For example, maybe you need to list the top ten best selling items from inventory, or a list of the top five most expensive products (i.e., highest price tag). There are several ways to accomplish this prior to DB2 V7 using SQL, but they are not necessarily efficient.

The first reaction is to simply use the WHERE clause to eliminate non-qualifying rows. But this is simplistic, and often is not sufficient to produce the results desired in an optimal manner. What if the program only requires that the top ten results be returned? This can be a somewhat difficult request to formulate using SQL alone.

Consider, for example, an application that needs to retrieve only the top ten most highly paid employees from the EMP sample table. You could simply issue a SQL request that retrieves all of the employees in order by salary, but only use the first ten retrieved. That is easy; for example:

EMPNO, FIRSTNME, LASTNAME, SALARY
FROM DSN8710.EMP
ORDER BY SALARY DESC;

You must specify the ORDER BY clause with the DESC key word. This sorts the results into descending order, instead of the default, which is ascending. Without the DESC key word, the "top ten" would be at the very end of the results set, not at the beginning.

But that does not really satisfy the requirement - retrieving only the top ten. It merely sorts the results into descending sequence. So the results would still be all employees in the table, but in the correct order so you can view the "top ten" salaries very easily. The ideal solution should return only the top ten employees with the highest salary and not merely a sorted list of all employees.

You can code some "tricky" SQL to support this request for all versions of DB2, such as the following:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM DSN8710.EMP A
WHERE 10 > (SELECT COUNT(*)
FROM DSN8710.EMP B
WHERE A.SALARY < B.SALARY)
AND SALARY IS NOT NULL
ORDER BY SALARY DESC;

This SQL is portable from DB2 to other DBMSs, such as Oracle or SQL Server. And, of course, you can change the constant 10 to any number you wish, thereby retrieving the top 20, or top 5, as deemed necessary by the needs of your application.

Since the SALARY column is nullable in the EMP table, you must remove the nulls from the results set. And the ORDER BY is required to sort the results in the right order. If it is removed from the query, the results will still contain the top ten, but they will be in no particular order.

But DB2, as of V7, provides an easier and less complicated way to limit the results of a SELECT statement - the FIRST key word. You can code FETCH FIRST n ROWS which will limit the number of rows that are fetched and returned by a SELECT statement.

Additionally, you can specify a new clause -- FETCH FIRST ROW ONLY clause -- on SELECT INTO statements when the query can return more than one row in the answer set. Doing so informs DB2 to ignore the other rows.

There is one difference between the new V7 formulation and the other SELECT statement we reviewed, and that is the way "ties" are handled. A tie occurs when more than one row contains the same value. The previous query we examined may return more than 10 rows if there are multiple rows with the same value for price within the top ten.

Using the FIRST key word DB2 will limit the number of rows returned to ten, even if there are other rows with the same value for price as the number ten row in the results set. The needs of your application will dictate whether ties are to be ignored or included in the result set. If all "ties" need to be included in the results set, which would mean that more than 10 rows would be needed, the new V7 feature may not prove to be helpful.

And it is also important to note that as of DB2 9, you can include the FETCH FIRST clause in a subselect. ORDER BY is allowed in a subselect, too. The subselect MUST be enclosed in parentheses and the FETCH FIRST (or ORDER BY) cannot be in the outermost fullselect of a view, or in a materialized query table.

Friday, September 04, 2009

Dynamic SQL Causing Lock Escalation?

Lock escalation is the promotion of a lock from a row, page or LOB lock to a table space lock because the number of page locks that are concurrently held on a given resource exceeds a preset limit.

But lock escalation can cause problems. Yes, when fewer locks are taken, CPU cost and memory usage can be reduced. On the other hand, escalating the size of a lock causes more resources to be locked... and that impacts concurrency with the most likely result being applications experiencing lock timeouts or deadlocks.

When lock escalation occurs, DB2 writes messages to the system log. So when complaints inevitably arrive about failing transactions you can look for the appropriate lock escalation message indicating that it could be the culprit.

But such simple steps are typically not sufficient to track down the root cause and take corrective actions. You need to be able to determine what SQL statement in what program and under what circumstances is causing the escalation. Doing so involves tracing and using performance monitoring tools.

For those interested in this topic, (indeed, the primary purpose of this blog post) you should seek out the IBM TechDoc titled Identifying the dynamic SQL statement which is causing a
lock escalation in DB2 for z/OS. For those who don't know what a TechDoc is, don't worry, it isn't that complicated. It is basically a published piece of technical documentation that isn't part of a larger manual or IBM RedBook. They can be short papers, product flashes, presentations, etc.

This particular TechDoc details a methodology for identifying dynamic SQL statements involved in a lock escalation. It describes which traces to start, and which jobs to run to analyze the collected traces. It also explains how to analyze the trace report to find the problematic SQL statement.

You can search all of IBM's TechDocs by following this link.

Happy reading!

Friday, August 21, 2009

Approaches to Access Path Management

BIND and REBIND are important components in assuring efficient DB2 applications. Because the
BIND/REBIND process determines exactly how your DB2 data is accessed, it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as your data and systems change. This approach involves some form of regular maintenance that keeps DB2 statistics up to date and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

The final approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world we would be to be able to review the access path changes beforehand to determine if they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third party tools that can help you achieve this though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the three Rs. To implement this approach you:

Regularly reorganize the data to ensure that it is optimally structured.
Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS.

At any rate, your goal should be to keep your access paths up-to-date with the current state of your data. Failing to do this means that DB2 is accessing data based upon false assumptions. DB2 is unlikely to make the same access path choice as your data grows – and as patterns within the data change.

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or Real-Time Statistics (RTS). So, we need to add another R – in other words:

RUNSTATS or better yet, RTS
REORG
RUNSTATS
REBIND

Now it is true that some folks don’t rely on statistics to schedule a REORG. Instead, they just build the JCL to REORG their database objects when they create the object. So they create a table space then build the REORG job and schedule it to run monthly, or quarterly, or on some regular basis. This is better than no REORG at all, but it is probably not the best approach because you are most likely either reorganizing too soon (in which case you waste the CPU cycles to do the REORG) or you are reorganizing too late (in which case performance is suffering for a period of time before the REORG runs). Better to base your REORGs off of statistics and thresholds using either RUNSTATS or RTS.

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer doesn’t have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive.

There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next we need to decide when to REBIND? The best answer for this is when statistics have changed significantly enough to change access paths. When we know that data has significantly changed it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or even better yet, access paths) coming up with an answer in a manual way can be time-consuming and error-prone – especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four R’s approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four R’s becomes the Five R’s as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:

Start RTS (or a RUNSTATS) to determine when to REORG.
Reorganize the table spaces (and/or indexes)
After reorganizing, run RUNSTATS again,
Follow that with the REBINDs.
Then we need that fifth R – which is to Review the access paths generated by the REBIND.

The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only ring you up when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results comparing the before and after impact of the optimizer’s choices.

Thursday, July 30, 2009

Discount on IBM Information on Demand Conference

The IBM Information on Demand (IOD, for short) conference is rapidly approaching. The conference will be held in Las Vegas, Nevada the week of October 25 – 30, 2009 at the Mandalay Bay casino and hotel.

The IOD conference is IBM’s signature event for their data and information management product lines. By attending IOD 2009 you can gain unique perspectives from IBM experts, technical leaders and visionaries as well as peers in your industry. Many of the developers of IBM’s offerings, such as DB2, Informix, and IMS, will be delivering educational sessions at IOD. And over 200 customers will share real-world experiences detailing how they have unlocked the value of their information and realized tangible and immediate return on investment.

OK, so what about that discount? IBM has been kind enough to provide me with a discount code that I can share with my readers. You will need to provide the code, G09MULL, when your register for the conference and you will get a $100 discount off of your registration!

Here’s how to register on-line:

Visit the IBM Information On Demand 2009 global conference website and select 'Register now' at: www.ibm.com/events/informationondemand

An IBM userid and Password are required to register. If you have an existing IBM.com account ID, please select 'Register for the Conference' and sign on. If you do not have an IBM.com account ID, please follow the link 'MY IBM ID' to obtain one. Complete the form and click 'SUBMIT'. You will then be allowed to continue as a registered IBM user. Click 'Continue' and select 'Register for the Conference' and sign on with your new userid and password.

Select registration type of 'Customer' and click 'Continue'

Complete the appropriate fields on the enrollment form

Under the 'Promotion Code Information' section, enter PROMOTION CODE G09MULL. You must enter this code to receive your $100 discount.

Complete the form with attendee, arrival and payment information clicking 'Continue' at the bottom of each section.

Click 'Submit Registration'

Note: Credit card information is required to guarantee your registration regardless of your method of payment and/or discount amount.

That’s all there is to it. And you can use the promotion code multiple times for all of the folks at you company that will be attending the Information On Demand conference in 2009.

Cheers!

Friday, July 24, 2009

40 years of CICS in 40 seconds

Monday, July 06, 2009

The "Dirty" Read (AKA Uncommitted Read)

Anyone accustomed to application programming when access to a database is required understands the potential for concurrency problems. When one application program attempts to read data that is in the process of being changed by another, the DBMS must forbid access until the modification is complete in order to ensure data integrity. Most DBMS products, DB2 included, use a locking mechanism for all data items being changed. Therefore, when one task is updating data on a page, another task can not access data (read or update) on that same page until the data modification is complete and committed.

Programs that read DB2 data typically access numerous rows during their execution and are thus quite susceptible to concurrency problems. Since V4, DB2 has provided read-through locks, also know as “dirty read” or “uncommitted read,” to help overcome concurrency problems. When using an uncommitted reads an application program can read data that has been changed, but is not yet committed.

Dirty read capability is implemented using a new isolation level, UR, for uncommitted read. This adds to the current isolation levels of RR (repeatable read) and CS (cursor stability). If the application program is using the UR isolation level, it will read data without taking locks. This enables the application program to read data contained in the table as it is being manipulated.

Consider the following sequence of events:

(1) At 9:00 AM, the a transaction is executed containing the following SQL to change a specific value.


   UPDATE EMP
     SET  FIRST_NAME = “MICHELLE”
   WHERE  EMPNO = 10020;

The transaction is a long-running one and continues to execute without issuing a COMMIT.

(2) At 9:01 AM, a second transaction attempts to SELECT the data that was changed, but not committed.

If the UR isolation level were specified for the second transaction, it would read the changed data even though it had yet to be committed. Obviously, if the program need not wait to take a lock and merely reads the data in whatever state it happens to be at that moment, the program will execute faster than if it had to wait for locks to be taken and resources to be freed before processing.

However, the implications of reading uncommitted data must be carefully examined before being implemented. Several types of problems can occur. A dirty read can cause duplicate rows to be returned where none exist. Alternately, a dirty read can cause no rows to be returned when one (or more) actually exists. Additionally, an ORDER BY clause does not guarantee that rows will be returned in order if the UR isolation level is used. Obviously, these problems must be taken into consideration before using the UR isolation level.

UR Isolation Requirements

The UR isolation level applies to read-only operations: SELECT, SELECT INTO, and FETCH from a read-only result table.
Any application plan or package bound with an isolation level of UR will use uncommitted read functionality for any read-only SQL. Operations that are contained in the same plan or package and are not read-only will use an isolation level of CS.
The isolation level that is defined at the plan or package level during BIND or REBIND can be overridden as desired for each SQL statement in the program. The WITH clause can be used to specify the isolation level for any individual SQL statement. For example:


      SELECT EMPNO, LAST_NAME
      FROM EMP
      WITH UR;

The WITH clause is used to allow an isolation level to be used on a statement-by-statement basis. The UR isolation level can be used only with read-only SQL statements. This includes read-only cursors and SELECT INTO statements.

Benefits and Drawbacks

So when is it a good idea to implement dirty reads using the UR isolation level? The general rule of thumb is to avoid dirty reads whenever the results must be 100 percent accurate. Examples of this would be when:

calculations that must balance are being performed on the selected data
data is being retrieved from one source to insert to or update another
production, mission-critical work is being performed that can not contain or cause data integrity problems

Truthfully, most DB2 applications are not usually candidates for dirty reads. However, there are a few specific situations in which the dirty read capability will be of major benefit. Consider the following cases in which the UR isolation level could prove to be useful:

Access is required to a reference, code, or look-up table that is basically static in nature. Due to the non-volatile nature of the data, a dirty read would be no different than a normal read the majority of the time. In those cases when the code data is being modified, any application reading the data would incur minimum, if any, problems.
Statistical processing must be performed on a large amount of data. For example, your company may wish to determine the average age of female employees within a certain pay range. The impact of an uncommitted read on an average of multiple rows will be minimal because a single value changed will not greatly impact the result.
Dirty read can prove invaluable in a data warehousing environment that uses DB2 as the DBMS. A data warehouse is a time sensitive, subject-oriented, store of business data that is used for on-line analytical processing. Other than periodic data propagation and/or replication, access to the data warehouse is read only. An uncommitted read can be perfect in a read only environment since it can cause little damage because the data is generally not changing. More and more data warehouse projects are being implemented in corporations worldwide and DB2 with dirty read capability may be a very wise choice for data warehouse implementation.
In those rare cases when a table, or set of tables, is used by a single user only, UR can make a lot of sense. If only one individual can be modifying the data, the application programs can be coded such that all (or most) reads are done using UR isolation level, and the data will still be accurate.
Finally, if the data being accessed is already inconsistent little harm can be done using a dirty read to access the information.

Administration

Because of the data integrity issues associated with dirty reads, it is a good idea for DBAs to keep track of the plans and packages that specify an isolation level of UR. This information can be found in the DB2 catalog. The following two queries can be used to find the applications using uncommitted reads.

Issue the following SQL for a listing of plans that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:


SELECT  DISTINCT S.PLNAME
FROM    SYSIBM.SYSPLAN P,
        SYSIBM.SYSSTMT S
WHERE   (P.NAME = S.PLNAME AND
         P.ISOLATION = ‘U’
        )
        OR S.ISOLATION = ‘U’
ORDER BY S.PLNAME;

Issue the following SQL for a listing of packages that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:


SELECT  DISTINCT P.COLLID, P.NAME, P.VERSION
FROM    SYSIBM.SYSPACKAGE   P,
        SYSIBM.SYSPACKSTMT  S
WHERE   (P.LOCATION = S.LOCATION AND
         P.LOCATION = ‘ ‘        AND
         P.COLLID = S.COLLID     AND
         P.NAME = S.NAME         AND
         P.VERSION = S.VERSION   AND
         P.ISOLATION = ‘U’
        )
        OR S.ISOLATION = ‘U’
ORDER BY S.COLLID, S.NAME, S.VERSION;

Synopsis

The dirty read capability can provide relief to concurrency problems and deliver faster performance in very specific situations. Be certain to understand the implications of the UR isolation level and the “problems” it can cause before diving headlong into implementing it in your production applications.

Monday, June 22, 2009

Know Your ISOLATION Levels

Did you know that DB2 provides a way to change the way that a program or SQL statement acquires locks? That way is known as the isolation level and it can be set to specify the locking behavior for a transaction or statement. Standard SQL defines four isolation levels that can be set using the SET TRANSACTION ISOLATION LEVEL statement:

Serializable
Repeatable read
Read committed
Read uncommitted

The isolation level determines the mode of page or row locking implemented by the program as it runs.

DB2 supports a variation of the standard isolation levels. DB2 implements page and row locking at the program execution level, which means that all page or row locks are acquired as needed during the program run. Page and row locks are released as the program run depending on the isolation level.

In DB2 you can specify the following four isolation levels:

cursor stability (CS),
repeatable read (RR),
read stability (RS), and
uncommitted read (UR).

Using the ISOLATION parameter of the BIND command you can set the isolation level of a package or plan. You also can use the WITH parameter on a SELECT statement to set the isolation level of a single SQL statement.

Cursor stability is the DB2 implementation of the SQL standard read committed isolation level. CS is perhaps the most common DB2 isolation level in use in production applications because it offers a good tradeoff between data integrity and concurrency. When CS is specified the transaction will never read data that is not yet committed; only committed data can be read.

A higher level of integrity is provided with repeatable read. Under an RR isolation level all page locks are held until they are released by a COMMIT (or ROLLBACK), whereas with CS read-only page locks are released as soon as another page is accessed. Repeatable read is the default isolation level if none is specified at BIND time.

An RR page locking strategy is useful when an application program requires consistency in rows that may be accessed twice in one execution of the program, or when an application program requires data integrity that cannot be achieved with CS.

For example of a good reason to use RR page locking, consider a reporting program that scans a table to produce a detail report, and then scans it again to produce a summarized managerial report. If the program is bound using CS, the results of the first report might not match the results of the second.

Suppose that you are reporting the estimated completion dates for project activities. The first report lists every project and the estimated completion date. The second, managerial report lists only the projects with a completion date greater than one year.

The first report indicates that two activities are scheduled for more than one year. After the first report but before the second, however, an update occurs. A manager realizes that she underestimated the resources required for a project. She invokes a transaction to change the estimated completion date of one of her project's activities from 8 months to 14 months. The second report is produced by the same program, but reports 3 activities.

If the program used an RR isolation level rather than CS, an UPDATE that occurs after the production of the first report but before the second would not have been allowed. The program would have maintained the locks it held from the generation of the first report and the updater would be locked out until the locks were released.

How about another example? Consider a program that is looking for pertinent information about employees in the information center and software support departments who make more than $30,000 in base salary. In the DB2 sample tables department 'C01' is the information center and department 'E21' is software support.

The program opens a cursor based on the following SELECT statement:


SELECT  EMPNO, FIRSTNME, LASTNAME,
        WORKDEPT, SALARY
FROM    DSN8710.EMP
WHERE   WORKDEPT IN ('C01', 'E21')
AND     SALARY > 30000;

The program then begins to FETCH employee rows. Assume further, as would probably be the case, that the statement uses the XEMP2 index on the WORKDEPT column. An update program that implements employee modifications is running concurrently. The program handles transfers by moving employees from one department to another, and implements raises by increasing the salary.

Assume that Sally Kwan, one of your employees, has just been transferred from the information center to software support. Assume further that another information center employee, Heather Nicholls, received a 10 percent raise. The update program running concurrently with the report program implements both of these modifications.

If the report program were bound with an isolation level of CS, the second program could move Sally from 'C01' to 'E21' after she was reported to be in department 'C01' but before the entire report was finished. Thus, she could be reported twice: once as an information center employee and again as a software support employee. Although this circumstance is rare, it can happen with programs that use cursor stability. If the program were bound instead with RR, this problem could not happen. The update program probably would not be allowed to run concurrently with a reporting program, however, because it would experience too many locking problems.

Now consider Heather's dilemma. The raise increases her salary 10 percent, from $28,420 to $31,262. Her salary now fits the parameters specified in the WHERE condition of the SQL statement. Will she be reported? It depends on whether the update occurs before or after the row has been retrieved by the index scan, which is clearly a tenuous situation. Once again, RR avoids this problem.

You might be wondering, "If CS has the potential to cause so many problems, why is it used so ubiquitously? Why not trade the performance and concurrency gain of CS for the integrity of RR?"

The answer is simple: the types of problems outlined are rare. The expense of using RR, however, can be substantial in terms of concurrency. So the tradeoff between the concurrency expense of RR and the efficiency of CS usually is not a sound one.

The third isolation level provided by DB2 is read stability (RS). Read stability is similar in functionality to the RR isolation level, but a little less. A retrieved row or page is locked until the end of the unit of work; no other program can modify the data until the unit of work is complete, but other processes can insert values that might be read by your application if it accesses the row a second time.

Consider using read stability over repeatable read only when your program can handle retrieving a different set of rows each time a cursor or singleton SELECT is issued. If using read stability, be sure your application is not dependent on having the same number of rows returned each time.

Finally, we come to the last, and most maligned isolation level, uncommitted read (UR). The UR isolation level provides read-through locks, also know as dirty read or read uncommitted. Using UR can help to overcome concurrency problems. When you're using an uncommitted read, an application program can read data that has been changed but is not yet committed.

UR can be a performance booster, too, because application programs bound using the UR isolation level will read data without taking locks. This way, the application program can read data contained in the table as it is being manipulated. Consider the following sequence of events:

1. To change a specific value, at 9:00 a.m. a transaction containing the following SQL is executed:


 UPDATE EMP
 SET FIRSTNME = "MICHELLE"
 WHERE EMPNO = 10020;

The transaction is a long-running one and continues to execute without issuing a COMMIT.

2. At 9:01 a.m., a second transaction attempts to SELECT the data that was changed, but not committed.

If the UR isolation level were used for the second transaction, it would read the changed data even though it had yet to be committed. Obviously, if the program doesn't need to wait to take a lock and merely reads the data in whatever state it happens to be at that moment, the program will execute faster than if it had to wait for locks to be taken and resources to be freed before processing.

The implications of reading uncommitted data, however, must be carefully examined before being implemented. Several types of problems can occur. Using the previous example, if the long-running transaction rolled back the UPDATE to EMPNO 10020, the program using dirty reads may have picked up the wrong name ("MICHELLE") because it was never committed to the database.

Inaccurate data values are not the only problems that can be caused by using UR. A dirty read can cause duplicate rows to be returned where none exist. Alternatively, a dirty read can cause no rows to be returned when one (or more) actually exists. Additionally, an ORDER BY clause does not guarantee that rows will be returned in order if the UR isolation level is used. Obviously, these problems must be taken into consideration before using the UR isolation level.

Keep in mind, too, that the UR isolation level applies to read-only operations: SELECT, SELECT INTO, and FETCH from a read-only result table. Any application plan or package bound with an isolation level of UR will use uncommitted read functionality for any read-only SQL. Operations contained in the same plan or package and are not read-only will use an isolation level of CS.

When is it appropriate to use UR isolation? The general rule of thumb is to avoid UR whenever the results must be 100 percent accurate. Following are examples of when this would be true:

Calculations that must balance are being performed on the selected data
Data is being retrieved from one source to insert to or update another
Production, mission-critical work is being performed that cannot contain
or cause data integrity problems

In general, most DB2 applications are not serious candidates for dirty reads. In a few specific situations, however, the dirty read capability will be of major benefit. Consider the following cases in which the UR isolation level could prove to be useful:

Access is required to a reference, code, or look-up table that basically is static in nature. Due to the non-volatile nature of the data, a dirty read would be no different than a normal read the majority of the time. In those cases when the code data is being modified, any application reading the data would incur minimum, if any, problems.
Statistical processing must be performed on a large amount of data. Your company, for example, might want to determine the average age of female employees within a certain pay range. The impact of an uncommitted read on an average of multiple rows will be minimal because a single value changed will not greatly impact the result.
Dirty reads can prove invaluable in a data warehousing environment that uses DB2 as the DBMS. A data warehouse is a time-sensitive, subject-oriented, store of business data that is used for online analytical processing. Other than periodic data propagation and/or replication, access to the data warehouse is read-only. Because the data is generally not changing, an uncommitted read is perfect in a read-only environment due to the fact that it can cause little damage. More data warehouse projects are being implemented in corporations worldwide and DB2 with dirty read capability is a very wise choice for data warehouse implementation.
In those rare cases when a table, or set of tables, is used by a single user only, UR can make a lot of sense. If only one individual can be modifying the data, the application programs can be coded such that all (or most) reads are done using UR isolation level, and the data will still be accurate.
Finally, if the data being accessed already is inconsistent, little harm can be done using a dirty read to access the information.

Although the dirty read capability can provide relief to concurrency problems and deliver faster performance in specific situations, it also can cause data integrity problems and inaccurate results. Be sure to understand the implications of the UR isolation level and the problems it can cause before diving headlong into implementing it in your production applications.

Summary

It is important for DB2 DBAs and application programmers to know the four isolation levels and their impact on SQL. Using the isolation levels is an effective way to control concurrency and locking for your DB2 applications.

Friday, June 12, 2009

The Mainframe is Number One!

Congratulations to the mainframe, just cited at #1 in the InfoWorld Hardware Hall of Fame.

Okay, so maybe they are listing the hardware chronologically, but I refuse to acknowledge that. The IBM System/360 mainframe is listed first -- and that means number one, right?

Actually, the entire list is interesting and entertaining, so take a moment to click over and review the InfoWorld Hardware Hall of Fame, which is littered with titans from the past (DEC, Compaq) as well as current offerings.

Thursday, May 28, 2009

Hasn't DB2 9 for z/OS Been GA for 2 Years Now?

Just a quick note to question the entire DB2 community about embracing the latest (I hesitate to call a 2 year old product "new") version of DB2. Isn't it about time that most of the DB2 base moved to DB2 V9?

Why would people not be moving? Well, there is always the cost and time involved in migrating to a new version. Perhaps organizations are not willing to expend the resources needed to migrate as frequently as they did in the past.

Perhaps some organizations are waiting for others to shake the bugs out of a new release. But this is troublesome. The more organizations that follow that approach, the longer it takes for those that do adopt the latest version to fully test the software. Which means that waiting might not help.

There is also the possibility that some organizations do not find a compelling need for the new functionality offered by V9. Even if that is the case, most organizations still should want the performance gains to be had by upgrading. But to fully take advantage of that requires rebinding, which also takes time and resources to do properly.

And then there is the issue of support for new versions. Organizations need their DB2 tool providers (ISVs and IBM) to not only support, but exploit the new features of the "new" version. Remember a few years ago when IBM entered the DB2 tools market in a big way? One of their stated objectives was to provide day one support for new versions. I wonder what happened to that objective? It is years after V9 GA and many of the IBM tools (according to IBM's own web site) only offer a "very limited number of new functions and features of DB2 9."

So is your shop at DB2 V9 yet? If not, why? If so, howzitgoing?

Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one. Certain types of database maintenance can cause such problems and DBAs need to be able to recognize the problem, and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers. Many DBMSs use pointers to store very large objects containing text and image data. These can become corrupted.In today's modern database systems, structural integrity is rare -- much rarer than it used to be.

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way to making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8) how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE then the DBMS would take care of it -- and more of the data would be likely to be correct.

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitutes a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Tuesday, May 05, 2009

Approaches to Access Path Management... or The Five R's

BIND and REBIND are important components in assuring efficient DB2 applications. Because the BIND/REBIND process determines exactly how your DB2 data is accessed it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as the data changes. This approach involves some form of regular maintenance that keeps DB2 statistics up to date (or better yet, uses Real Time Statistics) and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

Another (unfortunately) popular approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world we would be to be able to review the access path changes beforehand to determine if they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third party tools that can help you achieve this though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the three Rs. To implement this approach you:

Regularly REORGanize the data to ensure that it is optimally structured.
Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS.

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or RTS. So, perhaps it should be at least 4 R’s – in other words:

RUNSTATS or RTS
REORG
RUNSTATS
REBIND

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer doesn’t have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive. There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next we need to decide when to REBIND? The best answer for this is when statistics have changed significantly enough to change access paths. When we know that data has significantly changed it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or even better yet, access paths) coming up with an answer in a manual way can be time-consuming and error-prone – especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four R’s approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four R’s becomes the Five R’s as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:

Start with a RTS (or use RUNSTATS) to determine when to REORG.
REORGanize the table spaces (and indexes)
After reorganizing, run RUNSTATS (to ensure the DB2 Catalog is up-to-date)
Follow that with the REBINDs.
Then we need that fifth R – which is to review the access paths generated by the REBIND.

The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only dial your numbers when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results comparing the before and after impact of the optimizer’s choices. Again, there are tools that can help to automate this review process.

Tuesday, April 28, 2009

Basic DB2 Buffering and Memory Guidelines

One of the most important areas for tuning DB2 subsystem performance is memory usage. DB2 for z/OS uses memory for buffer pools, the EDM pool, RID pool and sort pools to cache data and structures in memory. The better memory is allocated to these structures, the better DB2 will perform.

When allocating DB2 buffer pools, keep these rules of thumb in mind:

Don't allocate everything to a single buffer pool (e.g., BP0); use a multiple buffer pool strategy.

Explicitly specify a buffer pool for every table space and index.

Isolate the DB2 Catalog in BP0; put user and application DB2 objects into other buffer pools.

Consider separating indexes from table spaces with each in their own dedicated buffer pools.

Consider isolating heavily hit data into its own buffer pool to better control performance.

Consider isolating sorts into a single buffer pool and tuning for mostly sequential access (e.g. BP7).

Consider separating DB2 objects into separate buffer pools that have been configured for sequential verses random access.

Forget about trying to follow a cookie-cutter approach to buffer pool management. Every shop must create and optimize a buffer pool strategy for its own data and application mix. DB2 offers the following buffer pool tuning "knobs" that can be used to configure virutal buffer pools to the type of processing they support:

DWQT –this value is the deferred write threshold; it is expressed as a percentage of the virtual buffer pool that might be occupied by unavailable pages. When this threshold is reached DB2 will start to schedule write I/Os to externalize data. The default is 50%, which is likely to be too high for most shops.

VDWQT – this value is the vertical deferred write threshold; it is basically the same as DWQT, but for individual data sets. The default is 10%, which once again is quite likely to be too high for many shops.

VPSEQT – this value is the sequential steal threshold; it is a expressed as a percentage of the virtual buffer pool that can be occupied by sequentially accessed pages. Tune buffer pools for sequential access such as scans and sorting by modifying VPSEQT to a larger value. The default is 80%.

VPPSEQT – this value is the sequential steal threshold for parallel operations; the default value is 50%.

VPXPSEQT – this value is assisting parallel sequential threshold; it is basically the VPPSEQT for opertaions from another DB2 subsystem in the data sharing group.

These parameters can be changed using the ALTER BUFFERPOOL command. Additionally, hiperpools can be created to back up DB2 virtual buffer pools with additional memory. DB2 provides several tuning knobs for hiperpools, too, including HPSIZE to adjust the size of hiperpools and HPSEQT to adjust the hiperpool sequential steal threshold.

With the advent of DB2 V8, we will have more memory at our disposal for DB2's use. This next version of DB2 will be able to surmount the limitation of 2GB real storage that was imposed due to S/390's 31-bit addressing. Theoretically, with 64-bit addressing DB2 could have up to 16 exabytes of virtual storage addressability to be used by a single DB2 address space. Now there is some room for growth!

In addition to buffer pools, DB2 uses memory for the EDM pool. The EDM pool is used for caching internal structures used by DB2 programs. This includes DBDs, SKCTs, CTs, SKPTs, and PTs. It also includes the authorization cache for plans and packages, as well as the cache for dynamic SQL mini-plans. As a general rule of thumb, shoot for an 80 percent hit rate with the EDM pool; this means that only one out every five times should a structure need to be loaded from disk into the EDM pool.

Finally, remember that buffer and EDM pool tuning are in-depth subjects that cannot be adequately covered in a high-level tip such as this. So, study those IBM DB2 manuals - and learn by doing. Additionally, there is much more to proper DB2 system performance tuning than memory tuning. Other system elements requiring attention include allied agent setup (CICS, TSO, etc.), network configuration, locking, logging, and Parallel Sysplex configuration and management for DB2 data-sharing shops.

Thursday, April 16, 2009

Stages 3 and 4

All good DB2 developers and DBAs know about Stage 1 and Stage 2 predicates, right? But have you ever heard of Stage 3 and Stage 4 predicates? Well, you’re about to!

First of all, let’s do a quick review to catch those readers who don’t know what Stage 1 and 2 are. You may have heard about sargable and nonsargable, and if so, Stage 1 is sargable and Stage 2 is nonsargable. If not, don’t worry about those terms, they are obsolete.

A predicate that can be evaluated in the Data Manager (DM) component of DB2, that is at the earliest stage of query execution, is called a Stage 1 predcicate. Stage 2 predicates need to be passed up to the Relational Data System (RDS) to process. So Stage 1 predicates are more efficient than Stage 2 predicates because the Data Manager component of DB2 is at a level closer to the data than the Relational Data System. Stage 1 predicates, being evaluated earlier in the data retrieval process, avoid the overhead of passing data from component to component of DB2. For this reason, developers are encourage to use Stage 1 predicates rather than Stage 2 predicates to optimize performance.

What makes a predicate Stage 2 instead of Stage 1? Well, it is all in the type of predicate you code and how you write your SQL. There is a list of Stage 1 and Stage 2 predicates in Chapter 12 of the DB2 Performance and Tuning manual. (The same chart also tells you whether a predicates is indexable or not.) Whenever you move from one release of DB2 to another one of the first things you should do is consult this manual to see if any predicates have changed from Stage 2 to Stage 1… and you should make sure all of your developers have a copy of that chart taped to their cubicle wall!

OK, so what is all of this about Stage 3 and Stage 4, then? Well, it is a way of thinking about some bad SQL practices. Instead of coding a SQL predicate some programmers choose to bring all (or most) of the data into their program and then filter it using IF-THEN or CASE statements. You can think of these as Stage 3 predicates because it is one more place that the data must be passed to before it can be determined whether the data is needed.

Stage 4? That is when you use a black box (see the link for an explanation if you don't know what a black box is)... Instead of filtering the data in the DM or the RDS or even in your program, you have to work with another program altogether – the black box – to return the right data.

So just remember 1… 2… 3… 4… and that is the order of efficiency for those types of predicates. 1 is better than 2 is better than 3 is better than 4…

Monday, March 30, 2009

Today's Modern Mainframe

IBM mainframes process two thirds of the world’s business information every day, and have
been doing so for decades, because of the many strengths provided by the mainframe, such
as high availability, good redundancy, high performance I/O capabilities, and much more.

But have you ever been challenged by an anti-mainframer by a statement like "Oh, aren't mainframes dead?" - or - "Why would anyone still use a mainframe?" Have you wanted a ready response that is easily digestible and not overly techie? Well, IBM has a new Redguide book titled The IBM Mainframe Today: System z Strengths and Values that might be of help.

This publication describes the business drivers, their impact on IT, and how the System z platform can be a major player for the business solution by providing flexibility, responsiveness to customers and market needs, as well as cost effectiveness. It also discusses the mainframe hardware, software, and environment, with an emphasis on the features and capabilities that make the System z mainframe a good choice for businesses.

Also, you may have noticed the new term: Redguide. Similar to an IBM Redbook, an IBM Redguide is not quite as technical as a Redbook, but more in-depth than a brochure or other pieces of "sales" collateral.

So, if you are a mainframer like me, you'll want to download and read The IBM Mainframe Today: System z Strengths and Value. After all, the price is right (free). And you'll want to keep an eye out for additional Redguides from IBM. Susan Visser recently blogged about the first 15 Redguides here.

Thursday, March 26, 2009

Cost vs. Advantage of Moving From IMS DB to DB2

As my regular readers know, every now and then I like to share Q+A exchanges I've had with folks. Today, the question I was asked is as follows:

My customer is wondering about the possible advantages of converting his IMS DB/DC system to IMS DC/DB2. The application currently performs well with an internal response time of less than .5 seconds on average.

Even with an arrival rate of 425 full-function transactions per second, the queue count rarely goes above 10. This system typically peaks at 12.5 million transactions per twelve-hour day against HDAM and HIDAM databases totaling close to 1 terabyte. The application itself is currently a bit over seven million lines of code.

Can you comment on the relative cost vs. advantage of moving an existing application from IMS DB to DB2 along with relative CPU capacity requirements?

And here is my short response:

Well, the main advantages of converting from IMS/DB to DB2 would be to gain better support for ad hoc queries, standard SQL (instead of non-standard DL/1) for writing queries and a deeper pool of talent to support the DB2 environment (there are more DB2 folks out there than IMS folks now-a-days).

The benefit of sticking with IMS is the good performance you currently enjoy as well as no need to convert the database structures or the 7 million lines of application code. Converting database structures is not horribly difficult, but there are some gotchas that can arise. The bigger problem is converting all of those DL/1 calls to appropriate SQL. This will not be a simple 1 to 1 conversion and it will very likely be quite time-consuming.

I guess it boils down to this: How happy are you with the current application, are you able to support it properly and how many other IMS/DB databases do you support? If this is the last IMS/DB database and you are looking to rid yourself of the IMS license, then it might make sense to convert. But you should do a project plan and cost/benefit analysis before making your final decision (conversion can be very costly). If you have a lot of other IMS/DB databases, then it probably doesn't make a lot of sense to convert to DB2 unless you cannot support the needs of your end users (management, ad hoc support, etc.) using IMS.

In terms of CPU requirements, DB2 will consume more CPU than IMS. DB2 optimizes queries internally whereas IMS programmers construct access paths to data. This additional requirement will cause DB2 to consume more CPU. But, of course, that additional CPU brings with it the enormous benefit of database optimization and better ad hoc query support.

You might also want to take a look at a product like DL/2. I have never used it so I cannot recommend for or against its functionality, but it looks like it might save you some work.