The Db2 Portal Blog

Wednesday, March 09, 2011

DB2 Symposium 2011

Today's blog post is about a great symposium dedicated to the topic of DB2. It is called, appropriately enough, the DB2 Symposium. DB2 Symposium is a three day training event with one day seminars presented by well-known DB2 consultants. I was fortunate enough to be asked to participate this year by the primary organizer of the event, Klaas Brant. For those of you who don't know him, Klaas is a well-respected DB2 consultant based in the Netherlands... and an all around great guy.

Why should I attend the DB2 Symposium you may ask? Don't IDUG and IOD provide everything I need in the way of events? Well, DB2 Symposium fills the gap between a conference and a multi-day training course. The DB2 Symposium is unique because you can participate for 1, 2, or 3 days, depending on your needs and budget.

Although it has not been to the USA the past few years, the DB2 Symposium is a regular, well-known event in Europe! And after a period of absence the DB2 Symposium is back in the USA.

The USA DB2 Symposium is happening soon, so you'll need to act fast if you want to participate. It occurs March 21-23, 2011 in the Dallas, Texas area. More precisely, at the Hilton Arlington (2401 East Lamar Boulevard, Arlington, Texas, USA 76006-7503). Each day the training sessions start at 9.00am and end at around 5.00pm.

But registration on site is not possible, you must pre-register online... so plan ahead!

My session is on March 21st and it is called DB2 Developer's Guide Comes Alive! This one day session, covers tips, techniques, and procedures you need to know in order to excel at administering and using DB2 on the mainframe.The material is based upon DB2 Developer's Guide, the best-selling DB2 for z/OS book on the market. Additionally, the course material will contain references to sections of the book for students to find additional material on each topic after the sessions. Topics to be covered will include:

A performance tuning roadmap for managing DB2 application, database and system performance. You will learn SQL coding and tuning techniques, guidance for database optimization and reorganization, coverage of buffer pool settings and parameters for performance.
Logical and physical database design recommendations for DB2, so you can build and maintain effective DB2 databases immediately. Includes discussion of standards, logical to physical translation, data types, usage of nulls, and more.
Information and guidance on BINDing and REBINDing, along with a discussion of the most important parameters.
Along the way we'll look at locking, access paths, statistics, indexing and more.
And even though the current edition of the book covers through DB2 V8, this course adds coverage of some of the newer features added to DB2 in versions 9 and 10 that can boost your productivity and performance.

If you own the book already, bring it along and I'll be happy to autograph it for you. And then you can use it along with the course materials... and if you don't own it already, you'll probably want to grab a copy after attending the seminar... you can always find a link to buy my books on the front page of my web site at http://www.craigsmullins.com.

So register for the DB2 Symposium today... and I'll see you in Dallas, pardner!

Monday, February 21, 2011

Not Your Standard Sorting Requirement

Sometimes the requirements of a particular application dictate that data needs to be sorted using some irregular collating sequence. These odd needs sometimes cause developers to sit and scratch their heads for hours, searching for ways to make DB2 do something that seems to be "unnatural." But often you can create an answer just by understanding the problem and applying some creative SQL.

At this point, some of you might be asking "What the heck is he talking about?" Fair enough. Let’s take a look at an example to bring the issue into focus.

Assume that you have a table containing transactions, or some other type of interesting facts. The table has a CHAR(3) column containing an abbreviation for the name of the day on which the transaction happened; let’s call this column DAY_NAME. So, for example, the DAY_NAME column would contain MON for Monday data, and so on.

Now, let’s further assume that we want to write queries against this table that orders the results by DAY_NAME. We’d want Sunday first, followed by Monday, Tuesday, Wednesday, and so on. How can this be done?

Well, the first step is usually to write the first query that comes to mind, or something like this:

SELECT DAY_NAME, COL1, COL2 . . .
FROM TXN_TABLE
ORDER BY DAY_NAME;

Of course, the results will be sorted improperly. ORDER BY will sort the results alphabetically; in other words: FRI MON SAT SUN THU TUE WED

This is what I mean by an irregular sorting requirement. Here we have an example that occurs commonly enough, but without an obvious immediate solution. Furthermore, many businesses and applications have similar requirements for which the business needs dictate a different sort order than strictly alphabetical or numeric. So what is the solution here?

Of course, one solution would be to design the table with an additional numeric or alphabetic column that would sort properly. By this, I mean that we could add a DAY_NUM column that would be 1 for Sunday, 2 for Monday, and so on. But this requires a database design change, and it becomes possible for the DAY_NUM and DAY_NAME to get out of sync if you are not careful.

There is another solution that is both elegant and does not require any change to the database. To implement this solution, all you need is an understanding of SQL and SQL functions -- in this case, the LOCATE function. Consider this SQL:

SELECT DAY_NAME, COL1, COL2 . . .
FROM TXN_TABLE
ORDER BY LOCATE('SUNMONTUEWEDTHUFRISAT',DAY_NAME);

The trick here is to understand how the LOCATE function works. The LOCATE function returns the starting position of the first occurrence of one string within another string.

So, in our example, LOCATE finds the position of the DAY_NAME value within the string 'SUNMONTUEWEDTHUFRISAT', and returns the integer value of that position. So, if DAY_NAME is WED, the LOCATE function returns 10. (Note: Some other database systems have a similar function called INSTR.) Sunday would return 1, Monday 4, Tuesday 7, Wednesday 10, Thursday 13, Friday 16, and Saturday 19. This means that our results would be in the order we require.

Of course, you can go one step further if you’d like. Some queries may need to actually return the day of week. You can use the same technique with a twist to return the day of week value, given only the day’s name. To turn this into the appropriate day of the week number (that is, a value of 1 through 7), we divide by three, use the INT function on the result to return only the integer portion of the result, and then add one:

INT(LOCATE('SUNMONTUEWEDTHUFRISAT',DAY_NAME)/3) + 1;

Let’s use our previous example of Wednesday again. The LOCATE function returns the value 10. So, INT(10/3) = 3 and add 1 to get 4. And sure enough, Wednesday is the fourth day of the week.

Summary

With a sound understanding of the features of DB2 SQL and a little imagination many irregular requirements are achievable using nothing more than SQL!

Thursday, February 10, 2011

View Naming Conventions

Naming conventions sometimes instigate conflict within the world of DB2, especially when it comes to views. But, really, it should be very easy. Just always remember, that a view is a logical table. It consists of rows and columns, exactly the same as a DB2 table. A DB2 view can (syntactically) be used in SQL SELECT, UPDATE, DELETE, and INSERT statements in the same way that a DB2 table can. Furthermore, a view can be used functionally the same as a DB2 table (with certain limitations on updating as outlined in my article).

Therefore, shouldn't it stand to reason that views should be held to the same naming conventions as are used for tables? (As an aside, the same can be said for DB2 aliases and synonyms).

End users querying views don't need to know whether they are accessing a view or a table. That is the whole purpose of views. Why then, should we enforce an arbitrary naming standard, such as putting a V in the first or last position of a view name, on views?

DBAs and technical analysts, those individuals who have a need to differentiate between tables and views, can utilize the DB2 Catalog to determine which objects are views and which objects are tables.

Most users don't care whether they are using a table, view, synonym, or alias. They simply want to access the data. And, in a relational database, tables, views, synonyms, and aliases all logically appear to be identical to the end user: collections of rows and columns.

There are certain operations that cannot be performed on certain types of views, but the end users who need to know this will generally be sophisticated users. For example, very few shops allow end users to update any table they want using a report writer or query tool (e.g. QMF, SPUFI, etc.). Updates, deletions, and insertions (the operations which are not available to some views) are generally coded into application programs and executed in batch or via online transactions. Most end users need to query tables dynamically.

Now you tell me, which name will your typical end user remember more readily when he needs to access his marketing contacts: MKT_CONTACT or VMKTCT01?

Thursday, February 03, 2011

TIMESTAMP versus DATE/TIME

Consider a database design decision point where you need to store both date and time information on a single row in DB2. Is it better to use a single TIMESTAMP column or two columns, one DATE and the other TIME?

Well, of course, the answer is "it depends!" The correct solution will depend on several factors specific to your situation. Consider the following points before making your decision:

With DATE and TIME you must use two columns. TIMESTAMP uses one column, thereby simplifying data access and modification.

The combination of DATE and TIME columns requires 7 bytes of storage, while a TIMESTAMP column always requires 10 bytes of storage. Using the combination of DATE and TIME columns will save space.

TIMESTAMP provides greater time accuracy, down to the microsecond level. TIME provides accuracy only to the second level. If precision is important, use TIMESTAMP. Use TIME if you want to ensure that the actual time is NOT stored down to the microsecond level.

A TIMESTAMP can always be broken down into a DATE and a TIME component, after which you can treat the data just like DATE and TIME data.

Date and time arithmetic probably will be easier to implement using TIMESTAMP data instead of a combination of DATE and TIME. Subtracting one TIMESTAMP from another results in a TIMESTAMP duration. To calculate a duration using DATE and TIME columns, two subtraction operations must occur: one for the DATE column and one for the TIME column.

Formatting may be easier with DATE and TIME data. DB2 provides for the formatting of DATE and TIME columns via local DATE and TIME exits, the CHAR function, and the DATE and TIME precompiler options. If the date and time information is to be extracted and displayed on a report or by an online application, the availability of these DB2-provided facilities for DATE and TIME columns should be considered when making your decision.

Prior to DB2 V9, not much help was available for the formatting of TIMESTAMP columns.But DB2 9 for z/OS adds the TIMESTAMP_FORMAT function, which offers three different formats for displaying timestamp data.

Upon reviewing all of these details, and factoring in your usage requirements, you can then make an informed decision about whether to use one TIMESTAMP column, or two columns, one DATE and one TIME.

Monday, December 13, 2010

More Indicator Variables Available in DB2 10 for z/OS

As you all should know by now, version 10 of DB2 doe z/OS is generally available and has been for a month or so now. As such, it is probably time that I start to blog about some of the features of the new release. But instead of starting with one of the bigger features, that you already may have heard about, I decided to start with a feature that has flown somewhat under the radar: extended indicator variables.

Those of you who write programs that deal with possibly null results should know what an indicator variable is. Basically, DB2 represents null in a special variable known as an indicator. An indicator is defined to DB2 for each column that can accept nulls. The indicator variable is transparent to the end user, but must be provided for when programming in a host language (such as COBOL or PL/I). If the indicator variable is less than zero, then the column to which it applies has returned NULL.

DB2 10 for z/OS enhances and extends the concept of an indicator variable so they can be used outside the scope of nullability. Consider the scenario where a program is being written to modify data. There are multiple combinations of columns that may need to be modified based on conditional programmatic processing. Maybe the program is for editing customer data. The customer has multiple columns that could be modified: name, address, telephone number, credit rating, etc. So the programmer codes up an online screen (e.g. CICS) with all of the data that can then be changed by the end user.

But what happens when the end user cracks the enter key to submit the changes? What actually changed and what stayed the same? Does the program check every value on the screen (perhaps hundreds) and build every UPDATE statement iteration for data that might have been changed? Unlikely, since that would require x! statements (where x is the total number of columns). For non-mathematicians a discussion of factorial can be found here (http://en.wikipedia.org/wiki/Factorial).

Yes, there are CICS options to help the programmer determine which values have changed (or simply save and compare). But until now, dealing with all the potential SQL statements could be problematic. Well, DB2 10 indicator variables come to the rescue. As of DB2 10 NFM you can use indicator variables to inform DB2 whether the value for an associated host variable has been supplied or not… and to specify how DB2 should handle the missing value.

This is an extended indicator variable. And it can be applied to host variables and parameter markers. Whether you will use extended indicator variables can be enabled at the package level, by using the EXTENDEDINDICATOR option of the BIND PACKAGE command. You can also enable extended indicator variables on a statement level for dynamic SQL by using the WITH EXTENDED INDICATORS attribute on the PREPARE statement.

How would this work? Well, extended indicator variables can be specified only for host variables that appear in the following situations:

The set assignment list of an UPDATE operation in UPDATE or MERGE statements
The values list of an INSERT operation in INSERT or MERGE statements
The select list of an INSERT statement in the FROM clause

OK, then, how would we use an extended indicator variable? By setting its value to tell DB2 how to proceeed. The following values are available:

0 (zero) or a positive integer: This indicates the first host identifier provides the value of this host variable reference and it is not null.
-1, -2, -3, -4, or -6: This indicates a null.
-5: If extended indicator variables are not enabled, this indicates a null; otherwise, a value of -5 indicates that the DEFAULT value is to be used for the target column for this host variable.
-7: If extended indicator variables are not enabled, this indicates a null; otherwise, a value of -7 indicates that the UNASSIGNED value is to be used for the target column for this host variable (in other words, treat it as if it were not specified in this statement).

For an INSERT, -5 and -7 settings for an extended indicator variable will end up with the same result. This is so because the INSERT statement works by inserting a default value for any column that is missing. On the other hand, for UPDATE and the UPDATE portion of a MERGE, setting the extended indicator variable to -5 leads to the column being update to the default value, but -7 leads to the update of the column not being applied.

With extended indicator variables then, there is no need for the application to re-send a column’s current value, or to know a column’s DEFAULT value. Which should make things easier for developers.

Thursday, October 28, 2010

IBM Information On Demand 2010 - The Final Keynote

The keynote session for the third day of the IOD conference features the authors of Freakonomics, Steven Levitt and Stephen Dubner. I've read their first book and it is an excellent read... I highly recommend it.

But, of course, there are the IBMers that must speak first. The session kicked off with a video on intelligence being infused into the devices we use in our everyday life. And this “smarter planer” improves our life in countless ways. Smart grids, smart healthcare, smart supply chains, etc. All of which make us more productive and effective not just in business, but in all aspects of our lives. IBM calls this the “Decade of Smart.”

The first part of the session then featured Mike Rhodin, Sr. VP IBM Software Solutions Group. He indicated that we are at the beginning of what is going to constitute massive changes to the way we look at and solve problems. He explained by talking about solutions for commerce that have changed over the last decade or so. The experience is vastly different today across the board. This is so in terms of how buying decisions are made, how buying is done, and how the transaction is completed.

But how can we know what the customer of the future wants? The idea now is to look at how you can leverage things like social media to perform “sentiment analysis” to engage in conversation. By making it a dialogue instead of a one way street we can start this transformation.

He talked about a Southwest flier who was dissatisfied by a delay and tweeted about it. A few days later a Southwest representative contacted him and offered him “something” to assuage his dissatisfaction. Although that is good, he said it would have been better if the Southwest rep was waiting for him at the gate at his destination. OK, but in my opinion, it would have been even better if the Southwest rep could have made contact before the plane took off (either on the plane or at the gate if they had not yet boarded).

Next up was Brenda Dietrich, VP Business Analytics and Mathematical Sciences and IBM Research. It was good to hear from someone in the research group because they don't get "out" to speak much. Dietrich espoused the global reach of IBM’s research group with 9 major offices across the world and many more co-laboratories, which are smaller labs with the goal of working with more local talent.

If it has to do with the future of technology, IBM Research is probably involved in it. Examples include nanotechnology, supercomputing and workload-optimized systems, cloud computing, and analytics.

The future of the “smarter planet” is at optimizing individual systems, like an electrical grid. And then developing systems of systems where those individual systems interact with other systems. For example, where the electrical grid interacts with the traffic grid. Additionally, today things are being digitized and we are analyzing and reacting to this information. We are moving toward using this information to model and predict outcomes.

She also discussed an analytics project called Smart Enterprise Executive Decision Support (SEEDS). It is a super-dashboard that IBM Research is working on. It incorporated a common data model with multiple IBM technologies to perform analytics that delivers better answers. Sounds exciting to me!

And IBM Research has even created a computer that plays Jeopardy! The video she played that demonstrated that was very impressive. This is especially so because it understood the questions in natural language, which is very difficult for computers to accomplish.

Then the Freakonomics dudes came out. And they were very entertaining. They took turns telling stories. Levitt is the economist and Dubner is the writer, but both were eloquent speakers who mixed information with humor extremely well. Dubner started out with my favorite story from their first book: how the legalization of abortion led to a decrease in crime. If that surprises you, you really need to read the book(s).

Levitt followed and told the story of how adding social security number to the tax forms caused 7 million children to vanish from the face of the Earth. It turns out that Americans are very immoral and had created children for the tax deduction. Levitt had troubles believing this until he talked to his father and was told that he himself had lost two brothers!

I would try to explain how they then moved from trying to teach monkeys to use money to a discussion of the proper pricing for prostitution services... but it would be far better if you read about it in their book(s). I know after seeing them that I am going to buy their new book, SuperFreakonomics.

All in all, though, it was a thoroughly entertaining and education final keynote session at the IOD show.

Tuesday, October 26, 2010

A Video from IOD, DB2 -- Monday 10/25

This video was shot by Rebecca Bond at the IBM Information On Demand Conference 2010 in Las Vegas. She was interviewing DB2 folks on what they do and the benefit they get from attending IOD. I am the third interview... but don't just skip to me! Listen to Melanie and Fred, too!

News From The IOD Conference

As usual, IBM has put out a number of press releases in conjunction with the Information On Demand conference, and I will use today’s blog to summarize some of the highlights of these releases.

First of all, IBM is rightly proud of the fact that more than 700 SAP clients have turned to IBM DB2 database software to manage heavy database workloads for improved performance… and, according to IBM, at a lower cost. By that they mean at a lower cost than Oracle. Even though the press release does not state that these SAP sites chose DB2 over Oracle, the IBM executive I spoke with yesterday made it clear that that was indeed the case.

This stampede of SAP customers over to DB2 should not be a surprise because DB2 is SAP’s preferred database software. This might be surprising given that SAP recently acquired Sybase, but IBM notes that seven Sybase runs SAP on DB2.

The press release goes on to call out several customers who are using DB2 with SAP and their reasons for doing so. For example, Reliance Life chose DB2 for the better transaction performance and faster access to data it delivered. Banco de Brasil, on the other hand, was looking to reduce power consumption and storage by consolidating its database management systems.

IBM also announced new software that helps clients automate content-centric processes and manage unstructured content. The highlight of this announcement is IBM Case Manager, software that integrates content and process management with advanced analytics, business rules, collaboration and social software.

IBM also enhanced its content analytics software. NTT DOCOMO of Japan is impressed with IBM’s offering. “With Content Analytics, we have an integrated view of all information that’s relevant to out business in one place regardless of where it’s stored,” said Makoto Ichise, Manager of Information Systems Department Group at NTT DOCOMO.

IBM also enhanced its Information Governance solutions and announced further usage of it InfoSphere Streams product for analyzing medical data to improve healthcare.

So IBM software keeps improving and helping us to better manage our data in a constantly changing world…

Monday, October 25, 2010

DB2 10 Technical Overview at IOD Conference

Today I attended the IOD conference and had the opportunity to listen to Jeff Josten present an technical overview of DB2 10 for z/OS. Even though information and specifications have been trickling out on DB2 10 for z/OS over the course of the past year or so, this is the first DB2 10 presentation I have attended subsequent to the GA announcement IBM made last week (announced October 19th, General Availability on October 22nd, 2010). So I’m fairly certain that everything Jeff will talk about will be officially part of DB2 10, instead of just the rumors, hints and allegations proffered prior to the GA announcement. I don’t think anything to be covered will surprise me, but we’ll see.

Jeff started off saying that the technical strategy for DB2 10 was four-pronged:

Continuous availability

Performance and scalability

Ease of management

Advance application features

One of the key value propositions is the deep synergy with the System z hardware. Because the code does not need to be ported all over the place, it can take advantage of the hardware capabilities that bring improved performance and efficiency.

DB2 10 is being promoted as delivering 5 to 10 percent CPU batch and transaction performance improvement out-of-the-box; 20 percent for new workloads. And then an additional 10 percent when you start using new features. This seems to be the high-level talking point for DB2 10 – but that is okay, it is a very good one!

DB2 V8 will go out of service April 2012. But with the ability to go staright from V8 to 10, I’m betting a lot of V8 shops will skip 9 altogether and go right to 10. But it looks like you have just under 2 years to get off of V8, though.

DB2 10 takes advantage of many zEnterprise (the new mainframe announced a couple of months ago) features to deliver scalability. Examples include improved compression, cache optimization, blades for running the Smart Analytics Optimizer, etc.

Jeff mentioned that we’ll be able to support 10 times more users by avoiding memory constraints in DB2 10. That is a big scalability improvement!

DB2 10 went through the largest beta ever: 23 customers and more than 80 vendors. The focus was on testing production level workloads to ensure that the release is stable. That is different than past betas where the focus was on testing new features. And 22 of the beta customers are planning to go into production with DB2 10 next year. Impressive!

Before diving into the technical details, Jeff mentioned that not much has changed versus what IBM has been talking about over the past months. A couple late add features include hash performance, BIND performance, REBIND not required for packages flagged as private protocol (but they will fail if they actually use private protocol), and a new ZPARM for default SEGSIZE for DDL compatibility.

Post GA delivery items include APREUSE and APCOMPARE (for reusing access paths instead of using “hints”) because beta testing exposed some quality items, the ability to delete a data sharing member, inline LOBs for SPT01, online REORG concurrency for materializing deferred ALTERs, and some temporal enhancements (e.g. TIMESTAMP WITH TIMEZONE support).

High performance DBATs (DDF threads) were forced to be RELEASE COMMIT. This gave good storage usage but at the expense of CPU for releasing resources at COMMIT and putting the thread back on the queue. For DB2 10, customers can use RELEASE DEALLOCATE which will keep the thread assigned to the distributed requestor so the next time he comes in he can reuse the thread. This is a nice feature for shops with heavy distributed usage.

Another nice thing right out of the box is parallel index updating at INSERT. This used to be synchronous, each index modified one after the other. Now, the indexes can be modified in parallel, which should be a nice performance improvement for many shops!

There is also a new buffer pool option for a “fully in memory” object for reading the data and pinning it in buffers. Don’t know about you, but I’ve been wanting that for years.

Things that require a REBIND will include most access path enhancements, query parallelism improvements, IN list performance improvements, and Stage 2 predicates being pushed to Stage 1.

Things that require NFM include DB2 Catalog concurrency, compress on insert capability, most utility enhancements, LOB streaming between DDF and the rest of DB2, INSERT improvements for universal table spaces, faster FETCH and INSERT with lower virtual storage consumption, SQL Procedure Language performance improvements, efficient caching of dynamic SQL with literals, as well as a few other “things.”

And then there are things that require NFM and DBA work, such as hashing, index include columns, inline LOBs, DEFINNE NO for LOB and XML columns, MEMBER CLUSTER for universal table spaces, and online REORG for all DB2 Catalog and Directory table spaces.

So it looks like our favorite DBMS is continuing to grow and expand offering high performance functionality and features that are unparalleled in the industry. Indeed, there is a lot of great and exciting new “stuff” on the way in DB2 10.

Thursday, October 07, 2010

Null Follow-up: IS [NOT] DISTINCT FROM

After publishing the last blog post here on the topic of pesky problems that crop up when dealing with nulls, I received a comment lamenting that I did not address the IS [NOT] DISTINCT FROM clause. So today’s blog post will redress that failure.

First of all, IS [NOT] DISTINCT FROM is a relatively new predicate operator, introduced to DB2 for z/OS in Version 8. It is quite convenient to use in situations where you are looking to compare to columns that could contain NULL.

Before diving into the operator, let’s first discuss the problem it helps to solve. Two columns are not equal if both are NULL, that is because NULL is unknown and a NULL never equals anything else, not even another NULL. But sometimes you might want to treat NULLs as equivalent. In order to do that, you would have to code something like this in your WHERE clause:


WHERE COL1 = COL2
OR (COL1 IS NULL AND COL2 IS NULL)

This coding would cause DB2 to return all the rows where COL1 and COL2 are the same value, as well as all the rows where both COL1 and COL2 are NULL, effectively treating NULLs as equivalent. But this coding although relatively simply, can be unwieldy and perhaps, at least not at first blush, unintuitive.

Here comes the IS NOT DISTINCT FROM clause to the rescue. As of DB2 V8, the following clause is logically equivalent to the one above, but perhaps simpler to code and understand:


WHERE COL1 IS NOT DISTINCT FROM COL2

The same goes for checking a column against a host variable. You might try to code a clause specifying WHERE COL = :HV :hvind (host variable and indicator variable). But such a search condition would never be true when the value in that host variable is null, even if the host variable contains a null indicator. This is because one null does not equal another null - ever. Instead we’d have to code additional predicates: one to handle the non-null values and two others to ensure both COL1 and the :HV are both null. With the introduction of the IS NOT DISTINCT FROM predicate, the search condition could be simplified to just:


WHERE COL1 IS NOT DISTINCT FROM :HV :hvind

Wednesday, October 06, 2010

Null Troubles

A null represents missing or unknown information at the column level. A null is not the same as 0 (zero) or blank. Null means no entry has been made for the column and it implies that the value is either unknown or not applicable.

DB2 supports null, and as such you can use null to can distinguish between a deliberate entry of 0 (for numerical columns) or a blank (for character columns) and an unknown or inapplicable entry (NULL for both numerical and character columns).

Nulls sometimes are inappropriately referred to as “null values.” Using the term value to describe a null is inaccurate because a null implies the lack of a value. Therefore, simply use the term null or nulls (without appending the term “value” or “values” to it).

DB2 represents null in a special “hidden” column known as an indicator. An indicator is defined to DB2 for each column that can accept nulls. The indicator variable is transparent to the end user, but must be provided for when programming in a host language (such as COBOL or PL/I).

Every column defined to a DB2 table must be designated as either allowing or disallowing nulls. A column is defined as nullable – meaning it can be set to NULL – in the table creation DDL. Null is the default if nothing is specified after the column name. To prohibit the column from being set to NULL you must explicitly specify NOT NULL after the column name. In the following sample table, COL1 and COL3 can be set to null, but not COL2, COL4, or COL5:


CREATE TABLE SAMPLE1
  (COL1   INTEGER,
   COL2   CHAR(10) NOT NULL,
   COL3   CHAR(5),
   COL4   DATE     NOT NULL WITH DEFAULT,
   COL5   TIME     NOT NULL);

What Are The Issues with Null?

The way in which nulls are processed usually is not intuitive to folks used to yes/no, on/off, thinking. With null data, answers are not true/false, but true/false/unknown. Remember, a null is not known. So when a null participates in a mathematical expression, the result is always null. That means that the answer to each of the following is NULL:

5 + NULL
NULL / 501324
102 – NULL
51235 * NULL
NULL**3
NULL + NULL
NULL/0

Yes, even that last one is null, even though the mathematician in us wants to say “error” because of division by zero. So nulls can be tricky to deal with.

Another interesting aspect of nulls is that the AVG, COUNT DISTINCT, SUM, MAX, and MIN functions omit column occurrences set to null. The COUNT(*) function, however, does not omit columns set to null because it operates on rows. Thus, AVG is not equal to SUM/COUNT(*) when the average is being computed for a column that can contain nulls. To clarify with an example, if the COMM column is nullable, the result of the following query:


    SELECT  AVG(COMM)
    FROM    DSN8810.EMP;

is not the same as for this query:


    SELECT  SUM(COMM)/COUNT(*)
    FROM    DSN8810.EMP;

But perhaps the more troubling aspect of this treatment of nulls is “What exactly do the results mean?” Shouldn’t a function that processes any NULLs at all return an answer of NULL, or unknown? Does skipping all columns that are NULL return a useful result? I think what is really needed is an option for these functions when they operate on nullable columns. Perhaps a switch that would allow three different modes of operation:

Return a NULL if any columns were null, which would be the default
Operate as it currently does, ignoring NULLs
Treat all NULLs as zeroes

At least that way users would have an option as to how NULLs are treated by functions. But this is not the case, so to avoid confusion, try to avoid allowing nulls in columns that must be processed using these functions whenever possible.

Here are some additional considerations regarding the rules of operation for nulls:

When a nullable column participates in an ORDER BY or GROUP BY clause, the returned nulls are grouped at the high end of the sort order.
Nulls are considered to be equal when duplicates are eliminated by SELECT DISTINCT or COUNT (DISTINCT column).
A unique index considers nulls to be equivalent and disallows duplicate entries because of the existence of nulls, unless the WHERE NOT NULL clause is specified in the index.
For comparison in a SELECT statement, two null columns are not considered equal. When a nullable column participates in a predicate in the WHERE or HAVING clause, the nulls that are encountered cause the comparison to evaluate to UNKNOWN.
When a nullable column participates in a calculation, the result is null.
Columns that participate in a primary key cannot be null.
To test for the existence of nulls, use the special predicate IS NULL in the WHERE clause of the SELECT statement. You cannot simply state WHERE column = NULL. You must state WHERE column IS NULL.
It is invalid to test if a column is <> NULL, or >= NULL. These are all meaningless because null is the absence of a value.

Examine these rules closely. ORDER BY, GROUP BY, DISTINCT, and unique indexes consider nulls to be equal and handle them accordingly. The SELECT statement, however, deems that the comparison of null columns is not equivalence, but unknown. This inconsistent handling of nulls is an anomaly that you must remember when using nulls.

Here are a couple of other issues to consider when nulls are involved.

Did you know it is possible to write SQL that returns a NULL even if you have no nullable columns in your database? Assume that there are no nullable columns in the EMP table (including SALARY) and then consider the following SQL:


SELECT SUM(SALARY)
FROM   EMP
WHERE  DEPTNO > 999;

The result of this query will be NULL if no DEPTNO exists that is greater than 999. So it is not feasible to try to design your way out of having to understand nulls!

Another troubling issue with NULLs is that some developers have incorrect expectations when using the NOT IN predicate with NULLs. Consider the following SQL:


SELECT C.color
FROM   Colors AS C 
WHERE  C.color NOT IN (SELECT P.color 
                       FROM   Products AS P);

If one of the products has its color set to NULL, then the result of the SELECT is the empty set, even if there are colors to which no other product is set.

Summary

Nulls are clearly one of the most misunderstood features of DB2 – indeed, of most SQL database systems. Although nulls can be confusing, you cannot bury your head in the sand and ignore nulls if you choose to use DB2 as your DBMS. Understanding what nulls are, and how best to use them, can help you to create usable DB2 databases and design useful and correct queries in your DB2 applications.