Monday, February 16, 2015

Choosing Between DECIMAL and FLOAT Data Types

DB2 can use both DECIMAL and FLOAT data types to store non-integer numeric data. But the two are not equivalent. In general, use DECIMAL instead of FLOAT whenever you can. The main problem with floating point numbers is that they are not precise. DECIMAL values are precise. In other words, a FLOAT value will be an approximate value whereas a DECIMAL value will be an exact value.

At times, if 100 percent precision is not required, you might want to use floating point numbers to save on storage. DB2 provides a parameter, n, to size a floating point column, where the data type specification is FLOAT(n). If n is between 1 and 21, the column is a single precision floating point number and requires 4 bytes of storage; if n is between 22 and 53, the column is a double precision floating point number and requires 8 bytes. A single precision floating-point number is a short (32-bit) floating-point number; a double precision floating-point number is a long (64-bit) floating-point number.

For DECIMAL columns, the byte count is calculated as INTEGER(p/2)+1, where p is the precision of the DECIMAL column. So a DECIMAL(10,2) column will require INTEGER(10/2)+1 = 6 bytes of storage. An approximation of the same number could be stored in a FLOAT(21) column that would require only 4 bytes of storage.
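To make the storage trade-off concrete, here is a minimal sketch (the table and column names are hypothetical) declaring the same measurement both ways:

   CREATE TABLE SAMPLE_MEASURES
     (MEASURE_ID  INTEGER       NOT NULL,
      AMT_EXACT   DECIMAL(10,2) NOT NULL,   -- exact value: INTEGER(10/2)+1 = 6 bytes
      AMT_APPROX  FLOAT(21));               -- single precision approximation: 4 bytes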

For very large or very small numbers, though, you will have to use FLOAT columns. This is so because there is a limit of 31 on decimal precision. The following outlines the largest and smallest values that can be supported using DECIMAL and FLOAT data types:

  •   Smallest FLOAT value is about -7.2 x 10**75
  •   Largest FLOAT value is about 7.2 x 10**75
  •   Smallest positive FLOAT value is about 5.4 x 10**-79
  •   Largest negative FLOAT value is about -5.4 x 10**-79
  •   Smallest DECIMAL value is 1 – 10**31
  •   Largest DECIMAL value is 10**31 - 1

Note that the values for floating point numbers are approximations. The ** is used to indicate “raised to the power of”.


If you are moving data between platforms, there is an additional concern when using FLOAT. Mainframes traditionally use IBM's hexadecimal floating point format, whereas most other platforms use the IEEE 754 standard. Since floating point numbers are imprecise to begin with, this may not be a problem. However, if you want to make sure that a particular column will contain exactly the same value regardless of platform, then floating point is not the way to go.

There is a third, newer option called DECFLOAT. Introduced in DB2 9 for z/OS, DECFLOAT is a decimal floating-point data type that combines attributes of the two data types discussed here. It is specified as DECFLOAT(n), where n can be either 16 or 34, representing the number of significant digits that can be stored. If n is not specified, the DECFLOAT column can represent 34 significant digits.

A decimal floating-point value is an IEEE 754r number with a decimal point. The maximum precision is 34 digits and the range of a DECFLOAT number is as follows:

  •   A DECFLOAT(16) value can range from a low of -9.999999999999999×10**384 to a high of 9.999999999999999×10**384
  •   A DECFLOAT(34) value can range from a low of -9.999999999999999999999999999999999×10**6144 to a high of 9.999999999999999999999999999999999×10**6144


In addition, the DECFLOAT data type can be used to represent several special values that represent "non-number numbers," as follows:

  • Infinity - a value that represents a number whose magnitude is infinitely large.
  • Quiet NaN - a value that represents undefined results which does not cause an invalid number condition. NaN is not a number.
  • Signaling NaN - a value that represents undefined results which will cause an invalid number condition if used in any numerical operation.
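
If you want to experiment with DECFLOAT, a minimal sketch follows (the table and column names are hypothetical); it declares a DECFLOAT(34) column and inserts a value far outside the range of DECIMAL using the DECFLOAT scalar function:

   CREATE TABLE SAMPLE_READINGS
     (READING_ID  INTEGER NOT NULL,
      READING_VAL DECFLOAT(34));             -- up to 34 significant digits

   INSERT INTO SAMPLE_READINGS
     VALUES (1, DECFLOAT('1.5E+100'));       -- too large to store in a DECIMAL column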


So decimal floating-point values can be more flexible and precise, with the ability to range lower and higher than floating point values (or decimal values). However, before using DECFLOAT, be careful and plan ahead if you use COBOL programs to operate on your DB2 for z/OS data, because there is no way to specify the SQL DECFLOAT data type in COBOL.


Monday, February 09, 2015

Nulls Cannot Be Ignored!

NULLs are one of the more controversial things that a DB2 professional has to deal with. And I do mean HAS to deal with. 


Because of all the controversy about their usefulness, implementation, and confusing nature, some DB2 folks just decide to hide their head in the sand and ignore NULLs. The thinking goes something like this: “If I do not create any nullable columns in any of my DB2 tables, then I can blissfully ignore the whole NULL mess and be happy!”

Well, that is simply not true. You can follow this approach and still write a query that will return NULL. Don’t believe it? Run this query then:

SELECT AVG(PRSTAFF)
FROM   DSN8B10.PROJ
WHERE  DEPTNO = 'XOX';

PRSTAFF is defined as a DECIMAL(5,2) column and we are trying to find the average staffing for a specific department. The PRSTAFF column can contain nulls, but even if it could not, the result here would be the same. This query returns a NULL (unless someone inserted a row with the value of ‘XOX’ for DEPTNO at your site).

Why? There is no department ‘XOX’ in the sample databases (provided with DB2). So, the query is attempting to find an average for the empty set. This is NULL. Some folks think this query should return 0, but it won't! The sum of all PRSTAFF for the department ‘XOX’ is not zero, but is unknown…
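
If you would rather get a zero back than a NULL in situations like this, one common technique is to wrap the aggregate in COALESCE (or its synonym VALUE), for example:

SELECT COALESCE(AVG(PRSTAFF), 0)
FROM   DSN8B10.PROJ
WHERE  DEPTNO = 'XOX';

But even then you have not escaped NULLs; you have simply told DB2 what to return when it encounters one.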


So, take the time to understand how NULLs work in DB2, because they cannot be ignored!

Wednesday, February 04, 2015

Helping Out the DB2 Optimizer Using the VOLATILE Keyword

Do you know about the VOLATILE keyword? This keyword was added to DB2 for z/OS back in Version 8. It can be specified on a table using either the CREATE TABLE or ALTER TABLE statement.

By specifying VOLATILE, you are indicating that the volume of data in the table is not stable and is likely to fluctuate. In other words, it is volatile! One common scenario where VOLATILE will be helpful is for tables that are emptied nightly and then repopulated the next day, such as an input queue. 

When you specify the VOLATILE keyword on a table, BIND will favor using indexed access paths, even if the table was empty when RUNSTATS was run. It is ideal for single-index tables where you want DB2 to favor using the index.
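
As a minimal sketch (the table and column names are hypothetical, and the rest of the DDL is elided), the keyword simply becomes part of the table definition:

CREATE TABLE INPUT_QUEUE
  (MSG_ID   INTEGER      NOT NULL,
   MSG_TEXT VARCHAR(200) NOT NULL)
  VOLATILE;

ALTER TABLE INPUT_QUEUE VOLATILE;

The ALTER TABLE form is handy for the ERP scenario described below, where the tables already exist.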

ERP environments, such as SAP and PeopleSoft, come with thousands of tables, and some of those tables typically meet these criteria. Even worse, it is not uncommon for DBAs to have no idea of the actual content or use for many of those thousands of tables generated by the ERP installation. At times, some of the ERP tables are not in use – depending on which modules of the ERP system you implement… but the tables get created anyway. Many DBAs simply maintain all of the tables provided with the ERP system, whether they are used or not, including running image copies and gathering RUNSTATS for them... and many are empty tables.

Collecting statistics on an empty table populates the catalog with stats indicating that the table contains no data. And, of course, when access paths are generated using those statistics DB2 will probably favor a scan because the table is small (how much smaller can you get than empty?) But some of those tables are volatile, going from empty to perhaps hundreds of thousands of rows during processing.

Of course, if the table is actually empty (or contains only a small amount of data), and VOLATILE is specified, DB2 will favor the use of an index if one exists, which can degrade performance a bit. But that is a smaller price to pay than scanning thousands of rows, isn't it?


So one approach is to use the VOLATILE keyword for these types of tables... your users will be glad that you did.

Tuesday, January 20, 2015

Approaches to DB2 Access Path Management


BIND and REBIND are crucially important components as you attempt to ensure efficient DB2 applications. Because the BIND/REBIND process determines exactly how your DB2 data is accessed, it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as the data changes. This approach involves some form of regular maintenance that keeps DB2 statistics up to date and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

The final approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world we would be able to review the access path changes beforehand to determine if they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third party tools that can help you achieve this though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the Three Rs. To implement this approach you (a rough sketch of the corresponding statements follows the list):
  1. Regularly reorganize the data to ensure that it is optimally structured.
  2. Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
  3. And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS.
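As a rough sketch only, the Three Rs translate into statements along these lines (the database, table space, collection, and package names are hypothetical, and your utility options will certainly differ):

   REORG TABLESPACE MYDB.MYTS SHRLEVEL CHANGE

   RUNSTATS TABLESPACE MYDB.MYTS TABLE(ALL) INDEX(ALL)

   REBIND PACKAGE(MYCOLL.MYPROG)

The first two are utility control statements; the REBIND is issued as a DSN subcommand (or through your scheduling and automation tooling), typically for every package that depends on the reorganized objects.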
At any rate, your goal should be to keep your access paths up-to-date with the current state of your data. Failing to do this means that DB2 is accessing data based upon false assumptions. DB2 is unlikely to make the same access path choice as your data grows – and as patterns within the data change.

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or Real-Time Statistics (RTS). So, perhaps it should be at least 4 R’s – in other words:
  1. RTS (or RUNSTATS)
  2. REORG
  3. RUNSTATS
  4. REBIND
Now it is true that some folks don’t rely on statistics to schedule a REORG. Instead, they just build the JCL to REORG their database objects when they create the object. So they create a table space then build the REORG job and schedule it to run monthly, or quarterly, or on some regular basis. This is better than no REORG at all, but it is probably not the best approach because you are most likely either reorganizing too soon (in which case you waste the CPU cycles to do the REORG) or you are reorganizing too late (in which case performance is suffering for a period of time before the REORG runs). Better to base your REORGs off of statistics and thresholds using either RUNSTATS or RTS.
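
For example, Real Time Statistics can be queried directly to find REORG candidates. The sketch below is illustrative only: SYSIBM.SYSTABLESPACESTATS and its REORG counters are the RTS objects in current DB2 for z/OS catalogs, but the 10 percent change threshold is an arbitrary assumption you would tune for your own shop:

SELECT DBNAME, NAME, TOTALROWS,
       REORGINSERTS, REORGDELETES, REORGUPDATES
FROM   SYSIBM.SYSTABLESPACESTATS
WHERE  TOTALROWS > 0
AND    (REORGINSERTS + REORGDELETES + REORGUPDATES) > TOTALROWS * 0.10;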

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer does not have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive. There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next, we need to decide when to REBIND. The best answer is to REBIND when statistics have changed significantly enough to change access paths. When we know that data has significantly changed, it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or even better, access paths), coming up with an answer manually can be time-consuming and error-prone, especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four Rs approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four Rs becomes The Five Rs as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:
  1. Start with Real Time Stats (or RUNSTATS) to determine when to REORG.
  2. Then we Reorganize the needy table spaces (and indexes).
  3. After reorganizing, run RUNSTATS to pick up current statistics.
  4. Follow that with the REBINDs.
  5. Then we need that fifth R – which is to Review the access paths generated by the REBIND.
The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only dial your numbers when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results comparing the before and after impact of the optimizer’s choices. 
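
One low-tech way to support that review (short of buying a third party tool) is to REBIND with EXPLAIN(YES) and then query PLAN_TABLE, comparing the new rows against a copy saved before the REBIND. A minimal sketch, using commonly available PLAN_TABLE columns and a hypothetical program name:

SELECT QUERYNO, QBLOCKNO, PLANNO, METHOD,
       ACCESSTYPE, ACCESSNAME, MATCHCOLS, INDEXONLY, PREFETCH
FROM   PLAN_TABLE
WHERE  PROGNAME = 'MYPROG'
ORDER BY QUERYNO, QBLOCKNO, PLANNO;

A change from an index ACCESSTYPE to a table space scan (ACCESSTYPE = 'R'), or a drop in MATCHCOLS, is the kind of difference worth investigating before the change hits production.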

Wednesday, January 14, 2015

IBM Announces New Mainframes

IBM today announced their latest model mainframe, the IBM z13.



Everybody knows -- or should know -- that the mainframe is the work horse of the IT industry, but the new z13 boasts some amazing capabilities. The z13 can process 2.5 billion transactions a day. And those transactions are persistent, protected and auditable end-to-end!  Furthermore, the z13 delivers practical real-time encryption based on 500 new patents, including advanced cryptographic encryption technologies. And the z13 comes with embedded analytics capabilities.

“Every time a consumer makes a purchase or hits refresh on a smart phone, it can create a cascade of events on the back end of the computing environment. The z13 is designed to handle billions of transactions for the mobile economy.  Only the IBM mainframe can put the power of the world's most secure data centers in the palm of your hand,” said Tom Rosamilia, senior vice president, IBM Systems.

To put this amount of power into context, IBM notes that a single z13 mainframe can process more than 30,000 transactions per second, which is roughly the same as 100 Cyber Mondays every day!



IBM has designed the z13 to integrate real-time scoring and guarantees this capability as a feature of the system. This scoring can be used for fraud detection on 100 percent of a client's business transactions. To take advantage of this capability, customers must be running the latest versions of DB2 and the SPSS Scoring Adapter on a z13 machine, as well as following IBM's capacity planning implementation instructions and processes.

For the long-time mainframe zealots out there, of which I am one, it always gives me a warm feeling inside when IBM comes out with a new line of mainframes. The mainframe always leads the IT industry in terms of processing power and capabilities... and it looks like it will continue to do so well into the future...