The Db2 Portal Blog

Monday, December 13, 2021

Top 10 Steps to Building the Right Indexes

One of the constant struggles with Db2 development, and indeed developing for any relational DBMS, is designing and creating appropriate indexes. Perhaps the most important thing you can do to assure optimal application performance when accessing data in a relational/SQL database is to create correct indexes for your tables based on the queries your applications use. Of course, this is easier said than done.

But we can start with some basics. For example, consider this SQL statement:

    SELECT   LASTNAME, SALARY
    FROM     EMP
    WHERE    EMPNO = '000010'
    AND      DEPTNO = 'D01';

What index (or indexes) would it make sense to build for this simple query? First, think about all the possible indexes that you could create. Your first short-list probably looks something like this:

Index1 on EMPNO
Index2 on DEPTNO
Index3 on EMPNO and DEPTNO

This is a good start, and Index3 is likely to be the best of the lot. It enables the optimizer to use the index to immediately look up the row or rows that satisfy the two simple predicates in the WHERE clause. Of course, if you already have a lot of indexes on the EMP table, you might want to examine the impact of creating yet another index on the table.

With the caveat that appropriate index creation can be complicated, let’s look at the Top 10 things you can do to build the right indexes on your tables:

1. Index by workload, not by object

Many people make the mistake of just guessing at some indexes to create when they are creating other database objects (like tables and tablespaces). But without an idea of how the tables are going to be accessed these guesses are usually wrong – at least some of them.

Indexes should be built to optimize the access of your SQL queries. To properly create an optimal set of indexes requires a list of the SQL to be used, an estimate of the frequency that each SQL statement will be executed, and the importance of each query. Only then can the delicate balancing act of creating the right indexes to optimize the right queries most of the time be made.

If you are doing it any other way, you are doing it wrong.

2. Build indexes based on predicates

3. Index most-heavily used queries

Numbers 2 and 3 can be thought of as corollaries to Number 1… that is, these are the aspects of application workload that need to be examined to produce appropriate and effective indexes.

Look at the predicates of the queries in the workload and work to create a set of indexes that match up to most (if not all) of them. When it is not practical to build all of the indexes you have identified, then it makes sense to look at the queries that will be used most often and optimize them first.

4. Index important queries

The more important the query, the more you might want to tune by index creation. If you are coding a query that the CIO will run every day, you want to make sure it delivers optimal performance. So building indexes for that particular query is important. On the other hand, a query for a clerk might not necessarily be weighted as high, so that query might have to make do with the indexes that already exist.

Of course, the decision should depend on the application’s importance to the business - not just on the user’s importance. For example, if that clerk runs a query hundreds of times a day and the CIO runs his query only once, then perhaps the clerk's query is more important.

5. Index to avoid sorting (GROUP BY, ORDER BY)

In addition to building indexes to optimize data access, indexes can be used to avoid sorting. The GROUP BY and ORDER BY clauses tend to invoke sorts, which can cause performance slowdowns. By indexing on the columns specified in these clauses, the optimizer can use an index to avoid a sort, and thereby potentially improve performance.

6. Create indexes for uniqueness (PK, U)

Some indexes are required in order to make the database schema valid. Most Database systems require that unique indexes be created when unique and primary key constraints exist.

7. Create indexes for foreign keys

Even if not required, creating indexes for each foreign key can optimize the performance when accessing and enforcing referential constraints (RI – referential integrity).

8. Consider adding columns for index-only access

Sometimes it can be advantageous to include additional columns in an index to increase the chances of index-only access. With index-only access, all of the data needed to satisfy the query can be found in the index alone — without having to read data from the tablespace.

For example, suppose that there is an index on the DEPTNO column of the DEPT table. The following query may use this index:

   SELECT   DEPTNAME
FROM     DEPT
WHERE    DEPTNO = 'D01';

The index could be used to access only those columns with a DEPTNO greater than D00, but then the DBMS would need to access the data in the table space to return the DEPTNAME. If you added DEPTNAME to the index, that is, create the index on (DEPTNO, DEPTNAME) then all of the data needed for this query exists in the index and additional I/O to the table space would not be needed. This technique is sometimes referred to as index overloading.

Of course, this is not always a good idea. You have to take into account whether other queries use the index and how it might negatively impact their performance.

It is also worth mentioning index include columns, which allows you to define additional (included) columns that are not part of the actual key but are available in the index. So we can create an index like this:

CREATE INDEX IXNAME
    ON DEPT ( DEPTNO )
     INCLUDE ( DEPTNAME )

The index key is only on the DEPTNO column, but the DEPTNAME is also included in the index, so the earlier query can be satisfied using index-only access with this index.

9. Don’t arbitrarily limit the number of indexes

An example of a bad standard would be to place an artificial limit on indexing. There should be no arbitrary limit on the number of indexes that you can create for any database table. Indexes are undoubtedly one of the most important factors in creating efficient queries. Relational optimizers rely on indexes to build fast access paths to data. Without indexes data must be scanned – and that can be a long, inefficient means by which to retrieve your data. When a rule such as this exists, it usually is stated in the standards manual using verbiage something like “Each table can have at most five indexes created for it” — or — “Do not create more than three indexes for any single table in the database.” These are bad standards.

If you already have three indexes, or five indexes, or even 57 indexes, and another index will improve performance why would you arbitrarily want to avoid creating that index?

Anyway, a good indexing standard, if you choose to have one, should read something like this: “Create indexes as necessary to support your database queries. Limitations on creating new indexes should only be entertained when they begin significantly to impede the efficiency of data modification.”

Which brings us to…

10. Be aware of I/U/D implications

Db2 must automatically maintain every index you create. This means every INSERT and every DELETE to an indexed table will insert and delete not just from the table, but also from its indexes.

Additionally, when you UPDATE the value of a column that has been defined in an index, the DBMS must also update the index. So, indexes speed the process of retrieval but slow down modification.

So the general rule of thumb should be "Index until it hurts... and then back off the least effective index to make it no longer hurt." Sure, that is easier said than done, but it is a valid philosophy to follow.

Summary

Following these Top 10 index design techniques can go a long way toward improving not just your index usage, but also the performance of your database applications.

Monday, December 06, 2021

A Systematic Approach to View Creation

Historically, one of the most fertile grounds for disagreement between database professionals was the appropriate usage of views. Some analysts promote the liberal creation and usage of views, whereas others preach a more conservative approach.

When properly implemented and managed, views can be fantastic tools that help to ease data access and simplify development. Although views are simple to create and implement, few organizations take a systematic and logical approach to view creation. And therein lies the problem. A strategic and reasonable policy guiding the creation and maintenance of views is required to avoid a muddled and confused mish-mash of view usage. Basically, views are very useful when implemented wisely, but can be an administrative burden if implemented without planning.

Let's back up a minute and define what a view is. First, remember that all SQL access to relational tables results in another table. This is called relational closure and it is a requirement of the relational model. A view can be considered to be a logical table. That is, a view is a "logical" representation of data that is "physically" stored in other tables (and perhaps other views as well). Views are defined using SQL and are represented internally to the DBMS by SELECT statements, not by stored data (unless it is a materialized view, which is another thing altogether and beyond the scope of this particular post). Almost any SQL that can be issued natively can be coded into a view.

View Implementation Rules

After you understand the basics of views, you should develop guidelines for view creation in order to limit administrative burden. The following rules can be used to ensure that views are created in a responsible and useful manner at your shop. These rules were developed over a number of years as a result of reviewing implementations and working with views in many different environments. There are three basic view implementation rules:

1. The View Usage Rule

2. The Proliferation Avoidance Rule

3. The View Synchronization Rule

The View Usage Rule: The first rule is simple: your view creation strategy should be goal-oriented. Views should be created only when they satisfy a specific application or business requirement. That requirement should be documented somewhere, preferably in a dictionary or repository. Although this rule may seem to be obvious, views are implemented at some shops without much thought as to how they will be used. This can cause the number of views that must be supported and maintained to continually expand until so many views exist that it is impossible to categorize their uses. And the time needed to maintain and administer the system increases as the number of views increases.

There are five basic uses for which views excel. These are:

1) to provide row and column level security,
2) to ensure efficient access paths,
3) to mask complexity from the user,
4) to ensure proper data derivation, and
5) to rename tables and/or columns.

The second rule is the Proliferation Avoidance Rule. This rule is simple and to the point: do not needlessly create views (or, indeed, any database objects) that are not absolutely required. Whenever a database object is created additional entries are placed in the system catalog. Creating needless views causes "catalog clutter"…that is, entries in the catalog for objects which are not needed and/or are not used.

The proliferation avoidance rule is based on common sense. Why create something that is not needed? It just takes up space that could be used for something that is needed.

The third, and final view implementation rule is the View Synchronization Rule. The basic intention of this rule is to ensure that views are kept in sync with the base tables upon which they are based.

Whenever a change is made to a base table, all views that are dependent upon that base table should be analyzed to determine if the change impacts them. All views should remain logically pure. The view was created for a specific reason (if we followed the View Usage Rule above). The view should therefore remain useful for that specific reason. This can only be accomplished by ensuring that all subsequent changes that are pertinent to a specified usage are made to all views that satisfy that usage.

For example, say a view was created to satisfy an access usage, such as a join between the employee table and the department table. If a column is added to the employee table specifying the employee’s location, it should also be added to the EMP_DEPT view if it is pertinent to that view’s specific use. Of course, the column can be added to the table immediately and to the view at the earliest convenience of the development team.

The synchronization rule requires that strict change impact analysis procedures be in place. Every change to a base table should trigger the usage of these procedures. Simple SQL queries can be created to assist in the change impact analysis. These queries should pinpoint any SQL in QMF queries, application packages, dynamic packages, and so on. Policies for informing users of the views to be impacted also need to be established before view changes can be implemented.

View synchronization is needed to support the view usage rule. By keeping views in sync with table changes the original purpose of the view is maintained.

Synopsis

Views are practical and helpful when implemented in a systematic and thoughtful manner. Hopefully, this post has provided you with some food for thought pertaining to how views are implemented at your shop.

And if you follow the guidelines contained in this article, in the end, all that will remain is a beautiful view!

Monday, November 22, 2021

"Unstructured " Data and LOBs

Let's take a look at the burgeoning role of "unstructured" data and the impact of using LOBs to store that data in our Db2 databases.

The Growth of "Unstructured" Data

Although structured data remains the bedrock of the information infrastructure in most organizations, unstructured data is growing in importance. And there is much more "unstructured" data “out there” than structured. Indeed, analysts at IDC estimate that unstructured data accounts for as much as 90 percent of all digital information.

The rise of "unstructured" data is often attributed to the growing amount of multimedia data being adopted by organizations as they store more audio, video, and images. But that is only part of the story. Text documents, in the form of business forms, letters, and documents, and most importantly, e-mail, greatly contribute to the growing importance of unstructured data.

Now I'm sure you've noticed the quotes I've been putting around the word "unstructured," and they are there for a reason. You see, the word unstructured is a horrible adjective to use in conjunction with data to refer to this type of data. I refer you to this article I wrote called The Big Unstructured Data Lie for why I think this way. Nevertheless, the term "unstructured" will continue to be used to describe this type of data, so from now on I will stop using the quotes...

As organizations expand their digital information footprint, more types of unstructured data are being captured, stored, and made available for analysis. And not just internally generated data, but from more external data sources, too.

Db2 and Unstructured Data

Db2 for z/OS can be used to store unstructured data using BLOB, CLOB, and DBCLOB data types – collectively known as LOBs. Historically, LOBs were added to relational products like DB2 to compete more capably against the Object-Oriented databases of the time – this was back in the late 1990s. The idea was to enable relational tables to house unstructured data like images and audio and video data.

But Db2 for z/OS users were slow to adopt LOBs in their mainframe databases. This was due to several different reasons, not the least of which is that it took several versions of Db2 for LOBs to mature into capable, useful things for enterprise applications. Early implementations of LOBs in Db2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of those deficiencies over time and there are now tools that can help organizations to exploit and effectively manage Db2 LOBs, too.

The other new force driving LOB usage is the whole Big Data movement. Big Data is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. A specific example of Big Data driving the usage of LOBs in Db2 is the JSON support that has been added to Db2. JSON objects are stored in Db2 as BLOBs.

So, more and more organizations are adopting LOB data in their Db2 databases – to support unstructured data, for their big data projects, and to store documents and multimedia data.

Nevertheless, sometimes I hear DBAs say things like “Oh, well, I don’t use LOBs so I don’t really need to worry about them.” That can cause trouble, because you have been using LOBs for some time now, whether you know it or not. LOB data has been part of the Db2 Catalog since Version 7 and the number of LOB columns used in the Db2 Catalog has been increasing over the past couple of Db2 releases.

As you can see in table below, the Db2 Catalog now has over 40 LOB columns. So, even if you have not created any user Db2 tables with LOBs, you have system Db2 tables with LOBs. For example, SYSIBM.SYSVIEWS contains a CLOB containing the source text used to create the VIEW.

LOBs are also used in the Db2 Directory, in the DBD01 “table” a BLOB column is used to store the DBD data (2GB). And in the SPT01 “table” there are two BLOB columns for storing data and explain information.

Using LOBs in Db2

I will not attempt to train you on the usage of LOBs in Db2 in this blog post. Suffice it to say that LOBs require different management and administration tactics to ensure their accuracy and usability. But I do want to bring up some of the administration concerns that using LOBs can create.

The first thing to keep in mind is that most LOBs are larger than your typical column data. I mean, it’s right there in the name LOB = Large OBject. As the size of data increases, so do the management concerns, such as lengthy elapsed times to run utilities on the data, slower performance of accessing the data, and so on. And you’ll also need to decide whether or not to log changes to your LOB data. If you do log LOB changes, then you can stress your logs because of the size of the LOB data. If you do not log LOB changes, then you will need to make sure that you have sufficient methods to recover the LOB data because the changes between image copies won’t be there in the log. Generally speaking, most usually choose to avoid the logging of LOBs. You can turn off LOB logging by specifying NOT LOGGED in the LOB table space DDL.

There are also many restrictions on how LOB data can be used with SQL. LOB data is not like traditional, structured database data, so Db2 imposes some limitations, for example:

Cannot use a LOB in a GROUP BY or ORDER BY clause
Cannot specify SELECT DISTINCT on a LOB
LOBs cannot be used in the context of an INCLUDE(column-name) clause in a MERGE statement
Cannot define check constraints, primary key, unique, or foreign keys on LOBs
LOBs cannot be used in any predicate except EXISTS, LIKE and NULL

These are for illustrative purposes only. There are other restrictions, all of which can be found in the IBM SQL Reference manual.

Unless it is an inline LOB where the entire LOB is stored in the base table, a LOB will require a LOB table space, auxiliary table, and LOB index. When building auxiliary tables and indexes, you do not specify columns the same way that you do for normal tables and indexes. For the Auxiliary Table you specify the LOB column and base table and Db2 automatically generates the columns needed. For the Auxiliary Index you just specify the auxiliary table and Db2 implicitly generates the needed index keys.

Each table can have 0, 1, or many LOB columns and each LOB instance can be up to 2GB in size. Each table with at least 1 LOB must have a ROWID; the ROWID is a varying-length 17 byte field. One page of a LOB table space will never contain more than one LOB, but one LOB can span multiple LOB table space pages. An auxiliary table, which resides in the LOB table space, can store only one LOB column of a base table; there must be one and only one index on this column. All of these things can alter the way in which you manage and administer your Db2 tables and table spaces.

Let’s backtrack and review the size of LOBs in a little more depth. Each LOB instance can be up to 2G – and that is per row! Each LOB table space can have as many as 254 different data sets with a DSSIZE from 2G to 64G each for a total of about 16 terabytes (TB). This is per partition, so if there are 4096 partitions, then the total size for a single LOB is over 66,000 TB. That is big!

Unless all of your LOB data is static – meaning it never changes – the size of your LOB data sets will continue to grow. Are you prepared for running utilities on such large table spaces?

When you drop a LOB column from a base table, DB2 will not automatically clean up LOB table spaces. After removing the LOB column you can either drop the LOB table space yourself explicitly, or perhaps reuse it for another LOB.

And finally, LOB columns are not really updated. The old version of the LOB is de-allocated, and a new LOB is allocated. So LOBs are a bit different than the traditional data we are used to managing.

What can go wrong with LOBs?

Errors with LOBS occur when there are inconsistencies between the components of the LOB. We all know that “normal” Db2 indexes can be inconsistent with their associated table, but the issues are multiplied for LOB indexes:

The ROWID-Version number in the base table row may not be found in the LOB index.
There may be entries in the LOB index that are not referenced by any row in the base table.
The LOB data itself may not be where the LOB index points to.
There may be LOBs in the LOB table space that are not referenced by the LOB index.

CHECK DATA can be used to find errors 1 and 2 (from the list above); CHECK LOB can be used to find errors 3 and 4. But it is possible that CHECK LOB will convert a type 4 error into a type 2 error, so proceed with caution.

Then there is the issue of LOB index consistency. If the LOB index is inconsistent with the base table data, the LOB data cannot be accessed. There is no direct access to the LOB table space except through the LOB index. If the LOB index is inconsistent with the LOB table space, Db2 will get errors trying to access the LOB data for that row.

Maintaining LOBs

Db2 maintains LOB data in the LOB table space using a hierarchical structure.

LOB data in the LOB table space can be distributed over many different pages of the LOB table space. Remember, this LOB data is very large. Db2 uses a structure of map pages to point to data pages. At the top is the first map page and it is this page number that is stored in the LOB index. This first map page contains a list of pages, which can be other map pages and data pages. It also contains the total size of the LOB data. If all the data pages are not referenced by map pages or if the map pages are not properly referenced by a higher level map page, LOB data will be lost.

With all of these pointers and structures to maintain, there are a variety of things that can go wrong. To verify that your LOBs are structurally sound you must run a series of DB2 utilities, in the following order:

Run CHECK DATA to verify that the ID fields specified in the base table are also found in the LOB index.
Run CHECK INDEX to verify that the LOB index is valid.
Run CHECK LOB to verify that the internal structure of the LOB table space is sound.

Of course, there are easier ways. It can make a lot of sense to consider using a modern tool that understands the nuances of LOBs so you can manage them accordingly and appropriately.

The Bottom Line

Business and industry trends dictate that unstructured data, in the form of LOBs, is increasingly being stored in our Db2 databases. This type of data is different than traditional, structured data and must be managed with these differences in mind. To do so requires in-depth knowledge and planning to avoid inconsistencies and errors.

Monday, November 15, 2021

Db2, SQL, and Modern Coding Techniques

Mainframe application development has changed dramatically during the lifetime of Db2. Way back in the beginning, and even for a long time thereafter, most Db2 programs were written in COBOL and coded to use static SQL. This meant that the SQL was bound before it was executed and the access paths were etched in stone. And it also meant that the access paths for any SQL in almost any program were readily available to DBAs in PLAN_TABLEs.

Fast forward to the present and static SQL in COBOL is the exception rather than the rule. Oh, sure, those COBOL programs with static SQL are still there, running in CICS or in batch (or even IMS/TM), but most new development is not done this way anymore. Today, programmers use IDEs to build code that accesses mainframe Db2 data using distributed data access over the web or from a GUI.

Most modern, distributed, application development projects typically rely upon application development frameworks. The two most commonly-used frameworks are Microsoft .NET and Java/J2EE. And these frameworks use dynamic SQL, not static.

An additional contributing force is the adoption of commercial off-the-shelf applications for ERP and CRM like SAP, Peoplesoft, and Siebel (the last two of which are now Oracle products). These applications are not tied to a specific DBMS but support by multiple different DBMSs, one of which is (was) DB2 for z/OS. So, these applications use dynamic SQL because that makes it easier to accomplish multi-vendor DBMS support. For the most part though, none of these ERP packages run on the mainframe (or Db2) because two of the are now Oracle products and SAP is promoting its own DBMS (HANA).

The point though is that the manner in which Db2 applications are developed has changed. And that means the manner in which Db2 is managed is changing, too. Instead of relying on access paths already being there, DBAs must develop ways of capturing access paths for dynamic SQL statements. Prepared dynamic SQL can be cached in the dynamic statement cache so that the same SQL statement can reuse the mini plan for the statement the next time it runs. And the BIND command can be used to snap the statements in the cache and find the access paths used. But the dynamic statement cache works like a buffer pool, with least recently used statements being flushed to make room for new statements… so you might not find the dynamic SQL you are looking for, at least not without helpful tools or scripts to stay on top of things.

This change has caused many organizations to experience SQL performance problems. Because dynamic SQL is easier to introduce into the system performance can become problematic. It is difficult to get a handle on what dynamic SQL runs when. And it is difficult to track because the DBAs are not accustomed to monitoring and tuning dynamic SQL… or they just do not have the proper tools to be able to do the job appropriately. So, these black holes of dynamic SQL performance problems open up that are left to run wild.

New SQL Stuff

It is not just the new programming paradigm that causes management problems for modern DB2 usage. The amount of new SQL statements and functions and features continues to grow with each new version (and even function level) of Db2. This can be a mixed blessing though. It is good because expanding SQL functionality makes it easier to program and access Db2 data. But it can be problematic because it can be more confusing and difficult to learn these new features.

Exacerbating the issue is that many organizations do not invest a sufficient amount of time and effort in educating their staff. If the DBAs are not trained in all of the new functionality, then new SQL can cause big issues. How? Imagine a shop that just migrated to a new version or level of Db2, but the only education available was to read the manuals. So, some of the DBAs are not knowledgeable on the new SQL code. Then a developer picks up a manual, and learns how to use a new function or arrays as parameters. Kudos to the developer for the initiative, but if problems arise there may not be anybody who knows how to support the new feature.

And there is a lot of new SQL functionality being added. If we focus on just the past few versions of Db2, here is a representative list of new SQL programming related enhancements that have been introduced: TRUNCATE, new data types (DECFLOAT, VARBINARY), optimistic locking, FETCH CONTINUE, ROLE, MERGE, SELECT from MERGE, pureXML, FETCH FIRST & ORDER BY in subselect and fullselect, INTERSECT, EXCEPT, Indicator Variables, TIMESTAMP precision and time zones, Moving sums and averages, Inline and Non-inline SQL scalar functions, SQL table functions, extended implicit casting, RANK(), ROW_NUMBER(), XQuery, transparent archive query, APPLCOMPAT, IDAA/analytics, grouping sets, ROLLUP, Hadoop access, FTBs, LISTAGG…

That is a lot to learn and this is just a selection of what has been added!

Summary

Things are moving at a fast and furious pace for application developers these days. And the resultant changes can introduce problems that impact your business unless you adapt to align your management and DBA capabilities with the new development methods.

Tuesday, October 19, 2021

How Do You Store Boolean Data in Db2?

Recently, I was reading an article written by Ben Brumm on using Boolean data in SQL and it got me thinking about Db2 and Boolean data. You see, if you click over to the article, you'll see that Ben talks about several popular database systems, but not Db2!

So first of all, what is Boolean data? Sometimes all you want to do is to be able to store a Yes/No, True/False, 1/0 value in a column of a Db2 table. But that can be problematic. What data type should you choose? How should you model that column? Does it matter?

Well, up until recently, Db2 did not support a Boolean data type. But I guess I should be more clear: Db2 LUW offers Boolean support, but Db2 for z/OS does not have a built-in Boolean data type.

As of V9.7, you can use a Boolean data type for local variables, global variables, parameters, or return types in compound SQL (compiled) statements. So it is really only a viable option for SQL PL. As of V11.1, you can use the Boolean data type for a column in a Db2 table.

OK, so what should you do if you do not have Boolean support in the Db2 you are currently using? In my opinion, the best approach is to use a numeric solution with 1 as TRUE and 0 as FALSE. You might want to consider using a Character column with YES/NO or TRUE/FALSE, but there are several issues with that approach. Firstly, it will likely consume more storage, secondly without a constraint on the column users might specify “YES” or “Yes” or “y” or “Y”… or other variations to mean the same thing, and finally, if English is not your first language then users may have the urge to specify “Si” or “Da” or… well, you get the idea.

A good choice for a Boolean data type could be BINARY(1) which only stores binary data, so no constraint is needed. Or you might want to specify SMALLINT with a check constraint only allowing 0 and 1 as options.

If you want to allow "Unknown" as an option, then you can make the column nullable.

Summary

So, share your experiences with us here on the blog. Tell us, have you implemented Boolean data in your Db2 tables? How did you do it? And how is it working out for you?

Monday, October 11, 2021

What is Db2?

For regular readers of this blog the title of this particular post might seem to be a bit odd. After all, why visit the Db2Portal blog if you don't know what Db2 is, right?

Well, the purpose of this short post is to refer my regular readers to an article I wrote for TechTarget, a website that regularly posts "What Is" type of articles.

So, you guessed it, I wrote the latest, updated "What is Db2?" article for TechTarget... and I figured I'd share it with my blog readers.

Now whenever somebody asks a high-level question about Db2, like "what is it?" or "what are the advantages and disadvantages of Db2" or even "where can I run Db2?" you now have a place to send them...

Here ---> What is Db2?

Monday, August 30, 2021

What Type of Changes Cause Db2 Packages to Get Invalidated?

Db2 DBAs are constantly working with database objects such as Databases, Tablespaces, Tables, and Indexes. And many requirements cause DBAs to have to modify these objects. Some modifications may be simple, such as just issuing an ALTER statement. Others may be more in-depth, even to the point of having to DROP and re-CREATE the object.

Now I've blogged here before about the many different types of Db2 database changes and how to manage them. For reference, here is a blog post that summarizes the posts I've made on this topic.

My purpose today is not to rehash all of that information again, but to discuss one aspect of change management that probably causes DBAs the most grief: package invalidation.

Packages can be invalidated by Db2 for many reasons. For example, when a privilege that is required by a package is revoked, the package is invalidated by Db2.

When a package is invalidated it cannot be executed until it has been rebound. This can be automatic, but it is usually better to be proactive and to Rebind packages when you take an action that invalidates packages.

And as we all know, rebinding can cause access paths to change. Hopefully for the better... but not always. If access paths always got better then there would be no DBA grief, right? Whenever DBAs perform Rebinds they are always dreading that call from the developer or end-user that says "Hey, this transaction (or job) that used to run quickly is now taking forever."

So it makes sense that DBAs need to be aware of what changes cause packages to be invalidated. Of course, if you have to DROP an object that the package accessed it is obvious that a Rebind is required. But there are many other types of changes that will invalidate packages.

Fortunately, the IBM Db2 documentation is good and easy to find. Here is a link to the Db2 12 for z/OS documentation for Changes that invalidate packages. If you are a DBA, I recommend that you click on that link and bookmark that page!

I'm not going to copy and paste all of the information from the manual here (no reason to and over time it could change). But here are some of the things to keep in mind that you may not at first think will affect packages, but can:

Altering, dropping, or renaming a column
Adding date/time columns with defaults
Adding a constraint
Adding, changing, or rotating partitions in a partitioned or partition-by range UTS tablespace
Temporal and transparent archiving changes
Adding, altering, or dropping a materialized query table (MQT)
Activating or deactivating row-level or column-level access control
Enabling or disabling masks if column access control is in effect
Increasing a table space's MAXPARTITIONS attribute
Changing a table space's SEGSIZE or DSSIZE
Changing the buffer pool for a tablespace (with a different page size)
Altering indexes to add a column, change the PADDED attribute, or changing the limit key value of a partitioning index
Regenerating an index
Running the REORG utility with the REBALANCE keyword
Running the REPAIR utility on a database with the DBD REBUILD option

Again, these are just some of the admin changes that can invalidate packages. There are others and you should always refer to the current Db2 documentation for the list of things that will invalidate packages before you make any changes. Failing to do so might mean that you will have to run a mass Rebind... maybe at a time when you'd rather not!

Finally, I'll leave you with a couple of helpful queries you can run to help as you manage changes.

To identify all packages that will be invalidated by a change to a specific object, run the following query:

SELECT DISTINCT DCOLLID, DNAME, DTYPE FROM SYSIBM.SYSPACKDEP WHERE BQUALIFIER = ? AND BNAME = ? AND BTYPE = ? ORDER BY DCOLLID, DNAME;

Simply plug in the qualifier and name of the object, along with type of the object (appropriate values can be found in the Catalog table documentation in the appendix of the IBM Db2 SQL Reference manual).

And if you want to identify all invalid packages, try running this query:

SELECT COLLID, NAME, VALID
FROM SYSIBM.SYSPACKAGE
WHERE VALID <> 'Y'
ORDER BY COLLID, NAME;

Monday, August 16, 2021

SQL to Return the nth Maximum Value

Sometimes what seems like a simple request might take a little bit of thinking to devise and implement a solution. Recently, I was asked how to write SQL that returns the nth maximum value from a column. If you think about it, it isn't too difficult.

Assume that you have a table with 10 rows, and there is a column in that table that holds the values 1 through 10. The idea is to create a query that for the 7th maximum value would return a 7, the 8th an 8, and so on.

This can be accomplished with a common table expression (CTE) and the following query:

WITH TOPN AS

(
SELECT YOUR_COLUMN
FROM YOUR_TABLE
ORDER BY YOUR_COLUMN DESC
FETCH FIRST 7 ROWS ONLY

)
SELECT MIN(YOUR_COLUMN)
FROM TOPN;

Simply change the value of FETCH FIRST from 7 to whatever you wish n to be.

If there are duplicate values and you want to eliminate them from consideration, just change the SELECT to SELECT DISTINCT in the TOPN common table expression.

This should work for any type of values (numeric, character, date/time) that you wish to query.

Thursday, July 29, 2021

New IBM Storage Systems Boost Ability to Gain Value from Your Mainframe Data

Gain more value from your mainframe data with IBM Storage

Every year the amount of data that is created continues to expand. Analysts at IDC estimate that data will grow at a compound annual growth rate of 23 percent through the year 2025. Furthermore, efficient access to critical business data can mean the difference between success and failure, yet we sometimes forget about the crucial role that storage systems play in our everyday business transactions.

While storage systems have gotten more intelligent and fault-tolerant over the years, there’s always room for advances that can deliver an improved user experience. This can be seen by IBM’s latest storage announcements. The announcement highlights new and improved storage capabilities including cloud-like consumption models, data resiliency, and mainframe storage. This post will focus on the mainframe aspects of the announcement.

Why mainframe? Well, the platform continues to prosper and grow. According to the latest BMC Mainframe Survey, 90 percent of the IT leaders surveyed see the mainframe as a long-term platform for growth. The world’s largest organizations rely on the mainframe to deliver superior performance, reliability, and security. Mainframes are being used not just for traditional transaction processing and batch workloads, but also for new workloads running business analytics and AI applications on structured data. Not to mention that these large shops store most of their data on the mainframe!

What’s New

On July 20, 2021, IBM announced the next generation of its Storage for the IBM Z, the IBM DS8980F analytics class storage system. Engineered to excel for modern workloads that span transaction processing, analytical processing, and AI for native cloud and on-premises computing. The DS8980F offers high-speed and high availability as a single all-flash storage solution.

As part of this announcement, IBM is introducing improvements in Safeguarded Copy to the entire family of IBM DS8900F systems – including DS8910F, DS8950F and the new DS8980F – to greatly reduce the recovery time from a remote location to the production environment. Additionally, IBM is bringing the Safeguarded Copy function in IBM Spectrum Virtualize software to the IBM FlashSystem family and IBM SAN Volume Controller.

Bringing the focus back to the mainframe: the new IBM DS8980F storage system has been developed by IBM with its z15 mainframe hardware in mind. That means it is optimized for mainframe-class workloads. Organizations are continually looking for ways to improve the performance of their mainframe applications, and the DS8980F provides the fastest mainframe application response times. Therefore, a key method of improving performance can be to upgrade your storage system. Indeed, the new IBM DS8980F, compared to the last generation of IBM storage systems (DS8888F series), can improve response time by up to 25 percent.

Minimizing downtime is another critical requirement of modern business applications, especially for those that run on mainframes. The new IBM DS8980F delivers 7 nines of availability (99.99999 percent), an improvement of 10x over the previous generation.

Additional improvements include more than twice the amount of system cache and greater bandwidth capacity, all while requiring less energy consumption and in a lighter-weight box.

At the same time, IBM also announced a new tape library system, the IBM TS7770, with all flash cache. The most significant new feature of the TS7770 is that it provides better performance with only 1 flash drawer than the previous 10 SAS HDDs drawers, delivering faster data protection with less infrastructure.

Finally, it is possible to combine the TS7770 tape library and the IBM DS8910F (the entry version within the DS8900F family) into a single 19-inch industry-standard rack. This enables smaller and medium-sized organizations to deploy an end to end storage solution for mainframe environments, into a smaller amount of floor space with important savings in operating costs.

Summary

Data growth continues unabated, and organizations continue to use mainframes expecting them to deliver unparalleled performance and availability for their mission-critical workloads of all types.

To achieve this level of performance and availability, while managing data growth, organizations need the latest and greatest storage technology. And IBM’s latest DS8980F and TS7770 will help organizations achieve the performance and availability they require for all their application workloads.

If you’d like to learn more about the latest from IBM storage, you can read the full details in the IBM announcement.

Friday, July 16, 2021

Keeping Track of Data Movement in Db2 for z/OS

Creating and managing test data for Db2 application development and testing requirements can be a significant challenge. To enable not only the development of new programs, but to be able to maintain existing ones, organizations must ensure that there is an adequate amount of accurate test data always available. Without relevant, useful data, there is no way to test applications to make sure they are operating correctly.

Although this duty must be a shared one between the application developers and the DBAs, managing and controlling all of the data movement tasks typically falls on the DBAs. And keeping track of what data moved where, when it moved, and why can at times be as much of a challenge as moving the data itself.

Fortunately, there are test data management tools available to not only move the data, but to keep track of it. I’ve written about one of the better Db2 for z/OS data movement tools here in the blog before: Fast and Effective Db2 for z/OS Test Data Management with BCV5. I hope you'll take a moment to click and read that post.

Now BCV5 has been improved with a new reporting feature, to enable users to track the movement of data across their Db2 subsystems. This is a significant new feature that can be used to glean useful information for DBAs, storage administrators, and even by data stewards for data governance.

There are six tables of metadata that BCV5 populates to track the data movement and the copy tasks it performs. These tables are:

BCV531.TASK_EXECUTIONS
BCV531.TASK_EXECUTIONS_OBJECTS
BCV531.TASK_EXECUTIONS_PARAMETERS
BCV531.TASK_EXECUTIONS_JOBS
BCV531.TASK_EXECUTIONS_RULES
BCV531.TASK_EXECUTIONS_MASKING

The information in these tables is updated whenever BCV5 runs a task to copy Db2 data. Users can query these tables just like any other Db2 tables to monitor the details of the BCV5 tasks you have run. This information may be useful for many different IT and business professionals, but let’s take a look at three specific use cases:

database administration (DBA),
storage administration, and
data governance.

DBA tracking

DBAs tasked with moving and refreshing data from one Db2 environment to another are the typical users of BCV5, and therefore they will be one of the primary users of the new reporting tables. Most sites that use BCV5 use it to refresh test data, for example, copying production data to test, or copying unit test data to an integration test set of tables.

Regardless of the type of data movement that is being undertaken, it is usually being done for multiple tables, tablespaces, and databases. Usually, there will be regularly scheduled processes that copy some of the data, but this is rarely sufficient as there will be on-off requests, special situations, and emergency data refreshes happening all the time. Keeping track of such a hectic morass of copying and refreshing data can be difficult.

Fortunately, if you are using BCV5 the new Usage Tracker tables can simplify keeping track of data refreshes for DBAs. For example, a DBA looking to find out which BCV5 copy tasks were run during the month of May could code a query like:

SELECT T.ROWDATE, T.USER, T.TASKNAME

FROM BCV531.TASK_EXECUTIONS T

WHERE T.ROWDATE BETWEEN ´2021-05-01´ AND ´2021-05-31´

 ORDER BY T.ROWDATE  ;

This will show all the BCV5 copy tasks that ran during that timeframe, and will look something like this:

ROWDATE USER TASKNAME

---------+---------+---------+---------+---------

2021-05-12 USERID1 TSK0001

2021-05-12 USERID1 TSK0002

2021-05-20   USERID5  TSKPROD1 
2021-05-21   USERID9  TSKPROD4

…

The results shown here are just a sample and will likely be a subset of the actual results of running such a query.

Of course, this is rudimentary information and it is likely that the DBA will want to know more, such as which objects were impacted by these tasks. A query such as the following will come in handy:


  SELECT SUBSTR(SRCSCHEMA,1,8) AS SS,

SUBSTR(SRCNAME,1,12) AS SN,

SUBSTR(TGTSCHEMA,1,8) AS TS,

SUBSTR(TGTNAME,1,12) AS TN,

SUBSTR(OBJTYPE,1,1),

ROWDATE AS DATE_COPIED,

SIZEKB

FROM BCV531.TASK_EXECUTIONS_OBJECTS

ORDER BY DATE_COPIED ;

The results here show all the Db2 objects copied by BCV5 showing the source and target names as well as the object type, date copied, and amount of data (in KB) copied:


SS      SN       TS     TN         DATE_COPIED   SIZEKB

---------+---------+---------+---------+---------+-----

DB500XA TS500X01 DBA001 TS500XA1 S 2021-05-17 1462480

QUALID XCL59011 TESTID XCL59011 X 2021-05-17 325040

QUALID XCL59012 TESTID XCL59012 X 2021-05-17 125200

QUALID XCL59013 TESTID XCL59013 X 2021-05-17 301460

QUALID XCL5901C TESTID XCL5901C X 2021-05-17 98400

QUALID TEST_TBL TESTID TEST_TBL T 2021-05-17 20

…

Again, the results have been truncated as this is intended as an example.

A DBA looking to track down the results of a specific copy task that ran on a specific date might want to run a query like this, to verify which objects were copied. Simply plug in the name of your task and the date it ran:


  SELECT T.TASKNAME, O.OBJTYPE, O.SRCSCHEMA, O.SRCNAME,

O.TGTSCHEMA, O.TGTNAME, O.PARTITIONS

FROM BCV531.TASK_EXECUTIONS T,

BCV531.TASK_EXECUTIONS_OBJECTS O

WHERE T.ID = O.EXECID

AND T.TASKNAME = ?

AND    T.ROWDATE = ? ;

Storage Administration Tracking

Another type of user who might find the Usage Tracker capabilities of BCV5 useful is the storage administrator. Storage administrators are responsible for managing an organization’s disk and tape systems. Additionally, they are also responsible for monitoring storage usage and capacity to ensure that sufficient storage is available for the organization’s IT requirements.

As such, the storage administrator will likely want to keep an eye on the data movement activities of BCV5. For example, a query such as this one can be used to report on the total amount of data (TS and IX) copied by date:


  SELECT ROWDATE AS DATE_COPIED,

SUM(SIZEKB) AS TOTAL_KB_COPIED

FROM BCV531.TASK_EXECUTIONS_OBJECTS

WHERE OBJTYPE IN ('S', 'X')

GROUP BY ROWDATE ;

Which will return data similar to this:


DATE_COPIED  TOTAL_KB_COPIED

---------+---------+---------+---------+------

2021-06-12 106231270

2021-06-19 106231270

2021-06-21 4451457810

2021-06-26 106231270

…

Another potentially useful query, not only for storage administrators and DBAs, but also for application managers, is tracking the amount of actual (tablespace) data copied by date and application. Finding the application name or identifier can be tricky, but if we assume that an application identifier is embedded in the second 2 chars of database name then a query like this can be run:


  WITH SIZEBYAPP AS (

SELECT ROWDATE AS DATE_COPIED,

SUBSTR(TGTSCHEMA,2,2) AS APPL_NAME,

SIZEKB AS SIZE_IN_KB

FROM BCV531.TASK_EXECUTIONS_OBJECTS

WHERE OBJTYPE = 'S'

)

SELECT DATE_COPIED, APPL_NAME,

SUM(SIZE_IN_KB) AS TOTAL_KB_COPIED

FROM SIZEBYAPP

GROUP BY DATE_COPIED, APPL_NAME ;

Which might return a report looking something like this:

DATE_COPIED APPL_NAME TOTAL_KB_COPIED

---------+------------------+---------+-------

2021-05-11 EN 46805760

2021-05-22 BA 242791056

2021-05-22 BX 4094640

2021-05-22 CM 1008720

2021-05-22 DA 270390816

2021-05-22 OR 90528

2021-05-26 PR 55737376

2021-05-26 XX 537647328

…

You can adjust this query if you want to know the amount of index data copied by data and application like so (under the same assumption as above for application identifier):


   WITH SIZEBYAPP AS (

SELECT B.ROWDATE AS DATE_COPIED,

SUBSTR(T.DBNAME,2,2) AS APPL_NAME,

B.SIZEKB AS SIZE_IN_KB

FROM BCV531.TASK_EXECUTIONS_OBJECTS B,

SYSIBM.SYSINDEXES X,

SYSIBM.SYSTABLES T

WHERE B.OBJTYPE = 'X'

AND B.TGTNAME = X.NAME

AND B.TGTSCHEMA = X.CREATOR

AND X.TBNAME = T.NAME

AND X.TBCREATOR = T.CREATOR

)

SELECT DATE_COPIED, APPL_NAME,

SUM(SIZE_IN_KB) AS TOTAL_KB_COPIED

FROM SIZEBYAPP

GROUP BY DATE_COPIED, APPL_NAME ;

Data Governance Tracking

Although tracking data movement activities is useful for DBAs, it is also important for data governance reporting. Data governance refers to the processes and standards of ensuring access to high-quality data throughout an organization. Data governance encompasses all aspects of data quality including its accuracy, availability, consistency, integrity, security, and usability. The role of data governance has expanded as data privacy rules and regulations have expanded in response to an increasing number of data breaches and hacker attacks. For example, in early July 2021, Colorado passed the Colorado Privacy Act, meaning that now Colorado, Virginia, and California have passed data privacy legislation that impacts how personal data must be governed. More states are certain to follow their lead… and let’s not forget the European GDPR act!

These type of regulations provide rights for access, deletion, correction, portability, and protection for personally identifiable information, or PII. Note that “portability” is one aspect of data protection covered under the auspices of such regulations… and BCV5 is a mover of data, so you need to be able to track what data moved where, especially when the data that moved contains any PII.

So, what types of queries can be run using the new BCV5 reporting tables to help satisfy the needs of data governance?

Well, if you have identified specific tables that have personally identifiable information, and therefore requires specific policies to ensure its privacy and protection, a data steward might want to run a query that shows all of the times that a specific protected table was copied:


   SELECT  SUBSTR(SRCSCHEMA,1,8) AS SS,

SUBSTR(SRCNAME,1,12) AS SN,

SUBSTR(TGTSCHEMA,1,8) AS TS,

SUBSTR(TGTNAME,1,12) AS TN,

ROWDATE AS DATE_COPIED,

SIZEKB

FROM BCV531.TASK_EXECUTIONS_OBJECTS

WHERE OBJTYPE = 'T'

AND SRCSCHEMA = ?

AND SRCNAME = ?

ORDER BY DATE_COPIED ;

Simply code the appropriate schema (SRCSCHEMA) and table name (SRCNAME) and this query will show all the times that the particular table (say PROD.CUSTOMERS) was copied. A data governance professional with a list of tables that contains PII could alter this query to accept that list as an IN clause instead of the simple equality clause shown here.

Additionally, BCV5 can de-identify sensitive data using masking. Whenever a task that requires data to be masked is run, information is captured in the TASK_EXECUTIONS_MASKING table. So, a data governance professional might want to run a query like this one to report on all the masking of sensitive data.


    SELECT SUBSTR(TBCREATOR,1,8) AS TBCREATOR,

SUBSTR(TBNAME,1,12) AS TBNAME,

SUBSTR(COLNAME,1,18) AS COLNNAME,

METHOD AS MASKING_METHOD,

SQLEXPR

FROM BCV531.TASK_EXECUTIONS_MASKING

ORDER BY ROWDATE ;

This can always be modified to join it to the TASK_EXECUTIONS table to obtain the task name if that is important. And with a little manipulation of the query it is possible to look for tables that contain sensitive data that have been copied using BCV5, but have not had masking applied.

Summary

BCV5 has offered powerful data movement and masking capabilities for Db2 data for a long time, but now it also offers the ability to track and report on your organization’s Db2 data movement and copy tasks. This new functionality opens a plethora of useful information for BCV5 users.

Tuesday, July 06, 2021

How Many Temporal Tables Does Your Site Have?

Sometimes it can be difficult to remember where information is stored in the Db2 Catalog. Usually, with a little rumination and a little review of Appendix A of the IBM Db2 SQL Reference manual (SC27-8859), you can come up with a solution.

For example, I was talking to some DBAs who were trying to remember if they had ever created any business-time temporal tables. A comment was made that we could surely find that in the Db2 Catalog and the conversation moved along... but then I thought, hmmm, let me see what I can do about coming up with a catalog query.

The first step was to think about where this information might be found, which took me to SYSTABLES. A good first thought, but no, it isn't there. So I thought, how about SYSCOLUMNS? And lo' and behold, there was the answer.

The columns identified as the start and end date/time for the temporal range are documented in SYSCOLUMNS in the PERIOD column. PERIOD is defined as a CHAR(1) column and it contains one of the following values for every column defined for each table:

Value	Meaning
B	Column is the start of period BUSINESS_TIME
C	Column is the end of period BUSINESS_TIME with an exclusive endpoint
I	Column is the end of period BUSINESS_TIME with an inclusive endpoint
S	Column is the start of period SYSTEM_TIME
T	Column is the end of period SYSTEM_TIME
blank	Column is not used as either the start or the end of a period

So using this information, here is a query that will show information about all of the business-time temporal tables you have created:

SELECT SUBSTR(TBCREATOR,1,8) || '.' || SUBSTR(TBNAME,1,30) 
       AS TABLENAME,
       SUBSTR(NAME,1,40) AS COLUMNNAME, 
       COLNO, 
       PERIOD
FROM   SYSIBM.SYSCOLUMNS
WHERE  PERIOD IN ('B', 'C', 'I')
ORDER BY TABLENAME, PERIOD, COLUMNNAME;

If you want to find the system-time temporal tables, just swap out the WHERE clause with this one:

WHERE PERIOD IN ('S', 'T')

By becoming adept at querying the Db2 Catalog tables you can find out just about everything you want to know about the databases and objects defined in your Db2 subsystems!