Thursday, May 14, 2026

The Mullins 5‑Step Performance Diagnosis Framework

In the complex ecosystem of enterprise mainframes and hybrid cloud environments, performance tuning often feels like chasing ghosts in the machine. However, achieving peak efficiency in systems like Db2 doesn't require magic. No, it requires a repeatable, disciplined methodology.

Whether you are managing high-volume digital transactions or high-concurrency analytics, the following 5-Step Performance Diagnosis Framework provides a repeatable blueprint for tracking down enterprise performance bottlenecks. The framework consists of the following 5 steps...

  1. Analyze workload patterns
  2. Review SQL and access paths
  3. Inspect locking and contention
  4. Evaluate buffer pool efficiency
  5. Check for recent changes

...and it can help to resolve the majority of enterprise database/application performance issues.

Let's examine each step in a little more detail.

1. Analyze Workload Patterns

Before diving in and turning technical knobs, you first have to understand the nature of the demand. Is the performance degradation systemic or isolated?

  • Systemic issues are environment-wide or application-wide degradations. They affect a broad range of users and processes simultaneously, regardless of the specific task being performed.

  • Isolated issues are localized to a specific transaction, query, or user group. While one part of the system is failing to meet service level agreements (SLAs), the rest of the environment may be operating at peak efficiency.

By differentiating between these two types of problems, you can avoid the "shotgun approach" to tuning, such as changing global settings (e.g., DSNZPARMs) to fix a local problem. This often results in unintended consequences for the rest of the enterprise.

When you analyze workload patterns, you can differentiate between a runaway batch job and a steady-state increase in transaction volume. Understanding the "shape" of your data traffic is the first step in identifying if the issue is a sudden spike or a gradual trend toward resource exhaustion.
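One quick way to get that shape, if you use dynamic SQL, is to snapshot the dynamic statement cache and see what dominates. A minimal sketch, assuming cache statistics are being collected (IFCID 316/318 active) and DSN_STATEMENT_CACHE_TABLE exists under your SQLID: one statement towering over the rest suggests an isolated problem, while uniformly elevated numbers point to something systemic.

  -- Snapshot the dynamic statement cache into DSN_STATEMENT_CACHE_TABLE
  EXPLAIN STMTCACHE ALL;

  -- Rank what is consuming the subsystem right now
  SELECT PROGRAM_NAME,
         STAT_EXEC,                         -- executions since caching
         STAT_CPU,                          -- accumulated CPU time
         STAT_ELAP,                         -- accumulated elapsed time
         SUBSTR(STMT_TEXT, 1, 60) AS STMT_PREFIX
    FROM DSN_STATEMENT_CACHE_TABLE
   ORDER BY STAT_CPU DESC
   FETCH FIRST 20 ROWS ONLY;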

2. Review SQL and Access Paths

The most significant performance gains are almost always found in the SQL. A single poorly written query or a sub-optimal access path chosen by the optimizer can cripple an application. Reviewing the EXPLAIN output to verify index usage, join methods, and sort requirements is critical. In a world of evolving database engines, ensuring your SQL remains efficient is a primary responsibility of the modern DBA.
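If you don't have a visual tool handy, the raw mechanics are straightforward. A minimal sketch (the ORDERS table, its columns, and QUERYNO 101 are invented for illustration; PLAN_TABLE must exist under your SQLID):

  -- Ask the optimizer to externalize its chosen access path
  EXPLAIN PLAN SET QUERYNO = 101 FOR
    SELECT O.ORDER_ID, O.TOTAL_AMT
      FROM ORDERS O
     WHERE O.CUST_ID = 12345
       AND O.ORDER_DATE > CURRENT DATE - 30 DAYS;

  -- Verify index usage, join methods, and sorts
  SELECT QBLOCKNO, PLANNO, METHOD, TNAME, ACCESSTYPE,
         MATCHCOLS, ACCESSNAME, INDEXONLY, PREFETCH
    FROM PLAN_TABLE
   WHERE QUERYNO = 101
   ORDER BY QBLOCKNO, PLANNO;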

Using tools that create a visual diagram and "explanation" of your EXPLAIN output can greatly improve the efficiency of your SQL tuning efforts. The primary tool used by Db2 for z/OS sites these days is IBM Db2 Administration Foundation for z/OS, the modern, browser-based replacement for IBM Data Studio. Built on the open source Zowe framework, it provides a streamlined interface for mainframe DBAs and displays SQL execution plans as a graphical tree, showing table scans, index lookups, and join methods.

Consider reviewing and using my recently published blog post documenting my SQL review workflow.

3. Inspect Locking and Contention

In a high-availability Parallel Sysplex or clustered environment, concurrency is king. However, where there is concurrency, there is the potential for contention. Inspecting lock wait times, deadlocks, and timeout statistics can reveal hidden bottlenecks where processes are fighting over the same resources. Tuning your commit frequency and isolation levels can often alleviate these "soft" performance hits that don't show up in CPU metrics alone.

Indeed, one of the most common problems I find in my consulting engagements is what I call Bachelor Programming Syndrome or "Fear of COMMITing."

Unless you plan for, and issue, regular COMMITs in your programs that access Db2 data, you will cause locking problems. It is important for every programmer to issue COMMIT statements in all application programs where data is modified (INSERT, UPDATE, and DELETE).


The COMMIT externalizes the modifications made by the program since it began (or since the last COMMIT), ensuring the changes are recorded and recoverable, thereby protecting data integrity. Without COMMITs, data remains locked, making it inaccessible to others and eventually causing timeout failures.
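To make the pattern concrete, here is a minimal sketch written as a Db2 native SQL procedure (the PENDING_WORK table, its columns, and the 1,000-row interval are invented; the same WITH HOLD cursor technique applies in COBOL or any other host language, and COMMIT is permitted here because the procedure is not invoked from a trigger or atomic context):

  CREATE PROCEDURE WORK_OFF_PENDING ()
  LANGUAGE SQL
  BEGIN
    DECLARE V_ID INTEGER;
    DECLARE V_DONE INTEGER DEFAULT 0;
    DECLARE V_COUNT INTEGER DEFAULT 0;
    -- WITH HOLD keeps the cursor open across COMMITs
    DECLARE CSR CURSOR WITH HOLD FOR
      SELECT ACCT_ID
        FROM PENDING_WORK
       FOR UPDATE OF STATUS;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET V_DONE = 1;

    OPEN CSR;
    FETCH CSR INTO V_ID;
    WHILE V_DONE = 0 DO
      UPDATE PENDING_WORK
         SET STATUS = 'PROCESSED'
       WHERE CURRENT OF CSR;
      SET V_COUNT = V_COUNT + 1;
      IF MOD(V_COUNT, 1000) = 0 THEN
        COMMIT;                  -- externalize work, release locks
      END IF;
      FETCH CSR INTO V_ID;
    END WHILE;
    CLOSE CSR;
    COMMIT;                      -- commit the final partial batch
  END

The right commit interval depends on your workload; the unit of work should always end at a point of data consistency.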

4. Evaluate Buffer Pool Efficiency

Data that stays in memory is data that moves fast. Evaluating buffer pool hit ratios and asynchronous I/O rates is essential for ensuring that your most frequently accessed "hot" data isn't being constantly paged out to disk. Efficient memory management reduces the I/O overhead that often serves as the hard ceiling for database performance.

When allocating Db2 buffer pools, keep some basic rules of thumb in mind:

  • Don't allocate everything to a single buffer pool (e.g., BP0); use a multiple buffer pool strategy.
  • Explicitly specify a buffer pool for every table space and index.
  • Isolate the Db2 Catalog in BP0 (and BP8K0 as of V8); put user and application Db2 objects into other buffer pools.
  • It is a best practice to separate indexes from table spaces, with each in its own dedicated buffer pool.
  • Consider isolating heavily hit data into its own buffer pool to better control performance.
  • Consider isolating sorts into a single buffer pool tuned for mostly sequential access (e.g., BP7).
  • Consider separating Db2 objects into buffer pools configured for sequential vs. random access.
  • Consider pinning small, high-usage tables in memory by dedicating a buffer pool to each and using the PGSTEAL(NONE) option.
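To make those rules concrete, here is a sketch of the DDL and commands involved (the pool numbers, sizes, and object names are arbitrary examples, not recommendations):

  -- Explicitly assign objects to pools (DDL)
  ALTER TABLESPACE MYDB.MYTS BUFFERPOOL BP2;
  ALTER INDEX MYSCHEMA.MYIX BUFFERPOOL BP3;

  -- Size and tune the pools (Db2 commands)
  -- VPSEQT is the sequential steal threshold: high for sort work,
  -- lower for pools serving mostly random access
  -ALTER BUFFERPOOL(BP2) VPSIZE(100000) VPSEQT(50)
  -ALTER BUFFERPOOL(BP7) VPSIZE(40000) VPSEQT(95)
  -- Pin a small, heavily read table by giving it its own pool
  -ALTER BUFFERPOOL(BP9) VPSIZE(5000) PGSTEAL(NONE)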

5. Check for Recent Changes

The most common cause of a performance "event" is change. Whether it’s a recent REORG, a change in Db2 function level (or even maintenance being applied), or a modified application module, you must look at the delta between "yesterday" and "today". Always audit the most recent modifications to the environment to see if a configuration shift has triggered an unintended consequence in the execution plan.
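Two hedged starting points for that audit, using the Db2 catalog (adjust the 7-day window to suit your change cadence):

  -- Packages bound or rebound recently (possible access path changes)
  SELECT COLLID, NAME, VERSION, BINDTIME
    FROM SYSIBM.SYSPACKAGE
   WHERE BINDTIME > CURRENT TIMESTAMP - 7 DAYS
   ORDER BY BINDTIME DESC;

  -- Recent utility events (REORG, LOAD, COPY, etc.) recorded in SYSCOPY
  SELECT DBNAME, TSNAME, ICTYPE, TIMESTAMP
    FROM SYSIBM.SYSCOPY
   WHERE TIMESTAMP > CURRENT TIMESTAMP - 7 DAYS
   ORDER BY TIMESTAMP DESC;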


Summary: While AI-driven tools are beginning to assist in database administration, the foundational logic of performance tuning remains rooted in these core pillars. By systematically moving through workload, SQL, locking, memory, and change management, you can stabilize and optimize even the most demanding enterprise environments.

 

How has your team integrated these diagnostic steps into your current hybrid cloud monitoring workflow?

Wednesday, April 29, 2026

A SQL review workflow


It is important to review SQL statements during testing... certainly before they get to production! A good SQL review workflow shouldn’t feel like bureaucracy; instead, it should feel like a safety net that consistently catches performance problems before they hit production. The key is to structure the workflow so that it’s repeatable, lightweight enough to follow, and grounded in how databases actually execute queries.

Here’s a practical, end-to-end workflow you can implement.

1. Define the Context First (Before Looking at SQL)

Before reviewing the query itself, establish intent:

  • What is the query supposed to do?
  • Expected result set size (1 row, 100 rows, millions?)
  • Frequency of execution (ad hoc vs. thousands per minute)
  • Performance expectations (sub-second? batch window?)

Without this, you can’t judge whether a query is “fast enough.”

2. Perform a Structural Review

This is a quick, surface-level pass to catch obvious issues:

  • Avoid SELECT * (unless absolutely justified... and it almost never is)
  • Ensure proper filtering (WHERE clauses exist and are meaningful)
  • Check join conditions (no accidental cross joins)
  • Validate correct use of INNER vs OUTER joins
  • Look for unnecessary nested subqueries or overly complex CTE chains
  • Confirm consistent aliasing and readability

This step is less about performance tuning and more about preventing obvious inefficiencies.
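For example, two of the most common structural fixes look like this (ORDERS and CUSTOMER are hypothetical tables):

  -- Flagged: SELECT * plus an implicit join
  SELECT *
    FROM ORDERS O, CUSTOMER C
   WHERE O.CUST_ID = C.CUST_ID;

  -- Preferred: explicit columns and explicit join syntax
  SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
    FROM ORDERS O
    JOIN CUSTOMER C
      ON O.CUST_ID = C.CUST_ID;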

3. Validate Data Access Paths

Now move into performance-critical territory:

  • Are indexes available for:
    • Join columns?
    • Filter predicates?
    • Sorting/grouping columns?
  • Are functions being applied to indexed columns (breaking index usage)?
    • For example: WHERE YEAR(order_date) = 2024 (unless you have an index on the expression)
  • Are implicit data type conversions occurring?

If access paths are wrong, everything else becomes irrelevant.
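For instance, the YEAR(order_date) predicate above can usually be rewritten so the column is left bare, turning a non-indexable predicate into an indexable range:

  -- Non-indexable: the function is applied to the column
  WHERE YEAR(ORDER_DATE) = 2024

  -- Indexable equivalent: a range predicate on the bare column
  WHERE ORDER_DATE BETWEEN '2024-01-01' AND '2024-12-31'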

4. Review the Execution Plan

This is probably the most important step. You need to be able to review PLAN_TABLE output and understand what the columns mean. A visual EXPLAIN capability (available from any number of tools) is also useful to simplify this step and to include additional information, such as statistics, column types and lengths, and so on.

Look for:

  • Table scans on large tables (red flag unless justified)
  • Index scans vs. index seeks
  • Join methods (nested loop vs. merge scan vs. hybrid)
  • High-cost operators
  • Cardinality estimation issues (expected vs. actual rows)

Questions to ask:

  • Is the optimizer choosing the expected path?
  • Are there warning signs like excessive sorts or large intermediate results?

An Aside: The PLAN_TABLE

A lot of information is contained in the PLAN_TABLE. The optimizer populates the PLAN_TABLE with data representing the access paths it chooses. Many questions can be answered by analyzing the PLAN_TABLE rows for the query (or package) being tested, questions like:
  • If tables are joined, what type of join is used (nested loop, merge scan, hybrid)?
  • Was an index used, and if so, how many columns matched?
  • Is a scan used, and if so, what type (full or page range)?
  • Is prefetch used, and if so, what type (sequential or list)?
  • Was a hint used?
  • Was parallelism used, and if so, what degree and type (I/O, CPU, Sysplex)?
  • Was a sort required, and if so, why (join, unique, GROUP BY, ORDER BY)?
  • What type of locking is required?
And that just covers the main PLAN_TABLE. Keep in mind that EXPLAIN can also populate multiple additional metadata tables, if they exist. These are documented in the addendum at the end of this blog post.
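Mapped to the questions above, a minimal PLAN_TABLE query might pull the following columns (names per the documented PLAN_TABLE format; QUERYNO 101 is an example):

  SELECT QBLOCKNO, PLANNO,
         METHOD,            -- 1=nested loop, 2=merge scan, 4=hybrid join
         ACCESSTYPE,        -- e.g., I=index access, R=table space scan
         MATCHCOLS,         -- number of matching index columns
         PAGE_RANGE,        -- Y if page range screening limits the scan
         PREFETCH,          -- S=sequential, L=list prefetch
         HINT_USED,
         PARALLELISM_MODE,  -- I=I/O, C=CPU, X=Sysplex
         ACCESS_DEGREE,
         SORTN_JOIN, SORTC_UNIQ, SORTC_GROUPBY, SORTC_ORDERBY,
         TSLOCKMODE         -- lock mode acquired on the table space
    FROM PLAN_TABLE
   WHERE QUERYNO = 101
   ORDER BY QBLOCKNO, PLANNO;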

5. Evaluate Data Volume Impact

Because the size of development test data/tables is usually nowhere near the size of production, be sure to test the query against realistic data volumes (and/or realistically set statistics). Small dev/test volumes frequently hide problems!

  • Check:
    • Execution time
    • I/O (logical/physical reads)
    • Memory usage
  • Consider worst-case scenarios:
    • End-of-month data spikes
    • Historical queries over large ranges
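When production-sized data simply isn't available, one long-standing (if sharp-edged) technique for "realistically set statistics" is to update the catalog statistics in test to mirror production, so the optimizer sees production-scale numbers. A heavily hedged sketch: the object names and values below are invented, the columns shown are documented updatable statistics columns, and any such change should be coordinated with your DBA team and followed by a rebind or re-EXPLAIN.

  -- Make the test ORDERS table "look" production-sized to the optimizer
  UPDATE SYSIBM.SYSTABLES
     SET CARDF = 50000000, NPAGESF = 900000
   WHERE CREATOR = 'TESTQUAL' AND NAME = 'ORDERS';

  UPDATE SYSIBM.SYSINDEXES
     SET FIRSTKEYCARDF = 250000, FULLKEYCARDF = 50000000,
         NLEAF = 120000, NLEVELS = 4
   WHERE CREATOR = 'TESTQUAL' AND NAME = 'ORDERS_IX1';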

Another Aside: Testing Like Production

Keep in mind that additional differences (other than data volume) between test and production can impact access path selection. For example, different buffer pool sizes or running under different Db2 versions can cause different access paths to be chosen.

6. Check for Scalability Risks

A query that works today may fail tomorrow.

Look for:

  • Stage 2 and non-indexable predicates
  • Repeated correlated subqueries (a common rewrite is sketched below)
  • Row-by-row processing
  • Excessive temp table usage
  • Sorting or grouping large data sets without index support

Ask yourself questions like: "What happens when the data grows 10x?"
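As an example of the correlated subquery risk flagged above, here is a common rewrite (CUSTOMER and ORDERS are hypothetical; the optimizer can sometimes make this transformation itself, but being explicit removes the doubt):

  -- The subquery is logically re-evaluated for each customer row
  SELECT C.CUST_ID, C.CUST_NAME
    FROM CUSTOMER C
   WHERE (SELECT COUNT(*)
            FROM ORDERS O
           WHERE O.CUST_ID = C.CUST_ID) > 10;

  -- Join-and-group equivalent that passes over ORDERS once
  SELECT C.CUST_ID, C.CUST_NAME
    FROM CUSTOMER C
    JOIN ORDERS O
      ON O.CUST_ID = C.CUST_ID
   GROUP BY C.CUST_ID, C.CUST_NAME
  HAVING COUNT(*) > 10;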

7. Validate Concurrency and Locking Behavior

Performance isn’t just speed; it’s also how queries behave under load. I frequently say that to optimize performance you want to decrease I/O, decrease CPU usage, and increase concurrency. Examine your SQL and access paths for the following:

  • Does the query:
    • Lock large portions of a table?
    • Escalate locks?
  • Is isolation level appropriate?
  • Are there risks of blocking or deadlocks?
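A few statement-level levers worth checking during the review (Db2 for z/OS syntax; the ORDERS table is hypothetical):

  -- Reporting query that can tolerate uncommitted data: no read locks
  SELECT ORDER_ID, STATUS
    FROM ORDERS
    WITH UR;

  -- Queue-style processing: bypass rows other transactions hold locked
  SELECT ORDER_ID
    FROM ORDERS
   WHERE STATUS = 'PENDING'
   FETCH FIRST 10 ROWS ONLY
    WITH CS
   SKIP LOCKED DATA;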

8. Recommend Improvements

At this point, provide actionable feedback:

  • Query rewrites (e.g., replace subqueries with joins or vice versa)
  • Index recommendations
  • Partitioning suggestions (if applicable)
  • Caching or materialization options
  • Pre-aggregation strategies

Keep recommendations prioritized:

  1. High impact / low effort
  2. High impact / high effort
  3. Nice-to-have

9. Re-Test and Benchmark

After changes:

  • Compare before vs. after:
    • Execution time
    • Logical reads
    • CPU usage
  • Validate that improvements hold under realistic conditions

10. Document and Approve

Capture:

  • Final version of the query
  • Key decisions (e.g., why a specific index was added)
  • Known limitations or assumptions

Then approve for deployment.

11. Post-Deployment Monitoring (Often Missed)

After release:

  • Monitor query performance in production
  • Watch for:
    • Plan regressions
    • Data growth effects
    • Parameter sensitivity issues

Optional: Lightweight Checklist Version

If you want something quick and enforceable:

  • No SELECT *
  • Proper joins and filters
  • Indexes support predicates and joins
  • Execution plan reviewed
  • No large table scans (unless justified)
  • Tested with realistic data volume
  • No obvious scalability risks
  • Concurrency impact considered

Final Thought

The biggest mistake in SQL reviews is treating them like code reviews. They’re not. SQL performance lives at the intersection of data distribution, indexing, and optimizer behavior. A good workflow forces reviewers to engage with all three—not just the syntax.


Addendum

Additional EXPLAIN Tables


  • DSN_COLDIST_TABLE: contains non-uniform column group statistics that are obtained dynamically by Db2 from non-index leaf pages.
  • DSN_DETCOST_TABLE: contains information about detailed cost estimation of the mini-plans in a query.
  • DSN_FILTER_TABLE: contains information about how predicates are used during query processing.
  • DSN_FUNCTION_TABLE: contains descriptions of functions that are used in specified SQL statements.
  • DSN_KEYTGTDIST_TABLE: contains non-uniform index expression statistics that are obtained dynamically by the Db2 optimizer.
  • DSN_PGRANGE_TABLE: contains information about qualified partitions for all page range scans in a query.
  • DSN_PGROUP_TABLE: contains information about the parallel groups in a query.
  • DSN_PREDICAT_TABLE: contains information about all of the predicates in a query; also used as input when you issue a BIND QUERY command to override the selectivity of predicates for matching SQL statements.
  • DSN_PREDICATE_SELECTIVITY: contains information about the selectivity of predicates that are used for access path selection; also used as an input table for the BIND QUERY command when selectivity overrides are specified.
  • DSN_PTASK_TABLE: contains information about all of the parallel tasks in a query.
  • DSN_QUERYINFO_TABLE: contains information about the eligibility of query blocks for automatic query rewrite, the MQTs considered for eligible query blocks, the reasons why ineligible query blocks are not eligible, and information about acceleration of query blocks.
  • DSN_QUERY_TABLE: contains information about an SQL statement, displaying the statement before and after query transformation.
  • DSN_SORTKEY_TABLE: contains information about sort keys for all of the sorts required by a query.
  • DSN_SORT_TABLE: contains information about the sort operations required by a query.
  • DSN_STATEMENT_CACHE_TABLE: contains information about the SQL statements in the statement cache, captured as the result of an EXPLAIN STMTCACHE ALL statement.
  • DSN_STATEMNT_TABLE: contains information about the estimated cost of specified SQL statements.
  • DSN_STAT_FEEDBACK: contains recommendations for capturing missing or conflicting statistics that are identified during EXPLAIN.
  • DSN_STRUCT_TABLE: contains information about all of the query blocks in a query.
  • DSN_VIEWREF_TABLE: contains information about all of the views and materialized query tables that are used to process a query.

Wednesday, April 08, 2026

A High-Level Guide for Db2 Database Health Checks

Here’s a practical, field-tested 10-point Db2 database health check you can use to quickly assess the state of your environment and identify risk areas.


1. Buffer Pool Efficiency

  • Check hit ratios (GETPAGE vs. I/O)
  • Look for excessive synchronous reads
  • Ensure critical objects reside in appropriately sized buffer pools
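The hit ratio arithmetic itself is simple; this comment-only sketch shows the calculation commonly used (counter names vary by monitor, so treat them as placeholders):

  -- Using counters from -DISPLAY BUFFERPOOL(BPn) DETAIL:
  --   hit ratio % = (GETPAGEs - pages read from disk) / GETPAGEs * 100
  -- Example: 1,000,000 GETPAGEs; 50,000 synchronous reads and
  -- 30,000 pages read via prefetch:
  --   (1000000 - 80000) / 1000000 * 100 = 92%
  -- Watch synchronous reads especially: each one makes an application wait.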

2. SQL Performance & Access Paths

  • Identify top CPU-consuming and elapsed time SQL
  • Look for table scans, non-matching index scans, and sort activity
  • Validate access paths haven’t regressed (especially after RUNSTATS or REBIND)

3. Index Effectiveness

  • Check for unused or duplicate indexes
  • Identify missing indexes for high-frequency queries
  • Review clustering ratios and index levels

4. RUNSTATS Currency & Quality

  • Ensure statistics are up to date on critical tables and indexes
  • Verify distribution stats (FREQVAL, HISTOGRAM) where needed
  • Watch for stale or missing stats causing poor optimizer choices
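As a sketch, a RUNSTATS invocation that collects distribution statistics on a skewed column might look like this (the database, table, and column names are invented; HISTOGRAM is the alternative for range-skewed data):

  RUNSTATS TABLESPACE MYDB.MYTS SHRLEVEL CHANGE
    TABLE(MYQUAL.ORDERS)
      COLGROUP(ORDER_STATUS) FREQVAL COUNT 10
    INDEX(ALL)
    REPORT YES UPDATE ALL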

5. Locking & Concurrency

  • Monitor lock waits, timeouts, and deadlocks
  • Identify hot objects or pages
  • Evaluate isolation levels and commit frequency

6. I/O Performance

  • Review I/O response times by dataset and volume
  • Check for hotspots or imbalance across storage groups
  • Ensure proper striping and dataset placement

7. Logging & Recovery Health

  • Monitor active log utilization and archive activity
  • Check for long-running units of work delaying log reuse
  • Validate backup and recovery procedures (image copies, PIT recovery readiness)

8. Utilities & Maintenance

  • Ensure REORGs are run where needed (based on RTS, not just schedule)
  • Validate COPY, RUNSTATS, and REORG cadence
  • Look for objects in advisory REORG or COPY pending states

9. System Resource Utilization

  • Track CPU consumption (general purpose vs. zIIP if applicable)
  • Monitor memory usage and EDM pool (for Db2 for z/OS)
  • Watch for thread reuse and connection pooling efficiency

10. Configuration & ZPARM Settings

  • Review key ZPARMs (or Db2 LUW configuration parameters)
  • Validate they align with workload characteristics
  • Check for outdated defaults that may no longer be optimal

Bonus Tip: Trend, Don’t Snapshot

A single point-in-time check is useful—but trending over time is where the real insight lives. Look for gradual degradation, not just obvious breakage.

Tuesday, March 31, 2026

The Quiet Erosion of Db2 for z/OS Expertise

There’s a shift happening in Db2 for z/OS environments that isn’t getting nearly enough attention. It’s not a new feature. It’s not a new release. It’s not even about modernization.

It’s about knowledge. And more specifically, the steady erosion of institutional knowledge. I’ve been talking with Db2 professionals across organizations, and the pattern is remarkably consistent. Teams are smaller. The work hasn’t shrunk. And the depth of experience is thinning out in ways that are subtle, but significant.

Let’s start with staffing.

It’s not unusual to hear about Db2 teams that have gone from a dozen or more DBAs down to a handful. For example, a BMC blog post describes one team that dropped from 15 DBAs to 5 while supporting essentially the same application portfolio. That’s not just a reduction in headcount. That’s a two-thirds reduction in accumulated experience.

And experience matters in ways that are hard to quantify. Because the most valuable Db2 knowledge is rarely documented. It lives in people’s heads. Things like:

  • Why that index was created
  • Why that query was stabilized
  • Why those statistics were manually modified in the Db2 Catalog
  • Why that subsystem parameter was set the way it was

When the people with that institutional knowledge leave, the system doesn’t stop running. But something important is lost. The understanding of why things work the way they do. And that’s where the real risk begins.

We’re also seeing a shift in how responsibilities are handled. Traditional DBA roles are being fragmented. Some responsibilities are automated. Others are pushed to developers. Some simply fall through the cracks. The result is that fewer people have a holistic understanding of the entire Db2 environment. And that kind of fragmentation creates blind spots.

At the same time, new talent is entering the workforce. That’s a good thing. But they’re coming in with a very different skill profile. They know "modern" languages. They understand distributed systems. They’re comfortable with cloud platforms.

But deep Db2 for z/OS internals? Buffer pool behavior? Access path stability? The nuances of statistics and optimization? Those skills aren’t widely taught. And they aren’t quickly learned. So while headcount may be replaced, expertise is not.

And then there’s knowledge transfer. Most organizations recognize the problem. There are plans. There are mentoring initiatives. There are documentation efforts. But in practice, knowledge transfer is often inconsistent at best. I've worked at shops where a long-tenured Db2 DBA or sysprog retired without fully sharing their expertise. This can happen because it is difficult to transfer decades of knowledge in a couple of weeks or months. It can happen because time is not allotted for the transfer. And sometimes it happens because the retiree wants to feel needed, so they don't share unless absolutely required.

The bottom line is that it takes time and focus. And it requires that experienced professionals are available long enough to pass along what they know. Too often, that doesn’t happen. And so the knowledge simply walks out the door.

Perhaps the most concerning aspect of all this is how quietly it happens. Systems continue to run. Applications continue to process transactions. SLAs are still met. On the surface, everything looks fine. But underneath, there’s a gradual shift.

Decisions become more reactive. Tuning becomes more superficial. Defaults replace judgment. Over time, “optimal” becomes “good enough.” And “good enough” has a way of becoming the standard. Until something breaks. Or performance degrades in ways that no one quite knows how to diagnose. Or a modernization effort uncovers layers of complexity that no one fully understands anymore.

This isn’t a call for alarm. But it is a call for awareness. If you’re running Db2 for z/OS, the question isn’t just whether your systems are stable today. It’s whether your organization still understands them deeply enough to keep them stable tomorrow.

Because institutional knowledge isn’t something you can rebuild overnight.

And once it’s gone, you don’t just lose the past.

You lose the ability to make better decisions in the future.

Wednesday, March 25, 2026

The Silent Performance Killers in Db2 for z/OS: What You’re Probably Not Monitoring

If you’ve been working with Db2 for z/OS long enough, you’ve likely experienced this scenario.

Everything looks fine. CPU is within expected ranges. Buffer pools aren’t thrashing. Locking isn’t out of control. The dashboards are green. Nothing is obviously broken.

And yet, users start to complain. Batch jobs run a little longer. Online response times stretch just enough to be noticeable. Not a crisis—but not quite right, either.

So you dig in. At first, nothing stands out. But as you peel back the layers, you discover the truth: performance didn’t fall off a cliff. It eroded. Quietly. Gradually. Almost invisibly.

These are the silent performance killers. And in many Db2 for z/OS environments, they’re not being monitored closely enough.

The Illusion of Stability

One of the strengths of Db2 for z/OS is its stability. With features like plan management, access path stability, and mature instrumentation, it is entirely possible to run a system that appears steady for long periods of time.

But stability can be deceptive.

When access paths remain “stable,” it is easy to assume they remain “optimal.” When performance metrics stay within historical ranges, it is tempting to believe everything is under control.

In reality, small inefficiencies can accumulate. A slightly suboptimal access path here. A bit of extra I/O there. A few more GETPAGEs than last quarter. Individually insignificant. Collectively impactful.

And because the degradation is gradual, it often escapes notice... until it becomes difficult to ignore.

Predicate Processing: Stage 1 vs. Stage 2 Still Matters

It is fashionable in some circles to assume that the old distinctions no longer matter. But predicate processing still plays a critical role in performance.

Stage 2 predicates, non-indexable expressions, and function-wrapped columns can quietly force Db2 to do more work than necessary. Even when an index is present, it may not be used as efficiently as expected.

What makes this particularly insidious is that the SQL hasn’t necessarily changed. But data distribution might have. Or perhaps a new release introduced subtle optimizer behavior differences. Or a developer added a seemingly harmless function to a predicate.

The query still runs. It just runs a little slower. Multiply that across hundreds or thousands of query executions, and the cost can become significant.

Statistics Drift: RUNSTATS Is Not a “Set It and Forget It”

Most Db2 shops run RUNSTATS. That’s not the problem. The problem is assuming that any RUNSTATS is sufficient.

Over time, data changes. Skew increases. New values appear. Old assumptions become invalid. Yet many organizations rely on static RUNSTATS profiles that no longer reflect reality. When statistics drift, the optimizer’s decisions drift with them.

Access paths that were once ideal become merely acceptable. Then marginal. Then problematic. But because RUNSTATS is still running, there is a false sense of confidence.

The real question is not whether RUNSTATS is being executed. It is whether the right statistics are being collected at the right time with the right level of detail.

zIIP Offload: Are You Getting What You Expect?

zIIP utilization is often viewed as a win when the numbers look high. But high utilization does not necessarily mean optimal utilization.

Some workloads that should be offloaded are not. Others are only partially eligible. In some cases, system or application changes inadvertently reduce zIIP eligibility without anyone noticing.

The result is subtle but important: more work shifts back to general-purpose CPUs.

You may not see a dramatic spike. But you may see a steady increase over time—one that is difficult to explain if you are not explicitly tracking zIIP efficiency at a granular level.

Buffer Pool Health: Beyond Hit Ratios

Buffer pool hit ratios have long been a staple metric. But by themselves, they can be misleading.

A high hit ratio does not guarantee efficient memory usage. It is entirely possible to have acceptable hit ratios while still experiencing excessive page churn, suboptimal page residency, or inefficient object placement.

As workloads evolve, the way data is accessed changes. Tables grow. Index usage shifts. New applications introduce different access patterns.

If buffer pools are not periodically re-evaluated and tuned in light of these changes, inefficiencies can creep in unnoticed.

Again, nothing breaks. Things just get a little slower.

The Application Factor

Perhaps the most significant, and likely the least visible, source of silent degradation is the application layer. Modern development practices introduce new challenges. ORMs generate SQL that is technically correct but not always efficient. Microservices increase the number of distinct access patterns. Dynamic SQL proliferates. And often, these changes occur outside the traditional DBA change control process.

The DBA sees the symptoms... more executions, different access paths, increased resource consumption... but not always the cause. Without tighter collaboration between development and database teams, these inefficiencies can persist indefinitely.

Plan Stability: A Double-Edged Sword

Plan stability features are invaluable. They protect against sudden access path regressions and provide a safety net during change. But they can also create complacency.

When a plan is “locked in,” it is easy to stop questioning whether it is still the best plan. Over time, as data and workloads evolve, a once-optimal access path may no longer be ideal.

Yet it persists. Not because it is the best choice, but because it is thought to be the safest.

Periodic review is essential. Stability should not mean stagnation.

What Should You Be Doing?

The answer is not to monitor more metrics indiscriminately. It is to monitor the right things with the right perspective. Look for trends, not just thresholds. Compare performance over time, not just against static baselines. Question assumptions that have gone unchallenged for years.

Examine access paths periodically—even for stable workloads. Revisit RUNSTATS strategies. Validate zIIP expectations. Reassess buffer pool configurations.

And perhaps most importantly, engage with application teams. Understand how the workload is changing, not just how it is performing.

Final Thoughts

The most dangerous performance problems in Db2 for z/OS are not the ones that cause immediate failures. They are the ones that quietly erode efficiency over time.

They do not trigger alarms. They do not demand urgent attention. They simply persist. Until one day, “everything looks fine” is no longer true. And by then, the fix is far more involved than it would have been if the problem had been caught earlier.

The challenge... and the opportunity... for today’s DBA is to look beyond the obvious. To question the comfortable. And to recognize that in a mature, stable system, the biggest risks are often the ones you are not actively watching.