
Tuesday, August 08, 2023

Top 10 Db2 Performance Tips - No. 6 Monitoring and Performance Metrics

Monitoring and measuring performance metrics are essential practices for maintaining and optimizing the performance of an IBM Db2 environment. By regularly monitoring and analyzing performance data, DBAs can identify bottlenecks, proactively address performance issues, and make informed decisions to improve overall system efficiency.

It is important to monitor and measure performance metrics in order to gain insights into the behavior of the applications and databases in use at your site. By examining their behavior and identifying areas for improvement, you can improve the overall satisfaction of your customers and end users. 

Performance metrics provide valuable information about resource utilization, query response times, disk I/O, CPU usage, and other critical aspects of database performance. By tracking these metrics over time, administrators can detect patterns, identify trends, and pinpoint potential performance bottlenecks.

A Strategy

The first part of any Db2 performance management strategy should be to provide a comprehensive approach to the monitoring of the Db2 subsystems operating at your site. This approach involves monitoring not only the threads accessing Db2 and the SQL they issue, but also the Db2 address spaces. You can accomplish this task in three ways:

  • Batch reports run against Db2 trace records. While Db2 is running, you can activate traces that accumulate information, which can be used to monitor both the performance of the Db2 subsystem and the applications being run.
  • Online access to Db2 trace information and Db2 control blocks. This type of monitoring also can provide information on Db2 and its subordinate applications.
  • Sampling Db2 application programs as they run and analyzing which portions of the code use the most resources.
Using all three approaches is reasonable, provided you have analyzed the types of workloads in use and determined which type of monitoring is most effective for each.

You also need to establish a strategy for your organization's tuning goals, and those goals should be set using the discipline of service-level management (SLM). A service level is a measure of operational behavior. SLM ensures applications behave accordingly by applying resources to those applications based on their importance to the organization. Depending on the needs of the organization, SLM can focus on availability, performance, or both. In terms of availability, the service level can be defined as “99.95% uptime during the hours of 9:00 AM to 10:00 PM on weekdays.” Of course, a service level can be more specific, stating “average response time for transactions will be two seconds or less for workloads of strategic external clients.”

For a service level agreement (SLA) to be successful, all of the parties involved must agree upon stated objectives for availability and performance. The end users must be satisfied with the performance of their applications, and the DBAs and technicians must be content with their ability to manage the system to the objectives. Compromise is essential to reach a useful SLA.

Furthermore, the objectives of the SLA must be defined and measurable. For example, in the earlier SLA you must define what a “strategic client” is and differentiate strategic from nonstrategic. Although doing this may be difficult, failing to do so can result in worthless SLAs that cannot be achieved.

In the end, if you do not identify service levels for each transaction, then you will always be managing to an unidentified requirement. Without a predefined and agreed upon SLA, how will the DBA and the end users know whether an application is performing adequately? Without SLAs, business users and DBAs might have different expectations, resulting in unsatisfied business executives and frustrated DBAs. Not a good situation.

Db2 Traces

One of the first types of performance metrics to consider is monitoring based on reading trace information. You can think of a Db2 trace as a window into the performance characteristics of aspects of your Db2 workload. Db2 traces record diagnostic information describing particular events. As Db2 operates, it writes trace information that can be read and analyzed to obtain performance information.

Db2 provides six types of traces, each of which records a different type of information about the Db2 environment:

  • Accounting - Records performance information about the execution of Db2 application programs
  • Audit - Provides information about Db2 DDL, security, utilities, and data modification
  • Global - Provides information for the servicing of Db2
  • Monitor - Records data useful for online monitoring of the Db2 subsystem and Db2 application programs
  • Performance - Collects detailed data about Db2 events, enabling database and performance analysts to pinpoint the causes of performance problems
  • Statistics - Records information regarding the Db2 subsystem’s use of resources

You can start Db2 traces in two ways: by specifying the appropriate DSNZPARMs at Db2 startup or by using the -START TRACE command to initiate specific traces when Db2 is already running.
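
For example, to start an accounting trace for classes 1, 2, and 3 with the records written to SMF, a command along these lines can be issued (the classes and destination here are illustrative; verify them against your site's standards):

    -START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

The equivalent at startup is to set the SMFACCT subsystem parameter, such as SMFACCT=(1,2,3), with SMFSTAT serving the same purpose for the statistics trace.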

Each trace is broken down further into classes, each of which provides information about aspects of that trace. Additional information about the classes for each type of trace can be found in the IBM Db2 documentation.

You should understand what type of information is traced and the approximate overhead of each trace class before starting any of the Db2 traces. Some traces are better left off until or unless a performance problem occurs, at which point the trace can be started to capture details about the situation. Others are better to have turned on before problems occur. Keep in mind that you will have some trace classes (and IFCIDs) that are always started, and others that are only used in emergencies.
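
To see which traces and IFCIDs are currently active, you can issue the -DISPLAY TRACE command; for example:

    -DISPLAY TRACE(*)

This displays each active trace, along with its type, classes, and destination.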

What are IFCIDs?

Each trace class is associated with specific trace events known as Instrumentation Facility Component Identifiers (IFCIDs), pronounced “if-kids.” An IFCID defines a record that represents a trace event. IFCIDs are the single smallest unit of tracing that can be invoked by Db2.

In some cases, it can make sense to avoid activating trace classes altogether and start traces specifying only the IFCIDs needed. This way, you can reduce the overhead associated with tracing by recording only the trace events needed. 
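
For example, a common technique is to activate IFCID 318 by way of the monitor trace, which causes Db2 to collect statistics for statements in the dynamic statement cache:

    -START TRACE(MON) CLASS(1) IFCID(318)

The IFCID keyword of -START TRACE activates the listed trace events in addition to any IFCIDs contained in the specified classes.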

There are several hundred different IFCIDs. Most IFCIDs contain data fields that describe events pertinent to the event being traced. Some IFCIDs have no data; instead they merely mark a specific time. Certain trace events of extended durations require a pair of IFCIDs: one for the beginning of the event and another for the end. These pairs enable the computation of elapsed times. Other trace events that are not as lengthy require only a single IFCID. 

You can find the IFCIDs associated with each trace class in the IBM Db2 Command Reference manual in the section describing the -START TRACE command. But that manual does not describe the purpose of each IFCID. A list describing each IFCID can be found in the data set named SDSNIVPD(DSNWMSGS), which is part of the Db2 installation.

Db2 Performance Monitors

Several popular performance monitoring solutions are available for Db2 for z/OS. IBM’s Omegamon, BMC Software’s MainView, Broadcom’s Sysview, and Rocket Software's TMON are the leading performance monitors. When selecting a performance monitor, be sure to examine the online components as well as the batch reporting capabilities of the monitor. 

An online performance monitor is a tool that provides real-time reporting on Db2 performance statistics as Db2 operates. In contrast, a batch performance monitor reads previously generated trace records from an input data set. Most performance monitors today can be used to serve both purposes.

With online Db2 performance monitors, you can perform proactive performance management tasks. In other words, you can set up the monitor such that when it detects a problem, it alerts a DBA and possibly takes action on its own to resolve the problem.

The most common way to provide online performance monitoring capabilities is by online access to Db2 trace information in the MONITOR trace class. Some online Db2 performance monitors also provide direct access to Db2 performance data by reading the control blocks of the Db2 and application address spaces. This type of monitoring provides a “window” to up-to-the-minute performance statistics while Db2 runs. This information is important if a quick reaction to performance problems is required.
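
For example, an online monitor generally relies on a monitor trace written to an OP buffer, started with a command along these lines (the class shown is illustrative):

    -START TRACE(MON) CLASS(1) DEST(OPX)

DEST(OPX) directs the trace output to the next available OPn buffer, which the monitor then reads using the Instrumentation Facility Interface (IFI).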

Most online Db2 performance monitors provide a menu-driven interface accessible from TSO or VTAM. The monitor can start and stop traces as needed based on the menu options the user chooses. Consequently, you can reduce overhead and flatten the learning curve involved in understanding Db2 traces and how they correspond to performance reports.

Following are some typical uses of online performance monitors. Many online performance monitors can establish effective exception-based monitoring: when specified performance thresholds are reached, triggers can offer notification and take action. For example, you could set a trigger that fires when a specified number of lock suspensions occurs for the TXN00002 plan; when the trigger is activated, a message is sent to the console and a batch report is generated to provide accounting detail information for the plan. You can set any number of triggers for many thresholds. Here are some examples of thresholds you might choose to set:

  • When a buffer pool threshold is reached (PREFETCH DISABLED, DEFERRED WRITE THRESHOLD, or DM CRITICAL THRESHOLD).
  • For critical transactions, when predefined performance objectives are not met. For example, if TXN00001 requires subsecond response time, set a trigger to notify a DBA when the transaction receives a class 1 accounting elapsed time exceeding 1 second by more than 25%.

Many other types of thresholds can be established; most online monitors support this capability, so you can customize the thresholds for the needs of your Db2 environment.

Most online performance monitors can produce real-time EXPLAINs for long-running SQL statements. If an SQL statement takes a significant amount of time to process, an analyst can display the SQL statement as it executes and dynamically issue an EXPLAIN for the statement. Even as the statement executes, you can gain an understanding of why it is taking so long to run.
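
For example, after capturing the text of a long-running statement, a monitor can issue something along these lines (the QUERYNO value is arbitrary, and the SELECT is a placeholder for the captured statement):

    EXPLAIN PLAN SET QUERYNO = 100 FOR
        SELECT ... ;

    SELECT QBLOCKNO, PLANNO, METHOD, ACCESSTYPE, ACCESSNAME
    FROM PLAN_TABLE
    WHERE QUERYNO = 100
    ORDER BY QBLOCKNO, PLANNO;

The PLAN_TABLE rows then show the access path Db2 chose for the statement.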

Some online performance monitors provide historical trending. These monitors track performance statistics and store them in DB2 tables or in VSAM files with a timestamp. They also provide the capability to query these stores of performance data to assist in the following:

  • Analyzing recent history. Most SQL statements execute quickly, making it difficult to capture and display information about them as they execute. However, you might not want to wait until the SMF data is available to run a batch report. Quick access to recent past-performance data in these external data stores provides a type of online monitoring that is as close to real time as is usually needed.
  • Determining performance trends, such as a transaction steadily increasing in its CPU consumption or elapsed time.
  • Performing capacity planning based on a snapshot of the recent performance of DB2 applications.
Some monitors also run when Db2 is down to provide access to the historical data accumulated by the monitor.
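
As a sketch of such trend analysis, assuming a monitor that stores daily accounting summaries in a Db2 table (the table and column names here are hypothetical; every monitor uses its own schema), a query like the following could chart a transaction's CPU consumption over the past month:

    SELECT RUN_DATE, AVG(CPU_TIME) AS AVG_CPU, AVG(ELAPSED_TIME) AS AVG_ELAPSED
    FROM PERFHIST.ACCT_SUMMARY
    WHERE PLAN_NAME = 'TXN00001'
      AND RUN_DATE > CURRENT DATE - 30 DAYS
    GROUP BY RUN_DATE
    ORDER BY RUN_DATE;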

A final benefit of online Db2 performance monitors is their capability to interface with other z/OS monitors for IMS/TM, CICS, z/OS, or VTAM. This way, an analyst gets a view of the entire spectrum of system performance. Understanding and analyzing the data from each of these monitors, however, requires different skills. Quite often, one person cannot master all of these monitors.

Conclusion 

Although this blog entry was brief and there are many additional aspects to Db2 performance monitoring, monitoring and measuring performance metrics should be viewed as a vital requirement for all organizations using Db2 databases. By collecting and analyzing performance data, DBAs can detect performance bottlenecks, identify areas for improvement, and make informed decisions to enhance overall system efficiency.


Thursday, September 04, 2014

The Importance of SLAs and RTOs

Assuring optimal performance is one of the most frequently occurring tasks for DB2 DBAs. Assessing the effectiveness and performance of the various and sundry aspects of your DB2 systems and applications is one of the most important things a DBA must be able to do. This can include evaluating online transaction response time, sizing the batch window and determining whether it is sufficient for the workload, managing end-to-end response time of distributed workload, and so on.

But in order to accurately gauge the effectiveness of your current environment and setup, Service Level Agreements, or SLAs, are needed. SLAs are derived from the practice of service-level management (SLM), which is the “disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at acceptable cost.”

In order to effectively manage service levels, a business must prioritize its applications and identify the amount of time, effort, and capital that can be expended to deliver service for those applications.

A service level is a measure of operational behavior. SLM ensures that applications behave accordingly by applying resources to those applications based on their importance to the organization. Depending on the needs of the organization, SLM can focus on availability, performance, or both. In terms of availability, the service level might be defined as “99.95 percent uptime from 9:00 a.m. to 10:00 p.m. on weekdays.” Of course, a service level can be more specific, stating that “average response time for transactions will be 2 seconds or less for workloads of 500 or fewer users.”
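
It is worth doing the arithmetic on such objectives. A 9:00 a.m. to 10:00 p.m. weekday window is 13 hours a day, or 65 hours a week; at 99.95 percent uptime, the allowable downtime is 0.05 percent of 65 hours, or roughly 2 minutes per week. Running the numbers like this makes clear just how demanding a seemingly modest availability target can be.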

For an SLA to be successful, all parties involved must agree on stated objectives for availability and performance. The end users must be satisfied with the performance of their applications, and the DBAs and technicians must be content with their ability to manage the system to the objectives. Compromise is essential to reach a useful SLA.

In practice, though, many organizations do not institutionalize SLM. When new applications are delivered, there may be vague requirements and promises of subsecond response time, but the prioritization and budgeting required to assure such service levels are rarely tackled (unless, perhaps, the IT function is outsourced). It never ceases to amaze me how often SLAs simply do not exist. I always ask for them whenever I am asked to help track down performance issues or to assess the performance of a DB2 environment.

Let's face it, if you do not have an established agreement for how something should perform, and what the organization is willing to pay to achieve that performance, then how can you know whether or not things are operating efficiently enough? The simple answer is: you cannot.

It may be possible for a system assessment to offer up general advice on areas where performance gains can be achieved. But in such cases -- where SLAs are non-existent -- you cannot really deliver guidance on whether the effort to remediate the "problem areas" is worthwhile. Without SLAs in place, you simply do not know if current levels of performance are meeting agreed-upon service levels, because there are no agreed-upon service levels (and, no, "subsecond response time" is NOT a service level!). Additionally, you cannot know what level of spend is appropriate for any additional effort needed to achieve the potential performance, because no budget has been agreed upon.

Another potential problem is the context of the service being discussed. Most IT professionals view service levels on an element-by-element basis. In other words, the DBA views performance based on the DBMS, the SysAdmin views performance based on the operating system or the transaction processing system, and so on. SLM properly views service for an entire application. However, it can be difficult to assign responsibility within the typical IT structure. IT usually operates as a group of silos that do not work together very well. Frequently, the application teams operate independently from the DBAs, who operate independently from the SAs, and so on.

To achieve end-to-end SLM, these silos need to be broken down. The various departments within the IT infrastructure need to communicate effectively and cooperate with one another. Failing this, end-to-end SLM will be difficult to implement.

The bottom line is that the development of SLAs for your batch windows, transactions, and business processes is a best practice that should be implemented at every DB2 shop (indeed, you can remove DB2 from that sentence and it is still true).

Without SLAs, how will the DBA and the end users know whether an application is performing adequately? Not every application can, or needs to, deliver subsecond response time. Without an SLA, business users and DBAs may have different expectations, resulting in unsatisfied business executives and frustrated DBAs—not a good situation.

With SLAs in place, DBAs can adjust resources by applying them to the most mission-critical applications as defined in the SLA. Costs will be controlled and capital will be expended on the portions of the business that are most important to the business. Without SLAs in place, an acceptable performance environment will be ever elusive. Think about it: without an SLA in place, if the end user calls up and complains to the DBA about poor performance, there is no way to measure the veracity of the claim or to gauge the possibility of improvement within the allotted budget.

Recovery Time Objectives (RTOs)

Additionally, the effectiveness of backup and recovery should be a concern to all DB2 DBAs. This requires that RTOs (Recovery Time Objectives) be established. An RTO is basically an SLA for the recovery of your database objects. Without RTOs, it is difficult (if not impossible) to gauge the state of recoverability and the efficacy of image copies being taken. 

Each database object should have an RTO assigned to it. The RTO needs to take into account the same type of things that an SLA considers. In other words, the business must prioritize its applications, DBAs must map database objects to the applications, and together they must identify the amount of time, effort, and capital that can be expended to assure the minimization of downtime for those applications.

Again, we are measuring operational behavior. The RTO ensures that, when problems occur requiring database recovery, the application outage is limited to what has been defined as tolerable for the business (in terms of uptime and cost to provide that uptime).

Again, as with an SLA, for the RTO to be successful, all parties involved must agree on stated objectives for downtime and time to recovery. The end users must be satisfied with the potential duration of their application’s downtime, and the DBAs and technicians must be content with their ability to recover the system to the objectives. And again, cost is a contributing factor. The RTO cannot simply be “I need my application up in 5 minutes and I can’t spend any more money to do that,” because that is not reasonable (or possible).

Without written RTOs, DBAs can provide due diligence to make sure that database objects are backed up and recoverable, but cannot really provide any guarantee in terms of how quickly the data can be recovered (or perhaps, to what point in time) when an outage occurs. Of course, the DBA can create and review backup policies and procedures to encourage a recoverable environment. But there won't be any way to ensure with any consistency that the backup plan can deliver the time-to-recovery needed by the business.

So why don't organizations create SLAs and RTOs as a regular course of business? 

And if your organization does create SLAs and RTOs, please share with us how doing so became a standard at your shop...