Assuring optimal performance is one of the most frequent tasks facing DB2 DBAs. Being able to assess the effectiveness and performance of the many aspects of your DB2 systems and applications is one of the most important skills a DBA must have. This can include evaluating online transaction response time, sizing the batch window and determining whether it is sufficient for the workload, managing end-to-end response time for distributed workloads, and so on.
But to accurately gauge the effectiveness of your current environment and setup, Service Level Agreements, or SLAs, are needed.
SLAs are derived from the practice of service-level management (SLM), which is the “disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at acceptable cost.”
In order to effectively manage service levels, a business
must prioritize its applications and identify the amount of time, effort, and
capital that can be expended to deliver service for those applications.
A service level is a measure of operational behavior. SLM
ensures that applications behave accordingly by applying resources to those
applications based on their importance to the organization. Depending on the
needs of the organization, SLM can focus on availability, performance, or both.
In terms of availability, the service level might be defined as “99.95 percent
uptime from 9:00 a.m. to 10:00 p.m. on weekdays.” Of course, a service level
can be more specific, stating that “average response time for transactions will
be 2 seconds or less for workloads of 500 or fewer users.”
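To make that concrete, here is a minimal sketch in Python of what checking measured behavior against such a service level might look like. The sample response times, outage minutes, and thresholds below are illustrative assumptions, not output from any real DB2 monitor.

# Hypothetical sketch: comparing measured behavior against an SLA definition.
# The sample data and thresholds below are illustrative assumptions only.

AVG_RESPONSE_TARGET_SECS = 2.0      # "average response time of 2 seconds or less"
UPTIME_TARGET_PCT = 99.95           # "99.95 percent uptime" during the covered window

# Pretend these came from a performance monitor or accounting records.
response_times_secs = [0.8, 1.2, 3.1, 0.9, 1.7, 1.1]   # per-transaction elapsed times
minutes_in_window = 13 * 60                             # 9:00 a.m. to 10:00 p.m.
minutes_of_outage = 2                                   # observed downtime in the window

avg_response = sum(response_times_secs) / len(response_times_secs)
uptime_pct = 100.0 * (minutes_in_window - minutes_of_outage) / minutes_in_window

print(f"Average response: {avg_response:.2f}s "
      f"({'meets' if avg_response <= AVG_RESPONSE_TARGET_SECS else 'misses'} SLA)")
print(f"Uptime: {uptime_pct:.3f}% "
      f"({'meets' if uptime_pct >= UPTIME_TARGET_PCT else 'misses'} SLA)")

The point is not the arithmetic, which is trivial, but that the targets exist in writing so there is something objective to measure against.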
For an SLA to be successful, all parties involved must agree
on stated objectives for availability and performance. The end users must be
satisfied with the performance of their applications, and the DBAs and
technicians must be content with their ability to manage the system to the
objectives. Compromise is essential to reach a useful SLA.
In practice, though, many organizations do not
institutionalize SLM. When new applications are delivered, there may be vague
requirements and promises of subsecond response time, but the prioritization
and budgeting required to assure such service levels are rarely tackled (unless, perhaps, the IT function is outsourced). It never ceases to amaze me how often SLAs simply do not exist. I always ask for them whenever I am asked to help track down performance issues or to assess the performance of a DB2 environment.
Let's face it, if you do not have an established agreement for how something should perform, and what the organization is willing to pay to achieve that performance, then how can you know whether or not things are operating efficiently enough? The simple answer is: you cannot.
A system assessment may be able to offer general advice on areas where performance gains can be achieved. But in such cases, where SLAs are non-existent, you cannot really deliver guidance on whether the effort to remediate the "problem areas" is worthwhile. Without SLAs in place you simply do not know whether current levels of performance are meeting agreed-upon service levels, because there are no agreed-upon service levels (and, no, "subsecond response time" is NOT a service level!). Additionally, you cannot know what level of spend is appropriate for any additional effort needed to achieve that potential performance, because no budget has been agreed upon.
Another potential problem is the context of the
service being discussed. Most IT professionals view service levels on an
element-by-element basis. In other words, the DBA views performance based on
the DBMS, the SysAdmin views performance based on the operating system or the
transaction processing system, and so on. SLM properly views service for an
entire application. However, it can be difficult to assign responsibility
within the typical IT structure. IT usually operates as a group of silos that
do not work together very well. Frequently, the application teams operate
independently from the DBAs, who operate independently from the SAs, and so on.
To achieve end-to-end SLM, these silos need to be broken
down. The various departments within the IT infrastructure need to communicate
effectively and cooperate with one another. Failing this, end-to-end SLM will
be difficult to implement.
The bottom line is that developing SLAs for your batch windows, transactions, and business processes is a best practice that should be implemented at every DB2 shop (indeed, you can remove DB2 from that sentence and it is still true).
Without SLAs, how will the DBA and the
end users know whether an application is performing adequately? Not every
application can, or needs to, deliver subsecond response time. Without an SLA,
business users and DBAs may have different expectations, resulting in
unsatisfied business executives and frustrated DBAs—not a good situation.
With SLAs in place, DBAs can adjust resources by applying
them to the most mission-critical applications, as defined in the SLA. Costs
will be controlled and capital will be expended on the portions of the business
that are most important. Without SLAs in place, an acceptable
performance environment will remain ever elusive. Think about it: without an SLA in
place, if the end user calls up and complains to the DBA about poor
performance, there is no way to measure the veracity of the claim or to gauge the
possibility of improvement within the allotted budget.
Recovery Time Objectives (RTOs)
Additionally, the effectiveness of backup and recovery should be a concern for all DB2 DBAs. This requires that RTOs be established. An RTO is basically an SLA for the recovery of your database objects. Without RTOs, it is difficult (if not impossible) to gauge the state of recoverability and
the efficacy of the image copies being taken.
Each database object should have an RTO assigned to it. The
RTO needs to take into account the same types of things that an SLA considers.
In other words, the business must prioritize its applications, DBAs must map
database objects to those applications, and together they must identify the
amount of time, effort, and capital that can be expended to minimize downtime
for those applications.
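As a purely illustrative sketch, that mapping can be captured very simply. Every object name, application, priority, and RTO value below is a hypothetical assumption, not a recommendation.

# Hypothetical mapping of database objects to applications and RTOs.
# All names and numbers here are illustrative assumptions only.
rto_by_object = {
    # (database, tablespace): application, business priority, RTO in minutes
    ("ORDERSDB", "ORDERTS"):   {"application": "Order Entry",   "priority": 1, "rto_minutes": 30},
    ("ORDERSDB", "HISTTS"):    {"application": "Order History", "priority": 3, "rto_minutes": 480},
    ("HRDB",     "PAYROLLTS"): {"application": "Payroll",       "priority": 2, "rto_minutes": 120},
}

# Pretend these estimates came from actual recovery tests or prior recovery runs.
estimated_recovery_minutes = {("ORDERSDB", "ORDERTS"): 45, ("HRDB", "PAYROLLTS"): 90}

for obj, info in rto_by_object.items():
    est = estimated_recovery_minutes.get(obj)
    if est is None:
        print(f"{obj}: no tested recovery estimate; {info['rto_minutes']}-minute RTO is unverified")
    elif est > info["rto_minutes"]:
        print(f"{obj}: estimated {est} minutes exceeds the agreed {info['rto_minutes']}-minute RTO")

Even a simple list like this forces the conversation about which objects matter most and what recovery time the business has actually agreed to pay for.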
Again, we are measuring operational behavior. The RTO
ensures that, when problems occur requiring database recovery, the application
outage is limited to what has been defined as tolerable for the business (in
terms of uptime and cost to provide that uptime).
Again, as with an SLA, for the RTO to be successful, all
parties involved must agree on stated objectives for downtime and time to
recovery. The end users must be satisfied with the potential duration of their
application’s downtime, and the DBAs and technicians must be content with their
ability to recover the system to the objectives. And again, cost is a
contributing factor. The RTO cannot simply be “I need my application up in 5
minutes and I can't spend any more money to do that,” because that is not
reasonable (or possible).
Without written RTOs, DBAs can exercise due diligence to make sure that database objects are backed up and recoverable, but they cannot really guarantee how quickly the data can be recovered (or to what point in time) when an outage occurs. Of course, the DBA can create and review backup policies and procedures to encourage a recoverable environment. But there is no way to ensure, with any consistency, that the backup plan can deliver the time-to-recovery needed by the business.
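Here is a rough sketch of the kind of sanity check that a written RTO makes possible: given the age of the most recent full image copy for an object (which a DBA might determine from SYSIBM.SYSCOPY) and an assumed log-apply rate, estimate whether recovery could plausibly finish within the agreed RTO. The restore time, log-apply rate, and copy age below are assumptions for illustration only, not measured values.

from datetime import datetime, timedelta

# All figures below are illustrative assumptions, not measured values.
RTO_MINUTES = 60                  # agreed recovery time objective for this object
RESTORE_MINUTES = 20              # assumed time to restore the last full image copy
LOG_APPLY_MINUTES_PER_HOUR = 1.5  # assumed minutes of log apply per hour since that copy

# Timestamp of the most recent full image copy (e.g., found in SYSIBM.SYSCOPY).
last_full_copy = datetime.now() - timedelta(hours=36)

hours_of_log = (datetime.now() - last_full_copy).total_seconds() / 3600
estimated_recovery = RESTORE_MINUTES + hours_of_log * LOG_APPLY_MINUTES_PER_HOUR

if estimated_recovery > RTO_MINUTES:
    print(f"Estimated recovery of {estimated_recovery:.0f} minutes exceeds the "
          f"{RTO_MINUTES}-minute RTO; consider more frequent image copies.")
else:
    print(f"Estimated recovery of {estimated_recovery:.0f} minutes fits within the RTO.")

Without an agreed RTO there is no value to plug in for that 60-minute target, and the whole exercise collapses into guesswork.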
So why don't organizations create SLAs and RTOs as a regular course of business?
And if your organization does create SLAs and RTOs, please share with us how doing so became a standard at your shop...