Tuesday, September 30, 2008

A Perfect Storm?

There is something of a perfect storm brewing in the world of data today. The world is becoming more automated, more connected, more wireless, and more complex. The next wave of database administration is intelligent automation. I refer to this as implementing software scrubbing bubbles that “work hard, so you don’t have to.” (Remember that commercial!)

As more of the tasks required of DBAs become automated, the DBA will be freed to expand into other areas. So one front of this storm is the autonomic computing initiatives that automate DBA tasks. At the same time, IT professionals are being asked to know more about the business instead of just knowing the technology. So DBAs need to understand the business purpose and definition of the data they manage, as well as the technological underpinnings of the DBMS. The driving force here is predominantly regulatory compliance. This second front of the perfect storm will cause DBAs to work more closely with metadata to drive database archiving, data auditing, and security to ensure their organization complies with regulations like Sarbanes-Oxley, HIPAA, and others.

Regarding the wireless aspect of things, pervasive devices (PDAs, handhelds, cell phones, etc.) will increasingly interact with database systems. DBAs will need to get involved there to ensure successful data synchronization. And database systems will need to work seamlessly with disconnected data, such as data generated by RFID tags.

Yet another big database trend is technology "suck." By that I mean the way the DBMS sucks up technologies and functions that previously required you to purchase separate software. Remember when the DBMS had no ETL or OLAP functionality? Those days are gone. This will continue as the DBMS adds capabilities to tackle more and more IT tasks.

Another trend impacting DBAs will be a change in some of their roles as more and more of the recent DBMS features actually start being used in more production systems.

The net result of this perfect storm of changes is that data professionals are absolutely being required to do more... sometimes with less (less time, less money, less staff, etc.).

If you know the technology but are then required to know the business, this is doing more – much more. But the technology, in many cases, is also expanding. For example, DB2 9 incorporates native XML. Most DBAs are not XML savvy, but increasingly they will have to learn more about XML as the DBMS technology expands. And this is just one example.
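
As one small illustration of that expansion, here is a minimal sketch of what working with DB2 9's pureXML can look like. The table and data are hypothetical, and exact syntax details vary a bit by DB2 platform and version, so treat this as illustrative rather than definitive:

    CREATE TABLE CUSTOMER
      (CUSTNO   INTEGER NOT NULL,
       CUSTINFO XML);

    INSERT INTO CUSTOMER
      VALUES (1001,
              XMLPARSE(DOCUMENT '<customer><name>Acme</name><city>Houston</city></customer>'));

    -- find customers located in Houston by querying inside the XML column
    SELECT CUSTNO
    FROM   CUSTOMER
    WHERE  XMLEXISTS('$c/customer[city="Houston"]' PASSING CUSTINFO AS "c");

The specific syntax is not the point; the point is that the DBA now has to be comfortable with XML documents and XPath-style predicates right alongside relational SQL.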

Additionally, data is growing at an ever-increasing rate. Every year the amount of data under management increases (some analysts peg the compound annual rate of data growth at 125%) and in many cases the number of DBAs to manage that growing data is not increasing, and indeed, could be decreasing.

And budgetary limitations can cause DBAs to have to do more work, to more data, with fewer resources. When a company reduces its budget but demands more work, automation is an absolute necessity. Turning work over to the computer can help (although it is unlikely to solve every administrative issue). Sometimes IT professionals fight against the very thing they excel in – that is, automating work. If you think about it, every computer program is written to automate someone’s work – the writer (word processing), the accountant (financials, payroll, spreadsheets), and so on. This automation did not put the people whose work was automated out of a job; instead it made them more efficient. Yet, for some reason, there is a notion in the IT industry that automating IT tasks will eliminate jobs. You cannot automate a DBA out of existence – but you can make that DBA’s job more effective and efficient with DBA tools and autonomic computing.

And the sad truth of the matter is that there is still a LOT more that can, and should, be done in most companies. We can start with better automation of DBA tasks, but we shouldn't stop there!

Corporate governance is hot – that is, technologies to help companies comply with governmental regulations. Software to enable archiving for long-term data retention, auditing to determine who did what to which piece of data, and security to better protect data are all hot data technologies right now. But database security needs to improve, and technologies for securing and auditing data need to be implemented more pervasively.

Metadata is increasing in importance. As data professionals really begin to meld together technology and business, they find that metadata is imperative. But most organizations do not have a fully populated, up-to-date metadata repository that acts as a lexicon for their business data.

And finally, something that isn’t nearly hot enough is data quality and integrity. Tools, processes, and database options that can be used to make data more accurate and reliable are rarely implemented appropriately, so the data stored in our corporate databases is suspect. According to data quality guru Thomas C. Redman, poor data quality costs the typical company at least ten percent (10%) of revenue. That is a significant cost! Data quality is generally bad in most organizations – and more needs to be done to address that problem.

With all of these thoughts in mind, are you prepared to weather this perfect storm?

Tuesday, September 23, 2008

Who Did What to Which Data When... and How?

As the list of government regulations impacting IT grows, organizations must adapt to understand and comply with new rules. This increasing compliance pressure is particularly intense on data stored in corporate databases. As such, organizations need to be ever more vigilant about the techniques they use to protect their data and monitor access to it.

Database auditing, sometimes called data access auditing, is one technique growing in popularity as a response to the demands of regulatory compliance. At a high level, database auditing is basically a facility to track the use of database resources and authority. It can be used to help answer questions like “Who accessed or changed data?” and “What was actually changed?” and “When did it change?”
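
To make that a bit more concrete, here is a minimal sketch of one way to enable native auditing in a DB2 for z/OS environment. The table name is hypothetical, and trace classes and destinations vary by shop and version, so treat this as illustrative only:

    -- flag the table so that access to it generates audit trace records
    ALTER TABLE PROD.PAYROLL AUDIT ALL;

    -- then start an audit trace to actually capture those records, for example:
    -- -START TRACE(AUDIT) CLASS(4,5) DEST(SMF)

Capturing the audit records is only half the battle, though; you also need a workable way to report on them.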

But how you implement your database auditing, especially in a mainframe environment, will have a significant impact not just on the completeness of what you capture in the audit trail, but also on the performance and availability of your entire environment.

Join me on Wednesday, September 24, 2008 at 10:30 am, Central Daylight Time, for a free webinar where I will discuss the issues and requirements driving database auditing. This presentation can serve as a roadmap of sorts for your data access auditing needs.

Friday, September 05, 2008

The Most Important Thing is Recoverability

I know that many readers will question the title of this blog posting. But it is true. Oh, many DBAs think that managing performance is the most important thing they do, but they are confusing frequency with importance. Yes, many are managing performance more often than building backup plans – and they better be managing performance more frequently than they are actually recovering their databases or their company has big problems!

Anyway, why do I place recoverability at the very top of the DBA task list? Well, if you cannot recover your databases after a problem then it won’t matter how fast you can access them, will it? Anybody can deliver fast access to the wrong information. It is the job of the DBA to keep the information in our company’s databases accurate, secure, and accessible.

So what do we need to do to assure the integrity of our database data? First, we need to understand the availability needs of our data in terms of the business. In the event of a failure, how rapidly must we be able to recover from that failure? Keep in mind that the failure could be either physical, such as a failed disk drive, or logical, such as applying the wrong input to a process, which corrupts the database.

Only after we know the impact to the business can we develop an appropriate backup and recovery plan. We need service level agreements (SLAs) for recovery just like we have SLAs for performance. The recovery SLA, or Recovery Time Objective (RTO), needs to be expressed from an application perspective, such as “Time to restore application availability after a failure for application X cannot exceed 2 hours (or 10 minutes or …).”

To create effective RTOs you must be able to answer the question “What is the cost of not having this data available?” When we know the expectations of the business we can work to create a backup and recovery plan that matches the requirements. There are multiple techniques and methods for backing up and recovering databases. Some techniques, while more costly, can enhance availability by recovering data more rapidly.

It is imperative that the DBA team creates an appropriate recovery strategy for each database object. This requires mapping database objects to applications so we can adopt the proper strategy in accordance with RTOs. Some database objects will participate in multiple applications, and their recovery strategy will therefore be more complex.

Not all data is created equal. Some of your databases and tables contain data that is necessary for the core of your business. Other database objects contain data that is less critical or easily derived from other sources. Armed with this information -- and our RTOs -- a DBA can create a recovery plan that matches the needs of the business.

Establishing a reasonable backup schedule requires you to balance two competing demands: the need to take image copy backups frequently enough to assure reasonable recovery time, versus the need to take image copies infrequently enough so as not to interrupt daily business. All the while, keep in mind that if you make fewer image copies you will need to apply more log records during the recovery, and the recovery will take longer. The DBA must balance these competing objectives based on SLAs, usage criteria, and the capabilities of the DBMS.
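
A back-of-envelope example makes the trade-off visible (every number here is invented purely for illustration):

    recovery time  ≈  time to restore the last image copy
                    + time to apply the log records written since that copy

    daily image copies:   30 min restore + a day's worth of log (say 90 min)   ≈  2 hours
    hourly image copies:  30 min restore + an hour's worth of log (say 5 min)  ≈  35 minutes

The hourly copies buy a much shorter recovery, but at a much higher cost in CPU, I/O, and storage for the copies themselves, which is exactly the balance the DBA has to strike.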

When was the last time you re-evaluated and tested your backup and recovery plans? Oh, you may have looked at disaster plans, but have you examined your ability to recover locally? Do you know how long it would take to recover your most important primary customer tables, for example, if you took a hit in the middle of the day?

Regular recoverability health checking should be a standard documented responsibility for every DBA staff; and if you can acquire software to automate the health-check process, all the better.

SEGUS offers a nice option for checking the recoverability of your DB2 databases called Recovery AssuranceExpert. Using this automated tool you can monitor the recoverability of your DB2 environment, including DB2 settings (such as DB2 logging, buffer pools, and DSNZPARMs), recovery prerequisites, recovery service levels, and recovery time objectives for your database objects.

When was the last time you tested recovery? Are you sure you can recover your DB2 databases within a satisfactory timeframe? Wouldn’t you sleep better if you had a methodology and process in place for doing so? I know I would…

Thursday, September 04, 2008

Database Performance and Row Size

Recently I was reading through some posts in a database-related newsgroup or mailing list (actually, right now I can't remember which one it was). The conversation I was reading was in response to a question like "Does the number of columns or size of the row matter in terms of performance?"

Actually, the question asked what kind of a performance impact might be expected if a query was issued against two similar tables. The first table had (say) 20 columns, and the second table had the same 20 columns, as well as 35 additional columns.

Well, most of the basic responses were similar. The consensus was that as long as the query was going against the same columns then performance should be about the same. I disagree. Here is why.

You also need to factor in the I/O requests that are required to return the data. The DBMS performs I/O at the block (or page) level - this is so whether you return one row or millions of rows. For multi-row results, accessing data from the table with the wider row (more columns) will usually be less efficient. This is so because fewer rows will exist on each page (a 20-column row is shorter than a 55-column row, so more of the narrow rows can reside in a single, fixed-size block/page). The bigger the result set, the more pronounced the performance degradation can be (because more physical I/Os are required to retrieve the data).

Think about it this way. Is it faster to pull smaller peaches out of a basket than bigger peaches? That is about the same type of question and anybody can envision the process. Say you want 100 peaches. Small peaches fit 25 per basket; big peaches fit ten per basket. To get 100 small peaches you'd need to pull 4 baskets from the shelf. To get 100 big peaches you'd need to pull 10 baskets from the shelf. The second task will clearly take more time.
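
Translated back into database terms (again, with purely illustrative numbers): assume 4 KB pages and a query that returns 1,000,000 rows.

    narrow row, ~100 bytes  ->  ~40 rows per page  ->  ~25,000 pages to read
    wide row,   ~400 bytes  ->  ~10 rows per page  ->  ~100,000 pages to read

Same rows returned, roughly four times the I/O.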

Of course, the exact performance difference is difficult to calculate - especially over an online forum and without knowledge of the specific DBMS being used. But there will, more than likely, be a performance effect on queries when you add columns to a table.

Wednesday, September 03, 2008

When Not to Index

Answering a question I got via e-mail on indexing...

Every now and then I take the opportunity to blog about a question I get through e-mail. This time the question was:

When does it make more sense not to build an index for a DB2 table?

I'll attempt to answer this question for any SQL DBMS, not just for DB2:

First of all, this is a very open-ended question, so I will give a high-level answer. Let's start by saying that most of the time you will want to build at least one - and probably multiple - indexes on each database table that you create. Indexes are crucial for optimizing performance of SQL access. Without an index, queries must scan every row of the table to come up with a result. And that can be very slow.
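
A quick, hypothetical illustration (the table, column, and index names are invented):

    -- a query the application runs constantly
    SELECT LASTNAME, BALANCE
    FROM   CUSTOMER
    WHERE  CUSTNO = 12345;

    -- without an index the DBMS must scan every row of CUSTOMER;
    -- with this index it can navigate almost directly to the qualifying row
    CREATE INDEX XCUST01
        ON CUSTOMER (CUSTNO);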

OK, that being said, here are some times when it might make sense to have no indexes defined on a table:

  • When all accesses retrieve every row of the table. Because every row will be retrieved every time the table is accessed, an index (if used) would just add extra I/O and would decrease, not enhance, performance. Though not extremely common, you may indeed come across such tables in your organization.

  • For a very small table with only a few pages of data and no primary key or uniqueness requirements. A very small table (perhaps 10 or 20 pages) might not need an index because simply reading all of the pages is very efficient already.

  • When performance doesn't matter and the table is only accessed very infrequently. But, hey, when do you ever have that type of requirement in the real world?

Other than under these circumstances, you will most likely want to build one or more indexes on each table, not only to optimize performance, but also to ensure uniqueness, to support referential integrity, and perhaps to drive data clustering.

Of course, indexes do not come without cost. Indexes take up disk space, and the more indexes you add, the more space they consume. For some DBMS products, adding many indexes can increase the working set size and perhaps cause memory problems. Additionally, although indexes speed up queries, they degrade inserts and deletes, as well as any modification to indexed columns.

What do you think? Are there other situations where a table should have no indexes? Are there any pertinent high-level issues I missed? Feel free to add your thoughts and comments below!