Thursday, September 04, 2008

Database Performance and Row Size

Recently I was reading through some posts in a database-related newsgroup or mailing list (actually, right now I can't remember which one it was). The conversation I was reading was in response to a question like "Does the number of columns or size of the row matter in terms of performance?"

Actually, the question asked what kind of a performance impact might be expected if a query was issued against two similar tables. The first table had (say) 20 columns, and the second table had the same 20 columns, as well as 35 additional columns.

Well, most of the basic responses were similar. The consensus was that as long as the query was going against the same columns then performance should be about the same. I disagree. Here is why.

You also need to factor in the I/O requests that are required to return the data. The DBMS will perform I/O at the block (or page) level - this is so whether you return one row or millions of rows. For multi-row results, accessing data from the table with the wider row (more columns) will usually be less efficient. This is so because fewer rows will exist on each page (the row with 100 columns is smaller than the row with 150 columns so more rows can reside in a single, pre-sized block/page). The bigger the result set, the more pronounced the performance degradation can be (because more physical I/Os are required to retrieve the data).

Think about it this way. Is it faster to pull smaller peaches out of a basket than bigger peaches? That is about the same type of question and anybody can envision the process. Say you want 100 peaches. Small peaches fit 25 per basket; big peaches fit ten per basket. To get 100 small peaches you'd need to pull 4 baskets from the shelf. To get 100 big peaches you'd need to pull 10 baskets from the shelf. The second task will clearly take more time.

Of course, the exact performance difference is difficult to calculate - especially over an online forum and without knowledge of the specific DBMS being used. But there will, more than likely, be a performance effect on queries when you add columns to a table.

Wednesday, September 03, 2008

When Not to Index

Answering a question I got via e-mail on indexing...

Every now and then I take the opportunity to blog about a question I get through e-mail. This time the question was:

When does it make more sense not to build an index for a DB2 table?

I'll attempt to answer this question for any SQL DBMS, not just for DB2:

First of all, this is a very open-ended question, so I will give a high-level answer. Let's start by saying that most of the time you will want to build at least one - and probably multiple - indexes on each database table that you create. Indexes are crucial for optimizing performance of SQL access. Without an index, queries must scan every row of the table to come up with a result. And that can be very slow.

OK, that being said, here are some times when it might makes sense to have no indexes defined on a table:

  • When all accesses retrieve every row of the table. Because every row will be retrieved every time you want to use the table an index (if used) would just add extra I/O and would decrease, not enhance performance. Though not extremely common, you may indeed come across such tables in your organization.

  • For a very small table with only a few pages of data and no primary key or uniqueness requirements. A very small table (perhaps 10 or 20 pages) might not need an index because simply reading all of the pages is very efficient already.

  • When performance doesn't matter and the table is only accessed very infrequently. But, hey, when do you ever have those type of requirements in the real world?

Other than under these circumstances, you will most likely want to build one or more indexes on each table, not only to optimize performance, but also to ensure uniqueness, to support referential integrity, and perhaps to drive data clustering.

Of course, indexes do not come without cost. Indexes take up disk space and adding a lot of indexes will consume disk space. For some DBMS products, adding many indexes can impact the working set size and perhaps raise memory problems. Additionally, although indexes speed up queries they degrade inserts and deletes, as well as any modification to indexed columns.

What do you think? Are there other situations where a table should have no indexes? Are there any pertinent high-level issues I missed? Feel free to add your thoughts and comments below!

Friday, August 22, 2008

Upcoming Webinar on Data Breaches and Databases

Anyone who has been paying attention lately knows at least something about the large number of data breaches that have been in the news. Data breaches and the threat of lost or stolen data will continue to plague organizations until comprehensive plans are enacted to combat them. Although many of these breaches have not been at the database level, some have, and more will be unless better data protection policies and procedures are enacted on operational databases.

If you are interested in this topic I will be conducting a free webinar titled Data Breach Protection: From a Database Perspective on Wednesday, August 27, 2008 at 10:30 am CDT. This presentation will provide an overview of the data breach problem, providing examples of data breaches, their associated cost, and series of best practices for protecting your valuable production data.

This webinar offers you the opportunity to:
  • Understand the various laws that have been enacted to combat data breaches and the trends toward increasing legislation
  • Learn how to calculate the cost of a data breach based on industry best practices and research from leading analysts
  • Gain knowledge of several best practices for managing data with the goal of protecting the data from surreptitious or nefarious access (and/or modification)
  • Learn about the available techniques for securing, encrypting, and masking data to minimize exposure of critical data
  • Uncover new data best practices for auditing access to database data and for protecting data stored for long-term retention
Hope to see you on-line next Wednesday!

Monday, July 28, 2008

Selecting Every Other Row

One of the fun things about publishing is getting questions from readers that make you think. A recent question I received went something like this: "Can I get the odd and even number of rows from a DB2 table?"

Well, my first reaction was to think "this guy doesn't understand the way a SQL DBMS like DB2 works." The data in DB2 tables is not ordered, so there is no way to guarantee that the rows are odd or even numbered. While that observation may (or may not) have been true, it didn't help the guy. So I thought about it and came up with a possible work-around solution.

The first thing we have to do is to mimic row numbers in DB2. Until V9, DB2 did not support the row number construct (such as you can find in Oracle), and we'd like this to work for the versions in support today (V8 and V9).

So, to do this we start by using the COUNT(*) function and a table expression. A table expression is when you substitute SQL in place of the table in the FROM clause of another SQL statement. For example, consider this SQL:

SELECT  DEPTNO, ROWNUM
FROM DSN8810.DEPT A,
TABLE (SELECT COUNT(*) + 1 AS ROWNUM
FROM DSN8810.DEPT B
WHERE B.DEPTNO < A.DEPTNO) AS TEMP_TAB;

That puts a pseudo-row number on the table that we can access in our SQL predicates. If, say, we only want to return the even results, we could write the following query:

SELECT  DEPTNO, ROWNUM
FROM DSN8810.DEPT A,
TABLE (SELECT COUNT(*) + 1 AS ROWNUM
FROM DSN8810.DEPT B
WHERE B.DEPTNO < A.DEPTNO) AS TEMP_TAB
WHERE MOD(ROWNUM,2) = 0
ORDER BY ROWNUM;

The MOD function returns the remainder of dividing the second argument into the first. So, if the remainder is zero, we have an even number. So, this query returns every other row to the result set. If you want the odd rows only, change the predicate with the MOD function to this:

WHERE MOD(ROWNUM,2) <> 0

Of course, there is no guarantee that the same exact rows will be even (or odd) for subsequent executions of this query. It all depends how DB2 optimizes the query for execution. But it does provide a nice way to produce samples of the data (perhaps to populate a test bed of data).

Thursday, July 24, 2008

Free Webinar - Database Auditing for DB2 z/OS - July 29, 2008

Protecting corporate data is a requirement of doing business in today's regulatory and security-minded business environment. Protecting corporate data -- an especially sensitive data -- is a matter of knowing who is accessing data and what are they doing with it. There have been many solutions for addressing this need on distributed databases, but no reasonable solution for protecting mainframe data until now.

Learn all about an exciting new solution for auditing your DB2 for z/OS databases and resources - Guardium for Mainframes - at this free webinar on July 29, 2008.

Guardium for Mainframes provides 100% visibility into mainframe database activities without impacting normal business operations. This webinar will show you how to get better insight into database activity without the performance penalty of typical database trace utilities and without relying on inadequate log file data.

I'll be introducing the webinar and giving a quick overview of the issues, and Bill Baker, a senior software consultant with NEON Enterprise Software, will walk through a demonstration of the Guardium for Mainframes in action!