Hello everybody, and welcome to part 5 in my ongoing series where we take a look at The Most Misunderstood Features of DB2. You can find the earlier installments in this series here (Part 1, Part 2, Part 3, Part 4).
Today’s topic, choosing an appropriate clustering index, might be a bit confusing to some of you. So let’s start at the beginning and describe what clustering is.
What is Clustering?
In DB2 for z/OS, you use an
index to indicate how DB2 should try to store table space data physically on
disk. This is called clustering. A DB2 index is a clustering index if the CLUSTER keyword is specified when the
index is created. Clustering causes inserted rows to be stored contiguously in
sequence whenever possible. Additionally, when the table space is reorganized, the data will be sequenced according to the clustering index. Since there can
only be one physical sequence for data on disk, there can only be one
clustering index per table.
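Creating a clustering index is as simple as adding the CLUSTER keyword to your DDL. Here is a quick sketch (the index, table, and column names are hypothetical):

    -- Make XCUST2 the clustering index for the CUSTOMER table,
    -- so rows are stored in customer name sequence.
    CREATE INDEX XCUST2
        ON CUSTOMER (CUSTNAME)
        CLUSTER;

You can also change which index clusters the data later on using ALTER INDEX ... CLUSTER (and ALTER INDEX ... NOT CLUSTER on the old one); the new sequence takes full effect the next time the table space is reorganized.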
For tables that do not specify
APPEND YES and do not have hashing
defined, if you do not specify a clustering index, DB2 chooses to cluster the
data using the oldest existing index. Therefore, it is wise to explicitly
specify a clustering index instead of letting DB2 decide because you can almost
always choose better than the (basically random) choice DB2 makes.
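If you are not sure which index, if any, is acting as the clustering index for a table, the DB2 catalog can tell you. A quick sketch, assuming a query against SYSIBM.SYSINDEXES (CLUSTERING indicates whether CLUSTER was specified; CLUSTERED indicates whether the data is actually in clustering sequence):

    -- Which index (if any) clusters the data for this table?
    SELECT NAME, CLUSTERING, CLUSTERED, CLUSTERRATIOF
      FROM SYSIBM.SYSINDEXES
     WHERE TBCREATOR = 'MYSCHEMA'   -- hypothetical schema
       AND TBNAME    = 'CUSTOMER';  -- hypothetical table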
Clustering of data is important because it minimizes I/O for sequential access: when the rows being read reside on the same page (or a contiguous range of pages), fewer pages must be read than if the data were spread throughout the entire table space on “random” pages. And fewer I/Os means improved performance.
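To make that concrete, consider a hypothetical ORDER_T table clustered by ORDER_DATE. A date-range query such as the one below can read a short run of contiguous pages (and benefit from sequential prefetch) instead of hopping all over the table space:

    -- With ORDER_T clustered on ORDER_DATE, these rows sit on
    -- contiguous pages, minimizing the number of pages read.
    SELECT ORDERNO, CUSTNO, ORDER_AMT
      FROM ORDER_T
     WHERE ORDER_DATE BETWEEN '2024-01-01' AND '2024-01-31'
     ORDER BY ORDER_DATE;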
So then, how should a DBA go about deciding how to cluster the data? Well, the best answer is to analyze the various ways in which the data is to be accessed. This means reviewing the actual SQL, along with the frequency and importance of its execution patterns. Once the most important and most frequently executed sequential accesses are identified, you can cluster the data to optimize them.
Of course, the decision on how to cluster frequently must be made during the database design phase, when few, if any, SQL statements or access patterns are readily available. So the DBA sometimes makes a knee-jerk decision to just cluster based on the primary key (PK) and be done with it. This is usually not a good idea.
Why is Clustering by PK not a very good idea?
Think about what we just said about sequentially accessing data. How often is data sequentially accessed by PK? Sometimes it might be, perhaps if you are producing a report in order by PK, but even that example is not a very good one. Think about it: if you are producing a customer report, you probably do not create it in CUSTNO order, but rather in order by customer name (or something else more meaningful to humans than a key number).
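For example, a customer report is far more likely to look like the following hypothetical query; clustering on CUSTNAME rather than CUSTNO keeps this access sequential (and may let DB2 avoid a sort):

    -- Report in customer name order, not key order.
    SELECT CUSTNO, CUSTNAME, CITY, STATE
      FROM CUSTOMER
     ORDER BY CUSTNAME;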
So the best option is to
figure out the most common and important sequential access patterns and cluster
accordingly. But what if this information is not available? My assertion is
that clustering by Foreign Key (FK), instead of by PK, makes a lot more sense.
Usually an FK is on the many side of a one-to-many relationship. Think about what that means. When you join PK to FK, there will be one PK row being joined to multiple FK rows. If those multiple FK rows are all on the same page (or on contiguous pages) due to clustering, then we can significantly reduce I/O.
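Here is a quick sketch of clustering a child table by its FK instead of its PK (all table and column names are hypothetical):

    -- PK index on the child table: unique, but NOT the clustering index.
    CREATE UNIQUE INDEX XORD1
        ON ORDER_T (ORDERNO);

    -- FK index: cluster the child rows by their parent's key.
    CREATE INDEX XORD2
        ON ORDER_T (CUSTNO)
        CLUSTER;

    -- A typical PK-to-FK join now finds each customer's orders
    -- together on the same (or contiguous) pages:
    SELECT C.CUSTNO, C.CUSTNAME, O.ORDERNO, O.ORDER_DATE
      FROM CUSTOMER C
      JOIN ORDER_T O
        ON O.CUSTNO = C.CUSTNO;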
The Bottom Line
The bottom line here is that
you should think about (and even re-think) your general assumptions and why you
make any database design decision. Sometimes your immediate gut reaction won’t
be the right one. And it is always better to think with our heads instead of
our guts, right?