Hello everybody, and welcome to part 5 in my ongoing series where we take a look at The Most Misunderstood Features of DB2. You can find the earlier installments in this series here (Part 1, Part 2, Part 3, Part 4).
Today’s topic, choosing an appropriate clustering index, might be a bit confusing to some of you. So let’s start at the beginning and describe what clustering is.
What is Clustering?
In DB2 for z/OS, you use an
index to indicate how DB2 should try to store table space data physically on
disk. This is called clustering. A DB2 index is a clustering index if the CLUSTER keyword is specified when the
index is created. Clustering causes inserted rows to be stored contiguously in
sequence whenever possible. Additionally, when the table space is reorganized, the data will be sequenced according to the clustering index. Since there can
only be one physical sequence for data on disk, there can only be one
clustering index per table.
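Creating a clustering index is as simple as adding the CLUSTER keyword to your DDL. Here is a quick sketch (the index, table, and column names are hypothetical):

    -- Make XCUST2 the clustering index for the CUSTOMER table,
    -- so rows are stored in customer name sequence.
    CREATE INDEX XCUST2
        ON CUSTOMER (CUSTNAME)
        CLUSTER;

You can also change which index clusters the data later on using ALTER INDEX ... CLUSTER (and ALTER INDEX ... NOT CLUSTER on the old one); the new sequence takes full effect the next time the table space is reorganized.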
For tables that do not specify
APPEND YES and do not have hashing
defined, if you do not specify a clustering index, DB2 chooses to cluster the
data using the oldest existing index. Therefore, it is wise to explicitly
specify a clustering index instead of letting DB2 decide because you can almost
always choose better than the (basically random) choice DB2 makes.
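If you are not sure which index, if any, is acting as the clustering index for a table, the DB2 catalog can tell you. A quick sketch, assuming a query against SYSIBM.SYSINDEXES (CLUSTERING indicates whether CLUSTER was specified; CLUSTERED indicates whether the data is actually in clustering sequence):

    -- Which index (if any) clusters the data for this table?
    SELECT NAME, CLUSTERING, CLUSTERED, CLUSTERRATIOF
      FROM SYSIBM.SYSINDEXES
     WHERE TBCREATOR = 'MYSCHEMA'   -- hypothetical schema
       AND TBNAME    = 'CUSTOMER';  -- hypothetical table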
Clustering of data is important because it minimizes I/O for sequential access: when the rows being read reside on the same page (or a contiguous range of pages), fewer pages must be read than if the data were spread throughout the entire table space on “random” pages. And fewer I/Os means improved performance.
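To make that concrete, consider a hypothetical ORDER_T table clustered by ORDER_DATE. A date-range query such as the one below can read a short run of contiguous pages (and benefit from sequential prefetch) instead of hopping all over the table space:

    -- With ORDER_T clustered on ORDER_DATE, these rows sit on
    -- contiguous pages, minimizing the number of pages read.
    SELECT ORDERNO, CUSTNO, ORDER_AMT
      FROM ORDER_T
     WHERE ORDER_DATE BETWEEN '2024-01-01' AND '2024-01-31'
     ORDER BY ORDER_DATE;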
So then, how should a DBA go about deciding how to cluster the data? Well, the best answer is to analyze the various ways in which the data is to be accessed. This means reviewing the actual SQL, along with the frequency and importance of its execution patterns. Once the most important and most frequently executed sequential accesses are identified, you can cluster the data to optimize them.
Of course, the decision on how to cluster frequently must be made during the database design phase, when few, if any, SQL statements or access patterns are readily available. So the DBA sometimes makes a knee-jerk decision to just cluster based on the primary key (PK) and be done with it. This is usually not a good idea.
Why is Clustering by PK not a very good idea?
Think about what we just said about sequentially accessing data. How often is data sequentially accessed by PK? Sometimes it might be, perhaps if you are producing a report in order by PK, but even that example is not a very good one. Think about it: if you are producing a customer report, you probably do not create it in CUSTNO order, but rather in order by customer name (or something else more meaningful to humans than a key number).
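For example, a customer report is far more likely to look like the following hypothetical query; clustering on CUSTNAME rather than CUSTNO keeps this access sequential (and may let DB2 avoid a sort):

    -- Report in customer name order, not key order.
    SELECT CUSTNO, CUSTNAME, CITY, STATE
      FROM CUSTOMER
     ORDER BY CUSTNAME;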
So the best option is to
figure out the most common and important sequential access patterns and cluster
accordingly. But what if this information is not available? My assertion is
that clustering by Foreign Key (FK), instead of by PK, makes a lot more sense.
Usually an FK is on the many side of a one-to-many relationship. Think about what that means. When you join PK to FK, there will be one PK row being joined to multiple FK rows. If those multiple FK rows are all on the same page (or on contiguous pages) due to clustering, then we can significantly reduce I/O.
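Here is a quick sketch of clustering a child table by its FK instead of its PK (all table and column names are hypothetical):

    -- PK index on the child table: unique, but NOT the clustering index.
    CREATE UNIQUE INDEX XORD1
        ON ORDER_T (ORDERNO);

    -- FK index: cluster the child rows by their parent's key.
    CREATE INDEX XORD2
        ON ORDER_T (CUSTNO)
        CLUSTER;

    -- A typical PK-to-FK join now finds each customer's orders
    -- together on the same (or contiguous) pages:
    SELECT C.CUSTNO, C.CUSTNAME, O.ORDERNO, O.ORDER_DATE
      FROM CUSTOMER C
      JOIN ORDER_T O
        ON O.CUSTNO = C.CUSTNO;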
The Bottom Line
The bottom line here is that
you should think about (and even re-think) your general assumptions and why you
make any database design decision. Sometimes your immediate gut reaction won’t
be the right one. And it is always better to think with our heads instead of
our guts, right?