The Db2 Portal Blog: primary key

Wednesday, December 18, 2019

High Level Db2 Indexing Advice for Large and Small Tables

In general, creating indexes to support your most frequent and important Db2 SQL queries is a good idea. But the size of the table will be a factor in deciding whether to index at all and/or how many indexes to create.

For tables of more than 100 (or so) pages, it is usually best to define at least one index. This gives Db2 guidance on how to cluster the data. And, for the most part, you should follow the general advice of having a primary key for every table... and that means at least one unique index to support the primary key.
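As a minimal sketch of that advice, assuming a hypothetical CUSTOMER table in an explicitly created table space (the table, column, index, database, and table space names are all illustrative), a primary key and its supporting unique index might look like this in Db2 for z/OS:

```sql
-- Hypothetical objects; CUSTDB.CUSTTS is assumed to exist already.
CREATE TABLE CUSTOMER
  (CUSTNO    INTEGER      NOT NULL,
   CUSTNAME  VARCHAR(60)  NOT NULL,
   PRIMARY KEY (CUSTNO))
  IN CUSTDB.CUSTTS;

-- The unique index that enforces the primary key.
CREATE UNIQUE INDEX XCUST01
  ON CUSTOMER (CUSTNO);
```

Whether you must create the enforcing index yourself depends on how the table space was created (Db2 can create it for you when it creates the table space implicitly), so treat the second statement as a sketch of the explicit case.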

If the table is large (more than 20,000 pages or so), you need to perform a balancing act to limit the indexes to those absolutely necessary for performance. When a large table has multiple indexes, data modification performance can suffer. When large tables lack indexes, however, access efficiency will suffer. This fragile balance must be monitored closely. In most situations, more indexes are better than fewer indexes because most applications are query-intensive rather than update-intensive. However, each table and application will have its own characteristics and requirements.

For tables containing a small number of pages (up to 100 or so pages), consider limiting indexes to those required for uniqueness and perhaps to support common join criteria. This is a reasonable approach because such a small number of pages can be scanned as efficiently as, or more efficiently than, using an index.

For small tables you can add indexes when the performance of queries that access the table suffers. Test the performance of the query after the index is created, though, to ensure that the index helps. When you index a small table, increased I/O (due to index accesses) may cause performance to suffer when compared to a complete scan of all the data in the table.
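One way to check whether Db2 would even use a new index is to EXPLAIN the query before and after creating it. The sketch below assumes a PLAN_TABLE exists under your authorization ID and that statistics are current; the STATUS_CODE table, its columns, and the QUERYNO value are hypothetical.

```sql
-- Capture the access path for the query of interest.
EXPLAIN PLAN SET QUERYNO = 100 FOR
  SELECT CODE_DESC
  FROM   STATUS_CODE
  WHERE  CODE = 'A';

-- ACCESSTYPE 'R' indicates a table space scan; 'I' (and related codes)
-- indicates index access, with the index named in ACCESSNAME.
SELECT QUERYNO, TNAME, ACCESSTYPE, ACCESSNAME, MATCHCOLS, INDEXONLY
FROM   PLAN_TABLE
WHERE  QUERYNO = 100
ORDER BY QBLOCKNO, PLANNO;
```

EXPLAIN only shows the access path Db2 chose, not how long the query runs, so it complements measuring the actual elapsed time and I/O rather than replacing it.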

Friday, March 04, 2016

The Most Misunderstood Features of DB2 – Part 5: Choosing the Clustering Key

Hello everybody, and welcome to part 5 in my ongoing series where we take a look at The Most Misunderstood Features of DB2. You can find the earlier installments in this series here (Part 1, Part 2, Part 3, Part 4).

Today’s topic, on choosing an appropriate clustering index, might be a bit confusing to some of you. So let’s start at the beginning and describe what clustering is.

What is Clustering?

In DB2 for z/OS, you use an index to indicate how DB2 should try to store table space data physically on disk. This is called clustering. A DB2 index is a clustering index if the CLUSTER keyword is specified when the index is created. Clustering causes inserted rows to be stored contiguously in sequence whenever possible. Additionally, when the table space is reorganized the data will be sequenced according to the clustering index. Since there can only be one physical sequence for data on disk, there can only be one clustering index per table.

For tables that do not specify APPEND YES and do not have hashing defined, if you do not specify a clustering index, DB2 chooses to cluster the data using the oldest existing index. Therefore, it is wise to explicitly specify a clustering index instead of letting DB2 decide because you can almost always choose better than the (basically random) choice DB2 makes.
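As a minimal sketch of naming the clustering index explicitly, assuming a hypothetical CUSTOMER table that is usually read in customer-name order (the table, column, and index names are illustrative):

```sql
-- Assumes CUSTOMER already exists with a CUSTNAME column.
-- CLUSTER marks this as the table's one explicit clustering index.
CREATE INDEX XCUST02
  ON CUSTOMER (CUSTNAME)
  CLUSTER;
```

Rows that are already in the table are not resequenced by this statement; that happens the next time the table space is reorganized.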

Clustering matters because, when data is accessed sequentially, rows that sit on the same page (or range of pages) can be read with fewer I/Os than rows scattered across “random” pages throughout the table space. And fewer I/Os means improved performance.

So then, how should a DBA go about deciding how to cluster the data? Well, the best answer is to analyze the various ways in which the data is to be accessed. This means reviewing the actual SQL along with the frequency and importance of the execution patterns. Once the most important and most frequently executed sequential accesses are identified, you can choose to cluster the data accordingly. That way, the accesses that matter most are the ones that clustering optimizes.

Of course, frequently the decision on how to cluster is made during the database design phase when few, if any, SQL statements or access patterns are readily available. So the DBA sometimes makes a knee-jerk decision to just cluster based on the primary key (PK) and be done with it. This is usually not a good idea.

Why is Clustering by PK not a very good idea?

Think about what we just said about sequentially accessing data. How often is data sequentially accessed by PK? Sometimes this might be the case, maybe if you are producing a report in order by PK, but even that example is not a very good one. Think about it. If you are producing a customer report you probably do not create it in CUSTNO order but probably in order by customer name (or something more interesting to humans than a key number).

So the best option is to figure out the most common and important sequential access patterns and cluster accordingly. But what if this information is not available? My assertion is that clustering by Foreign Key (FK), instead of by PK, makes a lot more sense. Usually a FK is on the many side of a one-to-many relationship. Think about what that means. When you join PK to FK there will be one PK row being joined to multiple FK rows. If those multiple FK rows are all on the same page (or on contiguous pages) due to clustering, then we can reduce the I/O needed to satisfy the join.
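The idea can be sketched as follows, assuming a hypothetical parent table CUSTOMER (with primary key CUSTNO and a CUSTNAME column) and a hypothetical child table ORDERS. All object names and the literal customer number are illustrative.

```sql
-- Child table on the "many" side of the relationship.
CREATE TABLE ORDERS
  (ORDERNO     INTEGER  NOT NULL,
   CUSTNO      INTEGER  NOT NULL,
   ORDER_DATE  DATE     NOT NULL,
   PRIMARY KEY (ORDERNO),
   FOREIGN KEY (CUSTNO) REFERENCES CUSTOMER (CUSTNO));

-- Unique index to support the primary key (deliberately not clustering).
CREATE UNIQUE INDEX XORD01
  ON ORDERS (ORDERNO);

-- Clustering index on the foreign key: Db2 tries to keep each
-- customer's orders on the same or adjacent pages.
CREATE INDEX XORD02
  ON ORDERS (CUSTNO)
  CLUSTER;

-- The PK-to-FK join that benefits: one CUSTOMER row, many ORDERS rows.
SELECT C.CUSTNAME, O.ORDERNO, O.ORDER_DATE
FROM   CUSTOMER C
       INNER JOIN ORDERS O
               ON O.CUSTNO = C.CUSTNO
WHERE  C.CUSTNO = 12345;
```

With ORDERS clustered by ORDERNO instead, the orders for one customer could be spread across as many pages as there are orders.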

The Bottom Line

The bottom line here is that you should think about (and even re-think) your general assumptions and why you make any database design decision. Sometimes your immediate gut reaction won’t be the right one. And it is always better to think with our heads instead of our guts, right?

Sunday, September 01, 2013

Top Ten Common SQL Mistakes (with DB2 for z/OS)

There are many different types of development mistakes that can be made when you are coding a complex application system. But probably the most pervasive mistakes being made when developing an application using DB2 for z/OS are SQL mistakes... well, actually, that statement is probably true regardless of the DBMS, but I am going to focus on DB2 for z/OS given that this is a blog that focuses on DB2 and mainframe issues.

1. Assuming an ORDER BY is not necessary for ordered results - if you want your results always to be returned in a specific order, you must include the ORDER BY clause. If you fail to do so, a future rebind can change the access path and also possibly change the order of your result set.
2. Forgetting the NULL indicator - when your query can return a null, be sure to include a null indicator or you will get a -305 SQLCODE (22002 SQLSTATE). And be sure to check the null indicator to see if the data is null or not!
3. Incorrect expectations when using the NOT IN predicate with NULLs - Nulls can be confusing when using the IN predicate. For example, what if we want to find all Colors that are not assigned to a particular Product using a query like the one shown below? The problem arises when P.color can be NULL. If the subquery returns even one NULL, the NOT IN predicate can never evaluate to TRUE (it is either FALSE or UNKNOWN), so the result set is always empty.

SELECT C.color
FROM Colors AS C
WHERE C.color NOT IN (SELECT P.color
                      FROM Products AS P);

4. Coding predicates appropriately in Outer Joins - Sometimes it can be difficult to figure out how to code outer joins appropriately if you don't code them very often. Terry Purcell has written a nice article on the topic that can be found here.
5. Not coding a cursor for a multi-row result - When more than one row can be returned by your query you must use a cursor (or specify FETCH FIRST 1 ROW ONLY).
6. Recompiling but not binding - if you make changes to the SQL, you have to BIND it again. Otherwise the program will fail at run time, typically with a timestamp mismatch error such as SQLCODE -818.
7. Forgetting to use single quotes around strings (instead of double quotes) - DB2 SQL expects single quotes around character strings.
8. Trying to modify a Primary Key column - you cannot modify a primary key. A primary key should be unique within the table and immutable.
9. Forcing dynamic SQL into static SQL (sometimes hundreds of static SQL statements) - you should analyze the type and nature of your database queries to determine whether they should be static or dynamic. Consider using the advice here (static SQL v. dynamic SQL) to guide you.
10. Asking for more data than you need (columns and/or rows) - specify only the columns that you actually need in your SELECT-list... and use WHERE clauses (predicates) to filter the data to just the data that you need before bringing it into the program. The less data DB2 needs to transfer from the database to your program, the more efficient things will be!
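For the NOT IN mistake described in the list above, here are two common ways to get the intended result. Both are sketches against the same Colors and Products tables used in the original query, not the only possible rewrites.

```sql
-- Option 1: NOT EXISTS. A NULL P.color never equals C.color, so it
-- simply fails to disqualify any row instead of poisoning the predicate.
SELECT C.color
FROM   Colors AS C
WHERE  NOT EXISTS (SELECT 1
                   FROM   Products AS P
                   WHERE  P.color = C.color);

-- Option 2: keep NOT IN, but exclude NULLs from the subquery.
SELECT C.color
FROM   Colors AS C
WHERE  C.color NOT IN (SELECT P.color
                       FROM   Products AS P
                       WHERE  P.color IS NOT NULL);
```

The two options differ only when C.color itself is NULL and Products has at least one non-NULL color: NOT EXISTS returns that row, while NOT IN does not.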

Monday, April 04, 2011

What About Surrogate Keys?

As is so often the case with my blog, today's topic came about as the result of an e-mail question I received from a DBA I know. His question was this:

"A great debate rages here about the use of ‘synthetic’ keys. We read all sorts of articles on the wild wild web but none seem to address the database performance impacts of designs using synthetic keys. I wondered if you could point me to any information on this…"

If you've ever Googled the term "surrogate key" you know the hornet's nest of opinions that swirls around "out there" about the topic. For those who haven't heard the term, here is my attempt at a quick summary: a surrogate key is a generated unique value that is used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.
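To make the definition concrete, here is a sketch of the same hypothetical table designed both ways. The table, column, and index names are illustrative, and an identity column is only one of several ways to generate a surrogate value.

```sql
-- Natural key: the business columns themselves form the primary key.
CREATE TABLE POLICY_NAT
  (REGION_CD  CHAR(3)      NOT NULL,
   BRANCH_NO  SMALLINT     NOT NULL,
   POLICY_NO  CHAR(10)     NOT NULL,
   HOLDER_NM  VARCHAR(60)  NOT NULL,
   PRIMARY KEY (REGION_CD, BRANCH_NO, POLICY_NO));

CREATE UNIQUE INDEX XPOLN01
  ON POLICY_NAT (REGION_CD, BRANCH_NO, POLICY_NO);

-- Surrogate key: a generated number is the primary key. The natural key
-- columns are still stored in every row.
CREATE TABLE POLICY_SUR
  (POLICY_ID  INTEGER      NOT NULL GENERATED ALWAYS AS IDENTITY,
   REGION_CD  CHAR(3)      NOT NULL,
   BRANCH_NO  SMALLINT     NOT NULL,
   POLICY_NO  CHAR(10)     NOT NULL,
   HOLDER_NM  VARCHAR(60)  NOT NULL,
   PRIMARY KEY (POLICY_ID));

CREATE UNIQUE INDEX XPOLS01
  ON POLICY_SUR (POLICY_ID);

-- Usually still needed so the natural key stays unique.
CREATE UNIQUE INDEX XPOLS02
  ON POLICY_SUR (REGION_CD, BRANCH_NO, POLICY_NO);
```

A child table would carry the three natural key columns as its foreign key in the first design and the single POLICY_ID column in the second.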

And here is the response I sent to my e-mail inquisitor:

I doubt that there is any “final word” on this topic. It has been raging on for years and years; some folks pro, others con. This Wikipedia article offers up a nice start: http://en.wikipedia.org/wiki/Surrogate_key

However, when I get to the performance area of this article I don’t think I agree. The article puts a lot of emphasis on there being fewer columns to join and therefore better performance. If you’ve got an index on those multiple columns how much “worse” will the performance be, really? Sure, the SQL is more difficult to write, but will a join over 4 or 5 indexed columns perform that much worse than a join on one indexed column? I suppose as the number of columns required for the natural key increases the impact could be greater (e.g. 10 columns???).

I guess I can see the argument if you are swapping a variable length key with a surrogate having a fixed length key – that should improve things!

Furthermore, consider this: the natural key columns are still going to be there; after all, they are naturally part of the data, right? So the surrogate (synthetic) key gets added to each row. This will likely reduce the number of rows per page (maybe not, but probably). And that, in turn, will negatively impact the performance of sequential access because more I/O will be required to read the “same” number of rows.

And what about the impact of adding data? If there are a significant number of new rows being added at the same time by different processes there will be locking issues as they all try to put the new data on the same page (unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp [that must be tested to avoid duplicates]).

The one thing that usually causes me to tend to favor natural keys is just that – they are natural. If the data is naturally occurring it becomes easier for end users to remember it and use it. If it is a randomly generated surrogate nobody will actually know the data. Yes, this can be masked to a great degree based on the manner in which you build your applications to access the data, but ad hoc access becomes quite difficult.

I guess the bottom line is that “it depends” on a lot of different things! No surprise there, I suppose.

Here are a few other resources with information (not so much on performance though) that you may or may not have reviewed already:

What do you think about natural keys versus surrogate keys? Surely some readers here have an opinion on this topic! If so, post it as a comment...