Friday, April 29, 2011
I'll Be Tweeting Live From IDUG
If you aren't planning to go, you can follow my Tweets to hear what is going on... and if you are attending the show, you can follow my Tweets to hear my perspective on things...
I arrive in Anaheim Tuesday afternoon, so I will miss the kickoff, but I'll be there the rest of the week.
Tuesday, April 26, 2011
100 Years of IBM
If you have anything at all to do with computers or information technology, you have something to thank IBM for. Watch this video to find out what!
Monday, April 04, 2011
What About Surrogate Keys?
As is so often the case with my blog, today's topic came about as the result of an e-mail question I received from a DBA I know. His question was this:
"A great debate rages here about the use of ‘synthetic’ keys. We read all sorts of articles on the wild wild web but none seem to address the database performance impacts of designs using synthetic keys. I wondered if you could point me to any information on this…"
If you've ever Googled the term "surrogate key" you know the hornet's nest of opinions that swirls around "out there" about the topic. For those who haven't heard the term, here is my attempt at a quick summary: a surrogate key is a generated unique value that is used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.
And here is the response I sent to my e-mail inquisitor:
I doubt that there is any “final word” on this topic. It has been raging on for years and years; so folks pro, others con. This Wikipedia article offers up a nice start: http://en.wikipedia.org/wiki/Surrogate_key
However, when I get to the performance area of this article I don’t think I agree. The article puts a lot of emphasis on there being fewer columns to join and therefore better performance.. If you’ve got an index on those multiple columns how much “worse” will the performance be, really? Sure, the SQL is more difficult to write, but will a join over 4 or 5 indexed columns perform that much worse than a join on one indexed column? I suppose as the number of columns required for the natural key increases the impact could be greater (e.g. 10 columns???)
I guess I can see the argument if you are swapping a variable length key with a surrogate having a fixed length key – that should improve things!
Furthermore consider this: the natural key columns are still going to be there, after all, they are naturally part of the data, right? So the surrogate (synthetic) key gets added to each row. This will likely reduce the number of rows per page (maybe not, but probably). And that, in turn, will negatively impact the performance of sequential access because more I/O will be required to read the “same” number of rows.
And what about the impact of adding data? If there are a significant number of new rows being added at the same time by different processes there will be locking issues as they all try to put the new data on the same page (unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp [that must be tested to avoid duplicates]).
The one thing that usually causes me to tend to favor natural keys is just that – they are natural. If the data is naturally occurring it becomes easier for end users to remember it and use it. If it is a randomly generated surrogate nobody will actually know the data. Yes, this can be masked to a great deal based on the manner in which you build your applications to access the data, but ad hoc access becomes quite difficult.
I guess the bottom line is that “it depends” on a lot of different things! No surprise there, I suppose.
Here are a few other resources with information (not so much on performance though) that you may or may not have reviewed already:
- http://www.agiledata.org/essays/keys.html
- http://www.techrepublic.com/forum/discussions/10-85582-3172364
- http://www.infobright.org/forums/viewthread/348/
- http://www.bcarter.com/intsurr1.htm
- http://weblogs.sqlteam.com/mladenp/archive/2009/10/06/Why-I-prefer-surrogate-keys-instead-of-natural-keys-in.aspx
- http://www.sqlservercentral.com/articles/Primary+key/70747/ - this one talks about SQL Server performance
What do you think about natural keys versus surrogate keys? Surely some readers here have an opinion on this topic! If so, post them as comments...
Wednesday, March 09, 2011
DB2 Symposium 2011
Today's blog post is about a great symposium dedicated to the topic of DB2. It is called, appropriately enough, the DB2 Symposium. DB2 Symposium is a three day training event with one day seminars presented by well-known DB2 consultants. I was fortunate enough to be asked to participate this year by the primary organizer of the event, Klaas Brant. For those of you who don't know him, Klaas is a well-respected DB2 consultant based in the Netherlands... and an all around great guy.
Why should I attend the DB2 Symposium you may ask? Don't IDUG and IOD provide everything I need in the way of events? Well, DB2 Symposium fills the gap between a conference and a multi-day training course. The DB2 Symposium is unique because you can participate for 1, 2, or 3 days, depending on your needs and budget.
Although it has not been to the USA the past few years, the DB2 Symposium is a regular, well-known event in Europe! And after a period of absence the DB2 Symposium is back in the USA.
The USA DB2 Symposium is happening soon, so you'll need to act fast if you want to participate. It occurs March 21-23, 2011 in the Dallas, Texas area. More precisely, at the Hilton Arlington (2401 East Lamar Boulevard, Arlington, Texas, USA 76006-7503). Each day the training sessions start at 9.00am and end at around 5.00pm.
But registration on site is not possible, you must pre-register online... so plan ahead!
My session is on March 21st and it is called DB2 Developer's Guide Comes Alive! This one day session, covers tips, techniques, and procedures you need to know in order to excel at administering and using DB2 on the mainframe.The material is based upon DB2 Developer's Guide, the best-selling DB2 for z/OS book on the market. Additionally, the course material will contain references to sections of the book for students to find additional material on each topic after the sessions. Topics to be covered will include:
- A performance tuning roadmap for managing DB2 application, database and system performance. You will learn SQL coding and tuning techniques, guidance for database optimization and reorganization, coverage of buffer pool settings and parameters for performance.
- Logical and physical database design recommendations for DB2, so you can build and maintain effective DB2 databases immediately. Includes discussion of standards, logical to physical translation, data types, usage of nulls, and more.
- Information and guidance on BINDing and REBINDing, along with a discussion of the most important parameters.
- Along the way we'll look at locking, access paths, statistics, indexing and more.
- And even though the current edition of the book covers through DB2 V8, this course adds coverage of some of the newer features added to DB2 in versions 9 and 10 that can boost your productivity and performance.
If you own the book already, bring it along and I'll be happy to autograph it for you. And then you can use it along with the course materials... and if you don't own it already, you'll probably want to grab a copy after attending the seminar... you can always find a link to buy my books on the front page of my web site at http://www.craigsmullins.com.
So register for the DB2 Symposium today... and I'll see you in Dallas, pardner!
Monday, February 21, 2011
Not Your Standard Sorting Requirement
At this point, some of you might be asking "What the heck is he talking about?" Fair enough. Let’s take a look at an example to bring the issue into focus.
Assume that you have a table containing transactions, or some other type of interesting facts. The table has a CHAR(3) column containing an abbreviation for the name of the day on which the transaction happened; let’s call this column DAY_NAME. So, for example, the DAY_NAME column would contain MON for Monday data, and so on.
Now, let’s further assume that we want to write queries against this table that orders the results by DAY_NAME. We’d want Sunday first, followed by Monday, Tuesday, Wednesday, and so on. How can this be done?
Well, the first step is usually to write the first query that comes to mind, or something like this:
SELECT DAY_NAME, COL1, COL2 . . .
FROM TXN_TABLE
ORDER BY DAY_NAME;
Of course, the results will be sorted improperly. ORDER BY will sort the results alphabetically; in other words: FRI MON SAT SUN THU TUE WED
This is what I mean by an irregular sorting requirement. Here we have an example that occurs commonly enough, but without an obvious immediate solution. Furthermore, many businesses and applications have similar requirements for which the business needs dictate a different sort order than strictly alphabetical or numeric. So what is the solution here?
Of course, one solution would be to design the table with an additional numeric or alphabetic column that would sort properly. By this, I mean that we could add a DAY_NUM column that would be 1 for Sunday, 2 for Monday, and so on. But this requires a database design change, and it becomes possible for the DAY_NUM and DAY_NAME to get out of sync if you are not careful.
There is another solution that is both elegant and does not require any change to the database. To implement this solution, all you need is an understanding of SQL and SQL functions -- in this case, the LOCATE function. Consider this SQL:
SELECT DAY_NAME, COL1, COL2 . . .
FROM TXN_TABLE
ORDER BY LOCATE('SUNMONTUEWEDTHUFRISAT',DAY_NAME);
The trick here is to understand how the LOCATE function works. The LOCATE function returns the starting position of the first occurrence of one string within another string.
So, in our example, LOCATE finds the position of the DAY_NAME value within the string 'SUNMONTUEWEDTHUFRISAT', and returns the integer value of that position. So, if DAY_NAME is WED, the LOCATE function returns 10. (Note: Some other database systems have a similar function called INSTR.) Sunday would return 1, Monday 4, Tuesday 7, Wednesday 10, Thursday 13, Friday 16, and Saturday 19. This means that our results would be in the order we require.
Of course, you can go one step further if you’d like. Some queries may need to actually return the day of week. You can use the same technique with a twist to return the day of week value, given only the day’s name. To turn this into the appropriate day of the week number (that is, a value of 1 through 7), we divide by three, use the INT function on the result to return only the integer portion of the result, and then add one:
INT(LOCATE('SUNMONTUEWEDTHUFRISAT',DAY_NAME)/3) + 1;
Let’s use our previous example of Wednesday again. The LOCATE function returns the value 10. So, INT(10/3) = 3 and add 1 to get 4. And sure enough, Wednesday is the fourth day of the week.
Summary
With a sound understanding of the features of DB2 SQL and a little imagination many irregular requirements are achievable using nothing more than SQL!