Wednesday, May 25, 2011

A Quick SQL Trick: Find The Number of Commas

Today's blog post is a short one. I was recently asked how to return a count of specific characters in a text string column. For example, given a text string, return a count of the number of commas in the string.

This can be done using the LENGTH and REPLACE functions as follows:

SELECT LENGTH(TEXT_COLUMN) - LENGTH(REPLACE(TEXT_COLUMN, ',', ''))

The first LENGTH function simply returns the length of the text string. The second LENGTH function in the expression returns the length of the text string after REPLACE has swapped every occurrence of the target character (in this case a comma) for an empty string, which removes it. The difference between the two lengths is the number of characters removed.

So, let's use a string literal to show a concrete example:

SELECT LENGTH('A,B,C,D') - LENGTH(REPLACE('A,B,C,D', ',', ''))

This will translate into 7 - 4... or 3. And there are three commas in the string.
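
If you want to try the trick against a whole column, here is a minimal sketch in Python using its built-in sqlite3 module. SQLite is only a stand-in for DB2 here (an assumption, chosen so the example runs anywhere), and the table name demo and its sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for DB2; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (text_column TEXT)")
conn.executemany(
    "INSERT INTO demo (text_column) VALUES (?)",
    [("A,B,C,D",), ("no commas here",), (",,",), ("",)],
)

# Length before, minus length after removing every comma, = number of commas.
rows = conn.execute(
    "SELECT text_column, "
    "       LENGTH(text_column) - LENGTH(REPLACE(text_column, ',', '')) "
    "FROM demo"
).fetchall()

# Map each string to its comma count for easy inspection.
comma_counts = dict(rows)
for text, count in rows:
    print(repr(text), count)
```

The same expression works for any single target character. For a target longer than one character you would also divide the difference by the length of the target.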

When confronted with a problem like this it is usually a good idea to review the list of built-in SQL functions to see if you can accomplish your quest using SQL alone.


Friday, May 13, 2011

DB2 -- What's in a Name?

Versions of DB2 exist for a large array of platforms, of which the mainframe (z/OS) is only one. Of course, it is my favorite one since I’ve been working on mainframe technology now for decades and have worked with DB2 since Version 1.

It used to be easy: DB2 meant IBM’s mainframe SQL database management system based on the relational model. But you can’t just say the term “DB2” any more and expect people to understand what you mean.

Today there are variations of DB2 that run on the iSeries (AS/400), on Linux, Unix, and Windows (LUW) platforms, and even one that runs on PDAs and smart phones called DB2 Everyplace. Not to mention the mainframe variations that run on z/OS, VM, and VSE.

These products are all collectively referred to by IBM as the DB2 Family. Individually, each DBMS is referred to as DB2, or sometimes DB2 Universal Database Server. There was a period of time when DB2 for LUW was called UDB and DB2 for z/OS was just called DB2. Then IBM tried to rebrand both as DB2 UDB. But that seems to have gone away several versions ago now.

The proper way to refer to any individual offering in the DB2 family is DB2 for (operating system) (for example, DB2 for z/OS or DB2 for Windows).

Different Code Bases

There are four distinct code bases for the products under the DB2 brand. The mainframe has its own code base, as does the iSeries, and VSE/VM. The fourth code base is for Linux, Unix, and Windows (LUW) platforms—and the other DB2 offerings (e.g. DB2 Everyplace) originate from this code base.

Having a separate code base means that each of these DB2 “products” was developed independently from the others. So, for example, the process used by DB2 for z/OS to optimize SQL differs from the process used by DB2 for Linux. Usually, though, the result is similar—an efficient SQL statement.

But keep in mind that there will be some differences between the DB2s.

Some of the Differences

It should be obvious that sharing the name DB2 does not make the different DB2 products “plug and play” commodities. There are some big differences among these products in their current releases. The biggest differences are relatively easy to detect and include the following:
  • Differences imposed due to operating system constraints
    (OS/400 versus z/OS versus AIX)
  • Back-level compatibility issues
  • Workstation orientation differences such as GUI interfaces and drag-and-drop menus
  • Subsystem-centric implementation (z/OS) versus database-centric implementation (workstation)
Most of these differences are minor and easy to handle. Indeed, IBM has slowly but surely been making these disparate implementations of DB2 more and more alike with each new release and version. The interface (or API) by which most people access any of the DB2 Family is SQL and there is broad compatibility among the SQL implementations of the members of the DB2 Family (though not 100 percent, of course).

A misconception “out there” in DB2-land is that the LUW platform drives new features, but a review of the changes that have been introduced to DB2 over the past several versions and releases does not bear that out. Some features are introduced on the mainframe first; others on the distributed platforms first.

Of the basic differences mentioned earlier, the only one that might not be obvious is the focus of the DBMS implementation. DB2 for LUW is database-centric. This implies that each new database carries its own system catalog with it. Additionally, it is not possible to simply access tables across different databases; distributed access is required.

On z/OS, DB2 is subsystem-centric. A single system catalog spans databases. Each subsystem has a unique identification, and you can create multiple databases within it. Distributed requests are not required to access databases within the same subsystem (or, indeed, across multiple subsystems in a data-sharing environment).

Another concept that is different at the workstation level is that of a directory. The DB2 for z/OS Directory houses DBMS system-related information regarding DBD structure, skeleton plan and skeleton package tables, RBA log ranges, and utility control data. The information cannot be updated by the user but is managed and controlled by DB2.

At the workstation level, a directory is another matter altogether. For example, the directory structure used by DB2 for LUW controls the overall environment.
  • The System Database Directory identifies the databases that can be accessed from the workstation and contains an entry for each local and remote one. Each database entry contains the database name, alias, entry type, and location.
  • One Volume Database Directory is allocated per disk drive that contains a workstation database. Each entry identifies the location of a specific database on the drive.
  • The Workstation Directory is used to make a connection to a remote database server. It is used in conjunction with the Database Connection Services Directory to make a connection to a remote host server.
  • The Database Connection Services Directory is used by DB2 Connect to make a connection to a remote host server.
Not only is it possible for the user to update these directories, it is required. The workstation directories define the environment of DB2 for LUW. Without the proper information recorded in these directories, DB2 might not function in the desired manner. The information in these directories is somewhat analogous to DB2 for z/OS DSNZPARMs and the SYSDDF system catalog tables.

Database Structures

Not all the objects available to DB2 for z/OS users are supported at the workstation level. For example, hardware-specific DB2 objects such as table spaces and storage groups are not available for DB2 on other platforms, at least not in the same way that mainframers are used to dealing with them. Partitioning and segmenting as it is done on z/OS is not done on other platforms.

However, DB2 for LUW does provide a feature known as a segmented table. But this is not the same concept as a DB2 for z/OS segmented table space. DB2 for LUW segmented tables are used to span volumes, enabling DB2 to get around file size limitations.

The file structure used for databases differs from platform to platform. For example, DB2 for z/OS uses VSAM Linear Data Sets (LDS) or Entry Sequenced Data Sets (ESDS). A database deployed on DB2 for LUW uses two files for table data: one for normal data and a second to store long fields. These workstation files are flat files, not VSAM files.

Although tables are basically the same for all of the DB2 environments, not all of the DDL options are provided in all of the environments.

Optimizer Differences

One of the most significant benefits of relational databases is that they provide built-in optimization. The DB2 for z/OS optimizer is well-known to mainframe DB2 users, but how similar are the other DB2 optimizers?

DB2 for LUW uses the latest and greatest optimization technology from IBM -- the Starburst optimizer (which arose from IBM’s Almaden research lab). Starburst is a database optimization research project that has been covered quite extensively in the academic press.

As one example of the difference, consider that the DB2 for LUW optimizer has varying levels of optimization that can be selected by the user. This concept is not implemented in DB2 for z/OS.

Although some Starburst technology will find its way to DB2 for z/OS, the mainframe DB2 optimizer will not be completely replaced by Starburst technology. Doing so would not be wise because the DB2 for z/OS optimizer has been finely tuned for its environment over the course of almost three decades.

Another interesting tidbit is that DB2 for iSeries provides an access method for programmers in which they can bypass the relational engine. This is not encouraged, but it is available.

Other Differences

Other differences exist between the different implementations of DB2. Some of these are caused by the different release cycles IBM has created for the differing platforms. The bottom line is that you need to be aware that there are differences between the DB2s on different platforms. Whenever you use a specific implementation of DB2, you need to be aware of the features it supports that other DB2 platforms do not, as well as the features it does not support that other DB2 platforms do support.

Packaging and Naming Issues

The actual name of the DB2 edition can be tricky to master on non-mainframe platforms. On the mainframe you just say “I want DB2,” and that is what you get. Well, almost. You also have to decide whether you want IBM’s utilities or not, too.

But things are more difficult in the LUW world. The following packages are all available for DB2 on Linux, Unix, and Windows:

DB2 Workgroup Server Edition (WSE) is a multi-user, single-host DBMS aimed at the departmental level. It should be deployed for smaller systems with a limited number of users.

DB2 Enterprise Server Edition (ESE) is the highest-level DB2 edition, with support for intra-partition parallelism (the database engine can process segments of an SQL statement in parallel) and inter-partition parallelism (a query is processed in parallel across all of the nodes). ESE offers Partitioning and Clustering options as add-on features. So, this is the enterprise DB2.

DB2 Advanced Enterprise Server Edition (AESE) sounds like a step up from ESE, and it is, kind of... but not really in terms of key DBMS technology. The “Advanced” means that IBM integrates Optim and InfoSphere technologies into the product.

DB2 Express Edition is targeted at entry level users at a low price point. Small shops, partners, and new users can build applications on top of DB2 Express.

And DB2 Express-C is IBM’s “free” DBMS offering providing all the “core” capabilities of DB2 at no charge. So why use an open source DBMS when you can get a free version of DB2?
A handy comparison of the editions is available on IBM’s web site.

Summary

So you see, saying DB2 is not enough any more. Which DB2? They’re all great, but it can take some time to wrap your arms around all of this…

Friday, April 29, 2011

I'll Be Tweeting Live From IDUG

For those of you who use Twitter, make sure you are following me next week (http://www.twitter.com/craigmullins) as I will be tweeting my experiences from the IDUG conference in Anaheim.

If you aren't planning to go, you can follow my Tweets to hear what is going on... and if you are attending the show, you can follow my Tweets to hear my perspective on things...

I arrive in Anaheim Tuesday afternoon, so I will miss the kickoff, but I'll be there the rest of the week.

Tuesday, April 26, 2011

100 Years of IBM


If you have anything at all to do with computers or information technology, you have something to thank IBM for. Watch this video to find out what!

Monday, April 04, 2011

What About Surrogate Keys?

As is so often the case with my blog, today's topic came about as the result of an e-mail question I received from a DBA I know. His question was this:

"A great debate rages here about the use of ‘synthetic’ keys. We read all sorts of articles on the wild wild web but none seem to address the database performance impacts of designs using synthetic keys. I wondered if you could point me to any information on this…"

If you've ever Googled the term "surrogate key" you know the hornet's nest of opinions that swirls around "out there" about the topic. For those who haven't heard the term, here is my attempt at a quick summary: a surrogate key is a generated unique value that is used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.
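
To make the definition concrete, here is a minimal sketch in Python using the built-in sqlite3 module. SQLite stands in for DB2 (an assumption, so the example runs anywhere), and the order_line tables and their columns are hypothetical. In DB2 the generated value would more likely come from an identity column or a sequence than from SQLite's AUTOINCREMENT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Natural key: the identifying columns themselves are the primary key.
    CREATE TABLE order_line_natural (
        region_code TEXT    NOT NULL,
        store_no    INTEGER NOT NULL,
        order_date  TEXT    NOT NULL,
        line_no     INTEGER NOT NULL,
        quantity    INTEGER NOT NULL,
        PRIMARY KEY (region_code, store_no, order_date, line_no)
    );

    -- Surrogate key: a generated value is the primary key. The natural
    -- columns are still stored and are still kept unique.
    CREATE TABLE order_line_surrogate (
        order_line_id INTEGER PRIMARY KEY AUTOINCREMENT,
        region_code   TEXT    NOT NULL,
        store_no      INTEGER NOT NULL,
        order_date    TEXT    NOT NULL,
        line_no       INTEGER NOT NULL,
        quantity      INTEGER NOT NULL,
        UNIQUE (region_code, store_no, order_date, line_no)
    );
""")

insert_sql = (
    "INSERT INTO order_line_surrogate "
    "(region_code, store_no, order_date, line_no, quantity) "
    "VALUES (?, ?, ?, ?, ?)"
)
conn.executemany(
    insert_sql,
    [("EAST", 12, "2011-04-04", 1, 5), ("EAST", 12, "2011-04-04", 2, 1)],
)

# The DBMS, not the application, supplied the key values.
generated_ids = [
    row[0]
    for row in conn.execute(
        "SELECT order_line_id FROM order_line_surrogate ORDER BY order_line_id"
    )
]
print(generated_ids)

# The natural key is still enforced: a repeated combination is rejected.
try:
    conn.execute(insert_sql, ("EAST", 12, "2011-04-04", 1, 9))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Note that the surrogate table carries everything the natural table does plus one more column and one more unique index.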

And here is the response I sent to my e-mail inquisitor:

I doubt that there is any “final word” on this topic. The debate has been raging on for years and years; some folks pro, others con. This Wikipedia article offers up a nice start: http://en.wikipedia.org/wiki/Surrogate_key

However, when I get to the performance area of this article I don’t think I agree. The article puts a lot of emphasis on there being fewer columns to join and therefore better performance. If you’ve got an index on those multiple columns how much “worse” will the performance be, really? Sure, the SQL is more difficult to write, but will a join over 4 or 5 indexed columns perform that much worse than a join on one indexed column? I suppose as the number of columns required for the natural key increases the impact could be greater (e.g. 10 columns???).

I guess I can see the argument if you are swapping a variable-length natural key for a fixed-length surrogate key – that should improve things!

Furthermore consider this: the natural key columns are still going to be there, after all, they are naturally part of the data, right? So the surrogate (synthetic) key gets added to each row. This will likely reduce the number of rows per page (maybe not, but probably). And that, in turn, will negatively impact the performance of sequential access because more I/O will be required to read the “same” number of rows.
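
A back-of-the-envelope sketch of that arithmetic, in Python. Every number here is an assumption picked for illustration (4,000 usable bytes per page, 100-byte rows, an 8-byte surrogate, one million rows); real page overhead and row formats vary by platform and release.

```python
def rows_per_page(usable_page_bytes: int, row_bytes: int) -> int:
    """Whole rows that fit on one page (a row is not split across pages here)."""
    return usable_page_bytes // row_bytes


def pages_to_scan(row_count: int, usable_page_bytes: int, row_bytes: int) -> int:
    """Pages a sequential scan must read to see every row (ceiling division)."""
    per_page = rows_per_page(usable_page_bytes, row_bytes)
    return -(-row_count // per_page)


# All figures below are illustrative assumptions, not DB2 specifications.
USABLE_PAGE_BYTES = 4000
NATURAL_ROW_BYTES = 100
SURROGATE_KEY_BYTES = 8
ROW_COUNT = 1_000_000

pages_before = pages_to_scan(ROW_COUNT, USABLE_PAGE_BYTES, NATURAL_ROW_BYTES)
pages_after = pages_to_scan(
    ROW_COUNT, USABLE_PAGE_BYTES, NATURAL_ROW_BYTES + SURROGATE_KEY_BYTES
)
print(pages_before, pages_after)
```

Under these assumptions the wider rows drop from 40 to 37 per page, so the same million rows take roughly 8 percent more pages to scan. With other row sizes the rows-per-page figure might not change at all, which is the “maybe not, but probably” caveat.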

And what about the impact of adding data? If there are a significant number of new rows being added at the same time by different processes there will be locking issues as they all try to put the new data on the same page (unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp [that must be tested to avoid duplicates]).

The one thing that usually causes me to tend to favor natural keys is just that – they are natural. If the data is naturally occurring it becomes easier for end users to remember it and use it. If it is a randomly generated surrogate nobody will actually know the data. Yes, this can be masked to a great degree by the manner in which you build your applications to access the data, but ad hoc access becomes quite difficult.

I guess the bottom line is that “it depends” on a lot of different things! No surprise there, I suppose.

Here are a few other resources with information (not so much on performance though) that you may or may not have reviewed already:

What do you think about natural keys versus surrogate keys? Surely some readers here have an opinion on this topic! If so, post it as a comment...