Tuesday, November 21, 2006

Character Versus Numeric Data Types

Most DBAs have faced the situation where one of their applications requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, users or developers wish the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks, if any, to doing this?

Well, there are drawbacks! Without proper edit checks, INSERTs and UPDATEs could place invalid alphabetic characters into the alphanumeric column. This can be a very valid concern if ad hoc data modifications are permitted. Although it is uncommon to allow ad hoc modifications to production databases, data problems can still occur if all of the proper edit checks are not coded into every program that can modify the data. But let's assume (big assumption) that proper edit checks are coded and will never be bypassed. This removes the data integrity question.

There is another problem, though, that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if programmatic edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 3**74 or 1,874,161 possible values.

A SMALLINT column can range from -32,768 to 32,767 producing 65,536 possible small integer values. The drawback here is that negative codes and/or 5 digit codes could be entered. Both do not conform to the 4 digit maximum. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well.

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT over the CHAR(4) definition.

The leading zeroes problem should be able to be solved using other methods. It is not necessary to store the data in the same format that users wish to display it. For example, when using QMF, you can ensure that leading zeroes are shown in reports by using the "J" edit code. Other report writes offer similar functionality. And report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields.

In general, it is wise to choose a data type which is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. Same goes for temporal data (that is, choose DATE, TIME, or TIMESTAMP instead of a character or numeric data type). In addition, always be sure to code appropriate edit checks to ensure data integrity - but remember, fewer need to be coded if you choose the correct data type because DB2 automatically prohibits data that does not conform to the data type for each column.


Anonymous said...

I didnt find thing that i need... :-(

Craig S. Mullins said...

Sorry to hear that... It would be a lot better if you stated what you were looking for though, then maybe someone could help.

Slacker said...

Awesome Blog!!!!!

"Anonymous said...
I didnt find thing that i need... :-("

I think either you did not read the Title of the Blog or you were searching for apples on a Blog that talks about oranges!

Anonymous said...

I have a table with one of the column as below:

This column though char type contains only numeric values.

Now when running any query on this column giving quotes(being a CHAR data type) is NOT fetching the concerned record but without quotes it is. Can someone explain me ?

WHERE ACC_NMBR = ‘00000001000266262376’; ----- ZERO RECORDS FOUND;no error

WHERE ACC_NMBR = 00000001000266262376; ------ 1 record found.

Medha Agrawal said...

Can someone please explain why the CHAR column with all numeric values does not fetch record if value is in quotes?

Field Definition of ACC_NMBR( all values stored in the table are numeric but defined as CHAR type)

WHERE ACC_NMBR = ‘00000001000123456789’; -------- returns zero records

WHERE ACC_NMBR = 00000001000123456789; ---- returns 1 record.

Craig S. Mullins said...

Consider looking into the FOR SBCS specification and code pages. Here is a starting place: https://developer.ibm.com/technologies/databases/articles/dm-0506chong/