Monday, July 09, 2012

DB2 Hashing and Hash Organized Tables


Up until DB2 10, all DB2 data was retrieved using some form of indexing or scanning. With DB2 Version 10, a new access method called hashing is available. Of course, referring to hashing as “new” is somewhat disingenuous because it is an old, time-tested data processing technique. Indeed, IMS databases are founded upon hashing technology.

A hash, or hash function, is an algorithm that converts a defined set of data elements into a small number, usually a single integer that can serve as an index to an array or a storage location on disk. The values returned by a hash function are called hash values, hash sums, or simply hashes.

Figure 1 depicts basic hash functionality. Key values are processed by a hash algorithm (also known as a hash function, hash routine, or randomizer). The hash algorithm translates the key value into a storage location. When data is inserted the algorithm tells DB2 where to physically store the data; when data is accessed by the key, the algorithm tells DB2 where to find the data.



Figure 1.  How hashing operates.

Hashing is used primarily to optimize random I/O, such as for looking up a code table value or accessing a single row based on the value of its primary key. A hash access can outperform indexed access because fewer I/O operations are required to retrieve the data. The hash requires 1 I/O operations (possibly 2 if a hash collision occurs). An index requires at least 2 I/O operations, one to read the index page and one to read the data page. Of course, only the smallest of indexes consists of just a single root page; most consist of a root page, one or more non-leaf pages, and a leaf page, thereby requiring 3 or more I/O operations.

The traditional notion of clustering is not pertinent for hash organized tables. The data will be organized based on the hash algorithm. Index clustering is not permitted for tables that are hash organized.

The Hash Space

A hash space is a defined storage area used for organizing table data for hash access. Sufficient disk space must be allocated to the hash space to accommodate the hashed rows that will be added to the table. 

When the hash space is full, new rows will be relocated to the overflow index. As the amount of data in the overflow increases, hash access performance declines. This is so because DB2 must first go to the hashed location and then follow the pointer in the overflow index to find the data, thereby increasing the number of I/O operations required to retrieve the data. Refer to Figure 2 for an example. In this case, a new row for NEELD needs to be added, but there is no room in the hash space. So the hash space points to an overflow area where the new row can be stored.


Figure 2.  Hash overflow.

For this reason, a hash organized table might consume more disk space than a traditional table to minimize overflow rows.
Hash spaces can contain only a single table and must be defined in a universal table space — either partitioned by range or partitioned by growth. The hash overflow index for a table in a partition-by-range Universal table space will be a partitioned index. The hash overflow index for a table in a partition-by-growth Universal table space will be a non-partitioned index.

Creating Hash Organized Tables

Hash access can be specified for a table using the organization-clause of the CREATE TABLE statement: ORGANIZE BY HASH UNIQUE (column-names) HASH SPACE (hash-space-value).

The ORGANIZE BY HASH specification tells DB2 to use hashing for organizing the data of this table. The hash key is specified in the list of column names. By specifying UNIQUE, the table cannot contain more than one row with the same hash key value. And the HASH SPACE parameter defines the amount of fixed hash space to be pre-allocated for the table. For tables defined in a partition-by-range UTS, this space is for each partition.
Caution
Exercise care when specifying UNIQUE if the hash key can contain nulls. Although it is generally considered bad design for keys to contain nulls, columns specified as a hash key can be NULL. If the hash key can contain nulls and UNIQUE is also specified, all nulls for a column are considered equal. To clarify, if the hash key is a single nullable column, it can contain only one null.

For example, consider the following SQL that would create the sample DEPT table as a hash-organized table:

CREATE TABLE DSN81010.DEPT
 (DEPTNO    CHAR(3) NOT NULL,
  DEPTNAME  VARCHAR(36) NOT NULL,
  MGRNO     CHAR(6),
  ADMRDEPT  CHAR(3) NOT NULL,
  LOCATION  CHAR(16),
  PRIMARY KEY (DEPTNO)
 )
  IN DSN8D10A.DSN8S10D
  ORGANIZE BY HASH UNIQUE (DEPTNO)
  HASH SPACE 64 M;

Take care to specify a sufficient amount of storage for the hash space in the organization-clause. By doing so, the overflow index should have a small number of entries, or better yet, no entries at all. Hashing works best for tables that are static in size with limited growth (if any).

If you choose to convert an existing table to use hashing, be sure to walk through the following steps:
  1. Analyze each table you are considering for conversion. Look for static tables with unique keys that are frequently used for single fetch queries.
  2. Before replacing any existing indexes be sure that they are not used for range queries. Hashing is not used for range queries, so these indexes should not be dropped. Any index used solely for direct lookup can be dropped.
  3. Estimate the number of rows and average row size and calculate a sufficient hash space size.
  4. Issue the ALTER ADD organization-clause statement.
  5. Reorganize the table space specifying AUTOESTSPACE YES. Doing so allows DB2 to automatically estimate the best size for the hash space using real-time statistics.
  6. REBIND any applications using SQL with equality predicates on the hash key.


After converting, monitor real-time statistics to evaluate the table’s hash usage efficacy. In SYSIBM.SYSTABLESPACESTATS the REORGHASHACCESS column indicates the number of times that hash access was used to access the table. Furthermore, you can compare the REORGHASHACCESS column to the REORGSCANACCESS column to determine the number of time the table was accessed by the hash versus other techniques. You can also review the HASHLASTUSED column, which contains the date when hash access was last used for SELECT, FETCH, searched UPDATE, searched DELETE, or to enforce referential constraints.

In addition, be sure to verify that hash access is used where you expected. To accomplish this, check the ACCESSTYPE column of the PLAN_TABLE looking for the values H, HN, or MH—these indicate that hash access is used.

When to Use Hashing

Consider using hash organization for tables that are stable or predictable in size and where the data is accessed most often by its primary key (that is, the hash key). Hash-organized tables require an overflow index for rows that exceed the specified hash space. Rows in the overflow index are not hashed—DB2 must scan the index to retrieve rows in the overflow.

Table Space Size Options for Hashing

Choose the table space sizing options and parameters wisely for hash organized tables. Care must be taken to ensure that sufficient space is available for the hashing. For PBR Universal table spaces, be sure to specify an appropriate DSSIZE because it will be used to validate the hash space for each partition.

Consider specifying PRIQTY as –1 to allow DB2 to use the default primary space allocation. Furthermore, set DEFINE YES to ensure that the fixed table space is allocated successfully before the first access.

Free space also is an important consideration for hash organized table spaces. The PCTFREE specification is used by DB2 when the table space is reorganized with AUTOSPACE(YES) specification. DB2 uses the value of the DATASIZE column in the SYSIBM.SYSTABLESPACESTATS as the initial size of the hash table space, and increases that value by the value of PCTFREE to calculate the new size when you reorganize the table space.


Note:This blog post was adapted from the newly published 6th edition of Craig's best-selling book, DB2 Developer's Guide.