Up until DB2 10, all DB2 data
was retrieved using some form of indexing or scanning. With DB2 Version 10, a
new access method called hashing is available. Of course, referring to hashing
as “new” is somewhat disingenuous because it is an old, time-tested data
processing technique. Indeed, IMS databases are founded upon hashing
technology.
A hash,
or hash function, is an algorithm that converts a defined set of data elements
into a small number, usually a single integer that can serve as an index to an
array or a storage location on disk. The values returned by a hash function are
called hash values, hash sums, or simply hashes.
Figure 1 depicts basic hash
functionality. Key values are processed by a hash algorithm (also known as a
hash function, hash routine, or randomizer). The hash algorithm translates the
key value into a storage location. When data is inserted the algorithm tells
DB2 where to physically store the data; when data is accessed by the key, the
algorithm tells DB2 where to find the data.
Figure 1. How
hashing operates.
Hashing is used primarily to
optimize random I/O, such as for looking up a code table value or accessing a
single row based on the value of its primary key. A hash access can outperform
indexed access because fewer I/O operations are required to retrieve the data.
The hash requires 1 I/O operations (possibly 2 if a hash collision occurs). An
index requires at least 2 I/O operations, one to read the index page and one to
read the data page. Of course, only the smallest of indexes consists of just a
single root page; most consist of a root page, one or more non-leaf pages, and
a leaf page, thereby requiring 3 or more I/O operations.
The traditional notion of
clustering is not pertinent for hash organized tables. The data will be
organized based on the hash algorithm. Index clustering is not permitted for
tables that are hash organized.
The
Hash Space
A hash space is a defined storage area used for organizing table data for hash
access. Sufficient disk space must be allocated to the hash space to accommodate
the hashed rows that will be added to the table.
When the hash space is full,
new rows will be relocated to the overflow index. As the amount of data in the
overflow increases, hash access performance declines. This is so because DB2
must first go to the hashed location and then follow the pointer in the
overflow index to find the data, thereby increasing the number of I/O
operations required to retrieve the data. Refer to Figure 2 for an example. In
this case, a new row for NEELD needs to be added, but there is no room in the hash space. So the
hash space points to an overflow area where the new row can be stored.
Figure 2. Hash
overflow.
For this reason, a hash
organized table might consume more disk space than a traditional table to
minimize overflow rows.
Hash spaces can contain only a
single table and must be defined in a universal table space — either
partitioned by range or partitioned by growth. The hash overflow index for a
table in a partition-by-range Universal table space will be a partitioned
index. The hash overflow index for a table in a partition-by-growth Universal
table space will be a non-partitioned index.
Creating
Hash Organized Tables
Hash access can be specified
for a table using the organization-clause of the CREATE
TABLE statement: ORGANIZE
BY HASH UNIQUE (column-names) HASH SPACE (hash-space-value).
The ORGANIZE BY HASH specification
tells DB2 to use hashing for organizing the data of this table. The hash key is
specified in the list of column names. By specifying UNIQUE, the table cannot contain
more than one row with the same hash key value. And the HASH SPACE parameter defines the amount
of fixed hash space to be pre-allocated for the table. For tables defined in a
partition-by-range UTS, this space is for each partition.
Caution
Exercise care when specifying UNIQUE if the hash key can contain nulls. Although it is generally
considered bad design for keys to contain nulls, columns specified as a hash
key can be NULL. If the hash key can contain nulls and UNIQUE is also specified, all nulls
for a column are considered equal. To clarify, if the hash key is a single
nullable column, it can contain only one null.
For example, consider the
following SQL that would create the sample DEPT table as a hash-organized
table:
CREATE TABLE DSN81010.DEPT
(DEPTNO CHAR(3) NOT NULL,
DEPTNAME VARCHAR(36) NOT NULL,
MGRNO CHAR(6),
ADMRDEPT CHAR(3) NOT NULL,
LOCATION CHAR(16),
PRIMARY KEY
(DEPTNO)
)
IN
DSN8D10A.DSN8S10D
ORGANIZE BY
HASH UNIQUE (DEPTNO)
HASH SPACE 64
M;
Take care to specify a
sufficient amount of storage for the hash space in the organization-clause. By doing so, the overflow index should have a small number
of entries, or better yet, no entries at all. Hashing works best for tables
that are static in size with limited growth (if any).
If you choose to convert an
existing table to use hashing, be sure to walk through the following steps:
- Analyze each table you are considering for conversion. Look for static tables with unique keys that are frequently used for single fetch queries.
- Before replacing any existing indexes be sure that they are not used for range queries. Hashing is not used for range queries, so these indexes should not be dropped. Any index used solely for direct lookup can be dropped.
- Estimate the number of rows and average row size and calculate a sufficient hash space size.
- Issue the ALTER ADD organization-clause statement.
- Reorganize the table space specifying AUTOESTSPACE YES. Doing so allows DB2 to automatically estimate the best size for the hash space using real-time statistics.
- REBIND any applications using SQL with equality predicates on the hash key.
After converting, monitor
real-time statistics to evaluate the table’s hash usage efficacy. In SYSIBM.SYSTABLESPACESTATS
the REORGHASHACCESS
column indicates the number of times that hash access was used to access the
table. Furthermore, you can compare the REORGHASHACCESS column to the REORGSCANACCESS column to determine the number of time the table was accessed by
the hash versus other techniques. You can also review the HASHLASTUSED column, which contains the
date when hash access was last used for SELECT, FETCH, searched UPDATE, searched DELETE, or to enforce referential constraints.
In addition, be sure to verify
that hash access is used where you expected. To accomplish this, check the ACCESSTYPE column of the PLAN_TABLE looking for the values H, HN, or MH—these indicate that hash
access is used.
When
to Use Hashing
Consider using hash
organization for tables that are stable or predictable in size and where the
data is accessed most often by its primary key (that is, the hash key).
Hash-organized tables require an overflow index for rows that exceed the
specified hash space. Rows in the overflow index are not hashed—DB2 must scan
the index to retrieve rows in the overflow.
Table
Space Size Options for Hashing
Choose the table space sizing
options and parameters wisely for hash organized tables. Care must be taken to
ensure that sufficient space is available for the hashing. For PBR Universal
table spaces, be sure to specify an appropriate DSSIZE because it will be used to validate the hash space for each
partition.
Consider specifying PRIQTY as –1 to allow DB2 to use the
default primary space allocation. Furthermore, set DEFINE YES to ensure that the fixed
table space is allocated successfully before the first access.
Free space also is an
important consideration for hash organized table spaces. The PCTFREE specification is used by DB2
when the table space is reorganized with AUTOSPACE(YES) specification. DB2 uses the value of the DATASIZE column in the SYSIBM.SYSTABLESPACESTATS as
the initial size of the hash table space, and increases that value by the value
of PCTFREE to calculate
the new size when you reorganize the table space.
Note:This blog post was adapted from the newly published 6th edition of Craig's best-selling book, DB2 Developer's Guide.