Another utility upgrade that has found its way into DB2 9 for z/OS is the ability to gather histogram statistics. This feature is already available in DB2 for Linux, Unix, and Windows… and after you migrate to DB2 V9 it will be available to you on z/OS.
What is it? Well, let’s first define what a histogram is for those of you who are not statistics experts. A histogram is a way of summarizing data that is measured on an interval scale. A histogram is particularly helpful to quickly highlight how data is distributed; to determine if data is symmetrical or skewed; and to indicate whether or not outliers exists.
The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets. Histogram statistics can be quite useful to the optimizer for certain types of queries.
Instead of the frequency statistics, which are collected for only a subset of the data, sometimes DB2 can improve access path selection by estimating predicate selectivity from histogram statistics, which are collected over all values in a table space.
Consider collecting histogram statistics to improve access paths for troublesome queries with RANGE, LIKE, and BETWEEN predicates. They can also help in some cases for =, IS NULL, IN LIST and COL op COL predicates.
How to Collect Histogram Statistics
IBM RUNSTATS in DB2 V9 can collect statistics by quantiles. DB2 allows up to 100 quantiles. The user can specify how many quantiles DB2 is to use from 1 to 100. Of course, avoid 1 because it will not help.
You can tell RUNSTATS to collect histogram statistics by coding the HISTOGRAM keyword in conjunction with the COLGROUP option. In this way you can collect histogram statistics for a group of columns. You must also tell DB2 the number of quantiles to collect by specifying the NUMQUANTILES parameter. NUMQUANTILES can also be specified with the INDEX parameter, in which can it indicates that histogram statistics are to be collected for the columns of the index.
A single value can never broken into more than one interval. This means that the maximum number of intervals is equal to the number of distinct column values. Therefore, be sure that you do not specify a value for NUMQUANTILES that is greater than the total number of distinct values for the column (or column group) specified. Also, keep in mind that any NULLs will occupy a single interval.
So then, how do you decide on the number of quantiles to collect? If you do not specify NUMQUANTILES, the default value of 100 will be used, and then based on the number of records in the table, the number will be readjusted to an optimal number. Therefore, unless you have a good understanding of the application or a viable reason to deviate, a good rule of thumb is to simply let the NUMQUANTILES default and let DB2 work it out.
RUNSTATS will produce an equal-depth histogram. This means that each interval will have about the same number of rows. Please note that this does not mean the same number of values – it is the same number of rows. This means that in some cases a highly frequent single value could potentially occupy an interval all by itself.
The histogram statistics are collected in three new columns: QUANTILENO, LOWVALUE, and HIGHVALUE. These columns can be found in the following six DB2 Catalog tables:
- SYSIBM.SYSCOLDIST
- SYSIBM.SYSKEYTGTDIST
- SYSIBM.SYSCOLDIST_HIST
- SYSIBM.SYSCOLDISTSTATS
- SYSIBM.SYSKEYTGTDIST_HIST
- SYSIBM.SYSKEYTGTDISTSTATS.
Here is an example of a RUNSTATS to gather histogram statistics for the key columns of the indexes.:
RUNSTATS TABLESPACE DB07.CSMTS02
INDEX ALL
HISTOGRAM NUMCOLS 2 NUMQUANTILES 10
SHRLEVEL(CHANGE)
UPDATE ALL
REPORT YES
Summary
Histogram statistics is a very powerful new capability of the RUNSTATS utility that can be used to gather distribution statistics across all data values. These statistics can be helpful when you need additional distribution data to enable the optimizer to arrive at a better access path for certain queries/predicates.