The Db2 Portal Blog: access paths


Friday, September 04, 2015

Influencing the DB2 Optimizer: Part 7 - Miscellaneous Additional Considerations

In this seventh and final installment of this series on influencing the DB2 optimizer's access path choices, we will take a look at a few additional things to consider as you work toward improving your SQL performance.

Favor Optimization Hints Over Updating the DB2 Catalog

Optimization hints to influence access paths are less intrusive and easier to implement than changing data in the DB2 Catalog. However, that does not mean that you should use optimization hints all the time! Do not use optimization hints as a crutch to arrive at a specific access path. Optimization hints are best used when an access path changes and you want to go back to a previous, efficient access path.

Limit Ordering to Avoid Scanning

The optimizer is more likely to choose an index scan when ordering is important (ORDER BY, GROUP BY, or DISTINCT) and the index is clustered by the columns to be sorted.

Maximize Buffers and Minimize Data Access

If the inner table fits in 2% of the buffer pool, nested loop join should be favored. Therefore, to increase the chances of nested loop joins, increase the size of the buffer pool (or decrease the size of the inner table, if possible).

Consider Deleting Non-uniform Distribution Statistics

Sometimes non-uniform distribution statistics can cause dynamic SQL statements to fluctuate dramatically in terms of how they perform. To decrease these wild fluctuations, consider removing the non-uniform distribution statistics from the DB2 Catalog.

Although dynamic SQL makes the best use of these statistics, the overall performance of some applications that heavily use dynamic SQL can suffer. The optimizer might choose a different access path for the same dynamic SQL statement, depending on the values supplied to the predicates. In theory, this should be the desired goal. In practice, however, the results might be unexpected. For example, consider the following dynamic SQL statement:

SELECT EMPNO, LASTNAME
FROM   DSN81010.EMP
WHERE  WORKDEPT = ?

The access path might change depending on the value of WORKDEPT because the optimizer calculates different filter factors for each value, based on the distribution statistics. As the number of occurrences of distribution statistics increases, the filter factor decreases. This makes DB2 think that fewer rows will be returned, which increases the chance that an index will be used and affects the choice of inner and outer tables for joins.

These statistics are stored in the SYSIBM.SYSCOLDIST and SYSIBM.SYSCOLDISTSTATS tables and can be removed using SQL DELETE statements.

This suggested guideline does not mean that you should always delete the non-uniform distribution statistics. My advice is quite to the contrary. When using dynamic SQL, allow DB2 the chance to use these statistics. Delete these statistics only when performance is unacceptable. (They can always be repopulated later using RUNSTATS.)
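As a minimal sketch of that cleanup, assuming the WORKDEPT column of the sample DSN81010.EMP table is the one behind the fluctuations, the statements below list the existing rows first (so the values are on record) and then delete them. The owner, table, and column values in the predicates are illustrative; substitute your own.

```sql
-- Review the distribution rows for the column before removing anything.
SELECT TBOWNER, TBNAME, NAME, TYPE, COLVALUE, FREQUENCYF, STATSTIME
FROM   SYSIBM.SYSCOLDIST
WHERE  TBOWNER = 'DSN81010'
AND    TBNAME  = 'EMP'
AND    NAME    = 'WORKDEPT';

-- Remove the table-level distribution statistics for the column.
DELETE FROM SYSIBM.SYSCOLDIST
WHERE  TBOWNER = 'DSN81010'
AND    TBNAME  = 'EMP'
AND    NAME    = 'WORKDEPT';

-- Remove the partition-level rows as well, if any exist.
DELETE FROM SYSIBM.SYSCOLDISTSTATS
WHERE  TBOWNER = 'DSN81010'
AND    TBNAME  = 'EMP'
AND    NAME    = 'WORKDEPT';
```

Keeping the output of the first SELECT gives you a record of what was removed, although rerunning RUNSTATS remains the simplest way to repopulate the rows.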

Collect More Than Just the Top Ten Non-uniform Distribution Statistics

If non-uniform distribution impacts more than just the top ten most frequently occurring values, you should use the FREQVAL option of RUNSTATS to capture more than 10 values. Capture only as many as will prove to be useful for optimizing queries against the non-uniformly distributed data.

DB2 Referential Integrity Use

Referential integrity (RI) is the implementation of constraints between tables so that values from one table (the parent) control the values in another (the dependent, or child). A referential constraint between a parent table and a dependent table is defined by a relationship between the columns of the tables. The parent table’s primary key columns control the values permissible in the dependent table’s foreign key columns. For example, in the sample table, DSN8810.EMP, the WORKDEPT column (the foreign key) must reference a valid department as defined in the DSN8810.DEPT table’s DEPTNO column (the primary key).

You have two options for implementing RI at your disposal: declarative and application. Declarative constraints provide DB2-enforced referential integrity and are specified by DDL options. All modifications, whether embedded in an application program or ad hoc, must comply with the referential constraints. Favor using declarative RI as DB2 will then be aware of the relationship and can use that information during access path optimization.
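For illustration, a declarative constraint for the WORKDEPT example might be defined roughly as follows. This is only a sketch: the constraint name EMP_DEPT_FK and the delete rule are assumptions made for the example, and the sample database may already define this relationship under a different name.

```sql
-- Declarative (DB2-enforced) RI: WORKDEPT must match a DEPTNO in DEPT.
-- EMP_DEPT_FK is a hypothetical constraint name.
ALTER TABLE DSN8810.EMP
  ADD CONSTRAINT EMP_DEPT_FK
      FOREIGN KEY (WORKDEPT)
      REFERENCES DSN8810.DEPT (DEPTNO)
      ON DELETE SET NULL;
```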

Application-enforced referential integrity is coded into application programs. Every program that can update referentially-constrained tables must contain logic to enforce the referential integrity. This type of RI is not applicable to ad hoc updates.

With DB2-enforced RI, CPU use is reduced because the Data Manager component of DB2 performs DB2-enforced RI checking, whereas the RDS component of DB2 performs application-enforced RI checking. Additionally, rows accessed for RI checking when using application-enforced RI must be passed back to the application from DB2. DB2-enforced RI does not require this passing of data, further reducing CPU time.

In addition, DB2-enforced RI uses an index (if one is available) when enforcing the referential constraint. In application-enforced RI, index use is based on the SQL used by each program to enforce the constraint.

If you must use application RI instead of declarative RI, be sure to also define referential constraints with the NOT ENFORCED keyword. In that case, the constraints will not be enforced by DB2, but will be documented in the DDL. And it gives DB2 additional information that can be used by the Optimizer for query optimization.
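A sketch of such an informational constraint, again using the EMP and DEPT sample tables, could look like this. The constraint name EMP_DEPT_INFO is hypothetical, and the example assumes no enforced constraint already exists on the same columns.

```sql
-- Informational constraint: documented in the DDL and visible to the
-- optimizer, but DB2 does not check it. The application remains
-- responsible for keeping WORKDEPT values valid.
ALTER TABLE DSN8810.EMP
  ADD CONSTRAINT EMP_DEPT_INFO
      FOREIGN KEY (WORKDEPT)
      REFERENCES DSN8810.DEPT (DEPTNO)
      NOT ENFORCED;
```

Because DB2 trusts the relationship without verifying it, the application logic must be reliable; the optimizer may make choices based on a relationship that the data does not actually honor.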

Summary

Hopefully this 7-part series on influencing DB2 access paths provided you with a nice overview of the options available to you and considerations for their use. If you are interested in learning more about SQL tuning and DB2 performance, consider purchasing the book from which this series was drawn: DB2 Developer's Guide 6th edition.

Happy SQL performance tuning!

Wednesday, August 12, 2015

Influencing the DB2 Optimizer: Part 5 - Changing DB2 Catalog Statistics

When the standard methods of influencing DB2’s access path selection -- as discussed in the first 4 parts of this series (1, 2, 3, 4) -- do not produce satisfactory results, you can take matters into your own hands and resort to updating the statistics in the DB2 Catalog. But this should really be a last resort because there are so many interrelated statistics these days, and manually changing them is fraught with peril. Still, sometimes you may just want to DIY -- do it yourself -- so let's talk about changing DB2 Catalog statistics.

First of all, the way you change a DB2 Catalog statistic is the way you would change any other piece of data in a DB2 table - using SQL UPDATE, INSERT, and DELETE statements. But only certain DB2 Catalog statistics can be modified using SQL instead of the normal method using RUNSTATS. Furthermore, SQL modification of the DB2 Catalog can be performed only by a SYSADM or SECADM.

Table 1 details the DB2 Catalog statistics that can be modified. You can use this table to determine which DB2 Catalog columns can be modified (updated or inserted using SQL) and which are used by the optimizer during sequential and parallel access path determination. Keep in mind, though, that certain DB2 Catalog tables that can be updated, for example SYSIBM.IPLIST, are not shown in this table because the data in those tables is not relevant to statistics and SQL performance tuning. Additionally, historical DB2 Catalog statistics (those tables ending in HIST) and data in the Real Time Stats tables can also be modified using SQL, but neither is used by the DB2 Optimizer.

Table 1. The Updateable DB2 Catalog Statistics

Catalog Table	Column	How Used?	Description
SYSCOLDIST	FREQUENCYF	Y	Percentage that COLVALUE in the column named in NAME occurs
	COLVALUE	Y	Column value for this statistic
	CARDF	Y	Number of distinct values
	COLGROUPCOLNO	Y	The set of columns for the statistics
	NUMCOLUMNS	Y	Number of columns for the statistics
	TYPE	Y	Type of stats: C for cardinality, F for frequent value, H for histogram, N for non-padded frequent value
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSCOLDISTSTATS	PARTITION	N	The partition to which this statistic applies
	FREQUENCYF	N	Percentage that COLVALUE in the column named in NAME occurs
	COLVALUE	N	Column value for this statistic
	TYPE	N	Type of statistics (cardinality, frequent value, histogram, or non-padded frequent value)
	CARDF	N	Number of distinct values
	COLGROUPCOLNO	N	The set of columns for the statistics
	KEYCARDDATA	N	Representation of the estimate of distinct values in this partition
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSCOLSTATS	LOWKEY	Y	Lowest value for the column
	LOW2KEY	Y	Second lowest value for the column
	HIGHKEY	Y	Highest value for the column
	HIGH2KEY	Y	Second highest value for the column
	COLCARD	Y	Number of distinct values for the column
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSCOLUMNS	LOW2KEY	Y	Second lowest value for the column
	HIGH2KEY	Y	Second highest value for the column
	COLCARDF	Y	Number of distinct values for the column
	FOREIGNKEY	N	Indicates the subtype of CLOB data: B for bit data, M for mixed data, S for SBCS data.
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSINDEXES	CLUSTERRATIOF	Y	Percentage of rows in clustered order
	CLUSTERED	N	Indicates whether the table space is actually clustered
	FIRSTKEYCARDF	Y	Number of distinct values for the first column of the index key
	FULLKEYCARDF	Y	Number of distinct values for the full index key
	NLEAF	Y	Number of active leaf pages
	NLEVELS	Y	Number of index b-tree levels
	DATAREPEATFACTOR	Y	The anticipated number of data pages to be touched following an index key order.
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSINDEXPART	DSNUM	N	Number of data sets
	LEAFFAR	N	Number of leaf pages far from previous leaf page
	LEAFNEAR	N	Number of leaf pages near previous leaf page
	PSEUDO_DEL_ENTRIES	N	Number of pseudo deleted index keys
	SPACEF	N	Disk storage space
SYSINDEXSTATS	CLUSTERRATIOF	N	Percentage of rows in clustered order
	FIRSTKEYCARDF	N	Number of distinct values for the first column of the index key
	FULLKEYCARDF	N	Number of distinct values for the full index key
	FULLKEYCARDDATA	N	Representation of number of distinct values of the full key
	NLEAF	N	Number of active leaf pages
	NLEVELS	N	Number of index b-tree levels
	KEYCOUNTF	N	Number of rows in the partition
	DATAREPEATFACTOR	N	The anticipated number of data pages to be touched following an index key order
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSROUTINES	IOS_PER_INVOC	Y	Estimated number of I/Os per invocation of the routine
	INSTS_PER_INVOC	Y	Estimated number of instructions per invocation of the routine
	INITIAL_IOS	Y	Estimated number of I/Os for the first invocation of the routine
	INITIAL_INSTS	Y	Estimated number of instructions for the first invocation of the routine
	CARDINALITY	Y	Predicted cardinality of a table function
SYSTABLEPART	DSNUM	N	Number of data sets
	EXTENTS	N	Number of data set extents
	SPACEF	N	Disk storage space
SYSTABLES	CARDF	Y	Number of rows for a table
	NPAGES	Y	Number of pages on which rows of the table appear
	NPAGESF	Y	Number of pages used by the table
	PCTPAGES	N	Percentage of tablespace pages that contain rows for this table
	PCTROWCOMP	Y	Percentage of rows compressed
	AVGROWLEN	N	Average row length
	SPACEF	N	Disk storage space
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSTABLESPACE	NACTIVEF	Y	Number of allocated tablespace pages
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics
SYSTABSTATS	CARDF	Y	Number of rows for the partition
	NPAGES	Y	Number of pages used by the partition
	NACTIVE	Y	Number of active pages in the partition
	PCTPAGES	Y	Percentage of table space pages that contain rows for this partition
	PCTROWCOMP	Y	Percentage (100) of rows compressed
	STATSTIME	N	Indicates the time RUNSTATS was run to generate these statistics

Legend:

N = Not used by the optimizer

Y = Used by the optimizer

The two predominant reasons for changing DB2 Catalog statistics to influence access paths are to try to get DB2 to use an index and to influence DB2 to change the order in which tables are joined. In each case, the tuning methods require that you “play around” with the DB2 Catalog statistics to create a lower filter factor. You should keep in mind five rules when doing so.

Rule 1: As first key cardinality (FIRSTKEYCARDF) increases, the filter factor decreases. As the filter factor decreases, DB2 is more inclined to use an index to satisfy the SQL statement.

Rule 2: As an index becomes more clustered, you increase the probability that DB2 will use it. To enhance the probability of an unclustered index being used, increase its cluster ratio (CLUSTERRATIOF) to a value between 96 and 100, preferably 100.

So, understanding these rules, if you wish to influence DB2 to use an index by changing statistics, consider adjusting the COLCARDF, FIRSTKEYCARDF, and FULLKEYCARDF columns to an artificially high value. As cardinality increases, the filter factor decreases. As the filter factor decreases, the chance that DB2 will use an available index becomes greater. DB2 assumes that a low filter factor means that only a few rows are being returned, causing indexed access to be more efficient. Adjusting COLCARDF, FIRSTKEYCARDF, and FULLKEYCARDF may also be useful for getting DB2 to choose an unclustered index because DB2 is more reluctant to use an unclustered index with higher filter factors. You also can change the value of CLUSTERRATIOF to 100 to remove DB2’s reluctance to use unclustered indexes from the access path selection puzzle.
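As a rough sketch of what such an adjustment looks like, the statements below raise the cardinality statistics for one index and its leading column. Every name and number here is hypothetical (MYSCHEMA, MYTABLE, MYINDEX, MYCOL, and the value 100000); they are placeholders rather than recommended settings, and the example assumes a single-column index so that the three cardinalities can reasonably share one value.

```sql
-- Raise the index cardinalities (hypothetical index MYSCHEMA.MYINDEX).
UPDATE SYSIBM.SYSINDEXES
SET    FIRSTKEYCARDF = 100000,
       FULLKEYCARDF  = 100000
WHERE  CREATOR = 'MYSCHEMA'
AND    NAME    = 'MYINDEX';

-- Keep the column cardinality consistent with the index statistics
-- (hypothetical column MYCOL of table MYSCHEMA.MYTABLE).
UPDATE SYSIBM.SYSCOLUMNS
SET    COLCARDF  = 100000
WHERE  TBCREATOR = 'MYSCHEMA'
AND    TBNAME    = 'MYTABLE'
AND    NAME      = 'MYCOL';
```

Static SQL picks up the new values only after a rebind, so check the resulting access path with EXPLAIN before relying on the change.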

Rule 3: In a join, DB2’s choice for inner and outer tables is a delicate trade-off. Because the inner table is accessed many times for each qualifying outer table row, it should be as small as possible to reduce the time needed to scan multiple rows for each outer table row. The more inner table rows, the longer the scan. But the outer table should also be as small as possible to reduce the overhead of opening and closing the internal cursor on the inner table.

It is impossible to choose the smallest table as both the inner table and the outer table. When two tables are joined, one must be chosen as the inner table, and the other must be chosen as the outer table. My experience has shown that as the size of a table grows, the DB2 optimizer favors using it as the outer table in a nested loop join. Therefore, changing the cardinality (CARDF) of the table that you want as the outer table to an artificially high value can influence DB2 to choose that table as the outer table.
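A minimal sketch of that change, assuming a hypothetical table MYSCHEMA.BIGTABLE that you want DB2 to treat as the outer table, and an arbitrary illustrative row count:

```sql
-- Inflate the row count of the table intended to be the outer table.
-- MYSCHEMA.BIGTABLE and the value 50000000 are placeholders.
UPDATE SYSIBM.SYSTABLES
SET    CARDF   = 50000000
WHERE  CREATOR = 'MYSCHEMA'
AND    NAME    = 'BIGTABLE';
```

Note the original CARDF value before running the update so that it can be restored if the join order does not change as hoped.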

Rule 4: As column cardinality (COLCARDF) decreases, DB2 favors the use of the nested loop join over the merge scan join. Lowering the value of COLCARDF can be used to try to favor the nested loop join.

Rule 5: HIGH2KEY and LOW2KEY can be altered to more accurately reflect the overall range of values stored in a column. This is particularly useful for influencing access path selection for data with a skewed distribution.

The combination of HIGH2KEY and LOW2KEY provides a range of probable values accessed for a particular column. The absolute highest and lowest values are discarded to create a more realistic range. For certain types of predicates, DB2 uses the following formula when calculating filter factor:

Filter Factor = (Value-LOW2KEY) / (HIGH2KEY-LOW2KEY)

Because HIGH2KEY and LOW2KEY can affect the size of the filter factor, the range of values that they provide can significantly impact access path selection.
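To see how sensitive the formula is, here is a small worked example with made-up numbers: a predicate value of 50, a LOW2KEY of 10, and a HIGH2KEY of 90, followed by the same calculation after LOW2KEY is raised to 40. The numbers are purely illustrative, and the query simply performs the arithmetic; it does not read any statistics.

```sql
-- Filter Factor = (Value - LOW2KEY) / (HIGH2KEY - LOW2KEY)
-- Before: (50 - 10) / (90 - 10) = 40 / 80
-- After raising LOW2KEY to 40: (50 - 40) / (90 - 40) = 10 / 50
SELECT (50 - 10) / (90 - 10.0) AS FF_BEFORE,
       (50 - 40) / (90 - 40.0) AS FF_AFTER
FROM   SYSIBM.SYSDUMMY1;
```

The first ratio works out to 0.5 and the second to 0.2, so a change to a single statistic cuts the estimated fraction of qualifying rows by more than half.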

For troublesome queries, check whether the distribution of data in the columns accessed is skewed. If you query SYSIBM.SYSCOLDIST, the most frequently occurring values are shown for indexed columns. To be absolutely accurate, however, obtain a count for each column value, not just the top values collected by RUNSTATS, using a query such as:

SELECT     COL, COUNT(*)
FROM       your.table
GROUP BY   COL
ORDER BY   COL;

This query produces an ordered listing of column values. You can use this list to determine the distribution of values. If a few values occur much more frequently than the other values, the data is not evenly distributed. In this circumstance, consider using dynamic SQL, hard coding predicate values, or binding with REOPT(ALWAYS). This enables DB2 to use nonuniform distribution statistics when calculating filter factors.
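To compare the actual counts with what the optimizer currently knows, you can also look at the frequent-value rows that RUNSTATS stored for the column. The following is a sketch; MYSCHEMA, MYTABLE, and MYCOL are hypothetical placeholders for your own table owner, table, and column.

```sql
-- Frequent-value statistics currently recorded for one column.
-- TYPE = 'F' selects the frequent-value rows.
SELECT NAME, COLVALUE, FREQUENCYF, STATSTIME
FROM   SYSIBM.SYSCOLDIST
WHERE  TBOWNER = 'MYSCHEMA'
AND    TBNAME  = 'MYTABLE'
AND    NAME    = 'MYCOL'
AND    TYPE    = 'F'
ORDER BY FREQUENCYF DESC;
```

If the heavily occurring values found by the COUNT(*) query are missing from this result, the optimizer is estimating without knowledge of the skew.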

If neither dynamic SQL nor hard-coded predicates are practical, you might try to change HIGH2KEY to a lower value and/or LOW2KEY to a higher value to reduce the range of possible values, thereby lowering the filter factor. Alternatively, or additionally, you can increase COLCARDF, FIRSTKEYCARDF, and FULLKEYCARDF.

Remember that modifying DB2 Catalog statistics is not a trivial exercise. Simply making the changes indicated in this section might be insufficient to resolve your performance problems because of DB2’s knowledge of the DB2 Catalog statistics. Some statistical values have implicit relationships. When one value changes, DB2 assumes that the others have changed also. For example, consider these relationships:

When you change COLCARDF for a column in an index, be sure to also change the FIRSTKEYCARDF of any index in which the column participates as the first column of the index key, and the FULLKEYCARDF of any index in which the column participates.
Provide a value to both HIGH2KEY and LOW2KEY when you change cardinality information. When COLCARDF is not –1, DB2 assumes that statistics are available. DB2 factors these high and low key values into its access path selection decision. Failure to provide both a HIGH2KEY and a LOW2KEY can result in the calculation of inaccurate filter factors and the selection of inappropriate access paths.

Never update DB2 Catalog statistics to force DB2 to choose different access paths without first documenting the following:

Why the statistics will be modified
How the modifications will be made and how frequently the changes must be run
The current values for each statistic and the values they will be changed to
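One simple way to record the current values is to run a query like the one below and keep its output with your documentation. This is a sketch using hypothetical names (MYSCHEMA, MYTABLE, MYCOL); the HEX function is used only so that the internally formatted key values can be saved in readable form.

```sql
-- Snapshot of the column statistics before any manual change.
SELECT CURRENT TIMESTAMP AS CAPTURED_AT,
       TBCREATOR, TBNAME, NAME,
       COLCARDF,
       HEX(LOW2KEY)  AS LOW2KEY_HEX,
       HEX(HIGH2KEY) AS HIGH2KEY_HEX,
       STATSTIME
FROM   SYSIBM.SYSCOLUMNS
WHERE  TBCREATOR = 'MYSCHEMA'
AND    TBNAME    = 'MYTABLE'
AND    NAME      = 'MYCOL';
```

The same approach applies to SYSIBM.SYSINDEXES and SYSIBM.SYSTABLES for any index or table statistics you intend to change.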

Additionally, be aware that when you change DB2 Catalog statistics, you are robbing from Peter to pay Paul. In other words, your changes might enhance the performance of one query at the expense of the performance of another query. DB2 maintenance (PTFs, new releases, and new versions) might change the access path selection logic in the DB2 optimizer. As a result of applying maintenance, binding or rebinding static and dynamic SQL operations could result in different access paths, thereby invalidating your hard work. In other words, IBM might get around to correcting the problem in the logic of the optimizer (that you solved using trickery).

Choosing the correct values for the statistics and keeping the statistics accurate can be an intimidating task. Do not undertake this endeavor lightly. Plan to spend many hours changing statistics, rebinding plans, changing statistics again, rebinding again, and so on.

The situation that caused the need to tinker with the statistics in the DB2 Catalog could change. For example, the properties of the data could vary as your application ages. Distribution, table and column cardinality, and the range of values stored could change. If the statistics are not changing because they have been artificially set outside the jurisdiction of RUNSTATS, these newer changes to the data cannot be considered by the DB2 optimizer, and an inefficient access path could be used indefinitely.

When DB2 Catalog statistics have been changed to influence access path selection, it is a good idea to periodically execute RUNSTATS and rebind to determine if the artificial statistics are still required. If they are, simply reissue the DB2 Catalog UPDATE statements. If not, eliminate this artificial constraint from your environment. Failure to implement this strategy eventually results in inefficient access paths in your environment (as DB2 and your applications mature).

This blog post was adapted from material in Craig's best-selling book on DB2 for z/OS, DB2 Developer's Guide. If you are looking for more in-depth tuning, performance, and administration guidelines for your mainframe DB2 environment, be sure to buy yourself a copy!