Monday, July 09, 2012

DB2 Hashing and Hash Organized Tables


Up until DB2 10, all DB2 data was retrieved using some form of indexing or scanning. With DB2 Version 10, a new access method called hashing is available. Of course, referring to hashing as “new” is somewhat disingenuous because it is an old, time-tested data processing technique. Indeed, IMS databases are founded upon hashing technology.

A hash, or hash function, is an algorithm that converts a defined set of data elements into a small number, usually a single integer that can serve as an index to an array or a storage location on disk. The values returned by a hash function are called hash values, hash sums, or simply hashes.

Figure 1 depicts basic hash functionality. Key values are processed by a hash algorithm (also known as a hash function, hash routine, or randomizer). The hash algorithm translates the key value into a storage location. When data is inserted the algorithm tells DB2 where to physically store the data; when data is accessed by the key, the algorithm tells DB2 where to find the data.



Figure 1.  How hashing operates.

Hashing is used primarily to optimize random I/O, such as for looking up a code table value or accessing a single row based on the value of its primary key. A hash access can outperform indexed access because fewer I/O operations are required to retrieve the data. Hash access requires 1 I/O operation (possibly 2 if a hash collision occurs). An index requires at least 2 I/O operations: one to read the index page and one to read the data page. Of course, only the smallest of indexes consists of just a single root page; most consist of a root page, one or more non-leaf pages, and a leaf page, thereby requiring 3 or more I/O operations.

The traditional notion of clustering is not pertinent for hash organized tables. The data will be organized based on the hash algorithm. Index clustering is not permitted for tables that are hash organized.

The Hash Space

A hash space is a defined storage area used for organizing table data for hash access. Sufficient disk space must be allocated to the hash space to accommodate the hashed rows that will be added to the table. 

When the hash space is full, new rows will be relocated to the overflow index. As the amount of data in the overflow increases, hash access performance declines. This is so because DB2 must first go to the hashed location and then follow the pointer in the overflow index to find the data, thereby increasing the number of I/O operations required to retrieve the data. Refer to Figure 2 for an example. In this case, a new row for NEELD needs to be added, but there is no room in the hash space. So the hash space points to an overflow area where the new row can be stored.


Figure 2.  Hash overflow.

For this reason, a hash organized table might need to consume more disk space than a traditional table in order to minimize overflow rows.

Hash spaces can contain only a single table and must be defined in a universal table space, either partitioned by range or partitioned by growth. The hash overflow index for a table in a partition-by-range universal table space will be a partitioned index. The hash overflow index for a table in a partition-by-growth universal table space will be a non-partitioned index.

Creating Hash Organized Tables

Hash access can be specified for a table using the organization-clause of the CREATE TABLE statement: ORGANIZE BY HASH UNIQUE (column-names) HASH SPACE (hash-space-value).

The ORGANIZE BY HASH specification tells DB2 to use hashing for organizing the data of this table. The hash key is specified in the list of column names. By specifying UNIQUE, the table cannot contain more than one row with the same hash key value. And the HASH SPACE parameter defines the amount of fixed hash space to be pre-allocated for the table. For tables defined in a partition-by-range UTS, this space is for each partition.
Caution
Exercise care when specifying UNIQUE if the hash key can contain nulls. Although it is generally considered bad design for keys to contain nulls, columns specified as a hash key can be NULL. If the hash key can contain nulls and UNIQUE is also specified, all nulls for a column are considered equal. To clarify, if the hash key is a single nullable column, it can contain only one null.

For example, consider the following SQL that would create the sample DEPT table as a hash-organized table:

CREATE TABLE DSN81010.DEPT
 (DEPTNO    CHAR(3) NOT NULL,
  DEPTNAME  VARCHAR(36) NOT NULL,
  MGRNO     CHAR(6),
  ADMRDEPT  CHAR(3) NOT NULL,
  LOCATION  CHAR(16),
  PRIMARY KEY (DEPTNO)
 )
  IN DSN8D10A.DSN8S10D
  ORGANIZE BY HASH UNIQUE (DEPTNO)
  HASH SPACE 64 M;

Take care to specify a sufficient amount of storage for the hash space in the organization-clause. By doing so, the overflow index should have a small number of entries, or better yet, no entries at all. Hashing works best for tables that are static in size with limited growth (if any).

If you choose to convert an existing table to use hashing, be sure to walk through the following steps:
  1. Analyze each table you are considering for conversion. Look for static tables with unique keys that are frequently used for single fetch queries.
  2. Before replacing any existing indexes, be sure that they are not used for range queries. Hashing is not used for range queries, so indexes that support them should not be dropped. Any index used solely for direct lookup can be dropped.
  3. Estimate the number of rows and average row size and calculate a sufficient hash space size.
  4. Issue the ALTER TABLE statement with the ADD organization-clause (see the sketch after this list).
  5. Reorganize the table space specifying AUTOESTSPACE(YES). Doing so allows DB2 to automatically estimate the best size for the hash space using real-time statistics.
  6. REBIND any applications using SQL with equality predicates on the hash key.
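
As a concrete illustration of steps 4 and 5, here is a minimal sketch using the sample DEPT table and table space names from earlier in this post; the 64 M hash space is just a placeholder that you would size from your own row count and row size estimates.

ALTER TABLE DSN81010.DEPT
  ADD ORGANIZE BY HASH UNIQUE (DEPTNO)
  HASH SPACE 64 M;

COMMIT;

REORG TABLESPACE DSN8D10A.DSN8S10D
  AUTOESTSPACE(YES) SHRLEVEL REFERENCE

The hash organization is not actually put into effect until the REORG runs, which is why step 5 follows immediately after the ALTER.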


After converting, monitor real-time statistics to evaluate the table’s hash usage efficacy. In SYSIBM.SYSTABLESPACESTATS the REORGHASHACCESS column indicates the number of times that hash access was used to access the table. Furthermore, you can compare the REORGHASHACCESS column to the REORGSCANACCESS column to determine the number of times the table was accessed by the hash versus other techniques. You can also review the HASHLASTUSED column, which contains the date when hash access was last used for SELECT, FETCH, searched UPDATE, searched DELETE, or to enforce referential constraints.
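
If you want to see those numbers side by side, a simple query against the real-time statistics will do. The following is only a sketch; the database and table space names are the sample ones used earlier and should be replaced with your own.

SELECT DBNAME, NAME, REORGHASHACCESS, REORGSCANACCESS, HASHLASTUSED
FROM   SYSIBM.SYSTABLESPACESTATS
WHERE  DBNAME = 'DSN8D10A'
AND    NAME   = 'DSN8S10D';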

In addition, be sure to verify that hash access is used where you expected. To accomplish this, check the ACCESSTYPE column of the PLAN_TABLE looking for the values H, HN, or MH—these indicate that hash access is used.
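
For example, after EXPLAINing the programs in question you could run a query along these lines (a sketch only, assuming a standard PLAN_TABLE under your own qualifier):

SELECT QUERYNO, PROGNAME, TNAME, ACCESSTYPE
FROM   PLAN_TABLE
WHERE  ACCESSTYPE IN ('H', 'HN', 'MH')
ORDER BY PROGNAME, QUERYNO;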

When to Use Hashing

Consider using hash organization for tables that are stable or predictable in size and where the data is accessed most often by its primary key (that is, the hash key). Hash-organized tables require an overflow index for rows that exceed the specified hash space. Rows in the overflow index are not hashed—DB2 must scan the index to retrieve rows in the overflow.

Table Space Size Options for Hashing

Choose the table space sizing options and parameters wisely for hash organized tables. Care must be taken to ensure that sufficient space is available for the hashing. For PBR Universal table spaces, be sure to specify an appropriate DSSIZE because it will be used to validate the hash space for each partition.

Consider specifying PRIQTY as -1 to allow DB2 to use the default primary space allocation. Furthermore, set DEFINE YES to ensure that the fixed table space is allocated successfully before the first access.

Free space also is an important consideration for hash organized table spaces. The PCTFREE specification is used by DB2 when the table space is reorganized with the AUTOESTSPACE(YES) specification. DB2 uses the value of the DATASIZE column in SYSIBM.SYSTABLESPACESTATS as the initial size of the hash table space, and increases that value by the value of PCTFREE to calculate the new size when you reorganize the table space.
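
Putting those recommendations together, a table space definition for a hash organized table might look something like the following sketch. The table space, database, and storage group names are assumptions patterned after the sample database, and the DSSIZE, MAXPARTITIONS, and PCTFREE values are placeholders to be adjusted for your own data.

CREATE TABLESPACE DEPTHASH
  IN DSN8D10A
  USING STOGROUP DSN8G100
    PRIQTY -1
    SECQTY -1
  MAXPARTITIONS 1
  DSSIZE 4 G
  DEFINE YES
  PCTFREE 10
  LOCKSIZE ANY;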


Note: This blog post was adapted from the newly published 6th edition of Craig's best-selling book, DB2 Developer's Guide.

Saturday, May 12, 2012

A Cookbook for DB2 LUW Developers


I recently came across an interesting new book on DB2 for Linux, Unix, and Windows titled IBM DB2 9.7 Advanced Application Developer Cookbook. The book shakes up the typical technology book with a cookbook metaphor and succeeds in delivering the best of both of those approaches.

The authors, Sanjay Kumar and Mohankumar Saraswatipura, have obviously used the technology and clearly explain to the reader step-by-step tactics for tackling most of the important components of the DB2 application development life cycle. The book is divided into 9 chapters, each delivering a series of practical recipes for using, creating and managing various aspects of DB2 application programming. The recipes are well-written and easy to understand with lots of supporting code to guide you as you work through the recipe. The cover of the book references “over 70 practical recipes” but it sure seems like there are more than that.

The first chapter of the book covers recipes that focus on the application development enhancements made to DB2 9.7. This is an important chapter because sometimes new features get introduced so rapidly that developers do not get the chance to learn them before the next version comes out. Indeed, with DB2 10 being released in early April 2012, it is even more important to learn what DB2 9.7 added to the mix.

Other chapters I found quite useful were Chapter 5, which covers recipes for coding Java applications against DB2 databases; and Chapters 8 and 9, which cover recipes for monitoring and tuning your DB2 applications.

Of course, the trouble with recipes is that you rely on them when you want to make something specific. A book that contained a recipe for everything you ever wanted to do with DB2 is not practical though. You can, however, use many of the recipes as starting points for beginning the “dish” you wish to “cook” and then add to the recipe the additional “flourishes” you need to make the dish your own.

Whether you are a novice or a long-time DB2 coder this book will be helpful as you design, plan, develop, and optimize your DB2 9.7 applications and databases.

Wednesday, April 25, 2012

IDUG DB2 Technical Conference 2012



The time is drawing near for this year’s North American IDUG DB2 Technical Conference. This year’s event is being held in Denver, CO the week of May 14-18, 2012. The IDUG conference planning committee has brewed up a wonderful agenda of educational sessions, presentations, and courses. Check ‘em all out online here. Whether you are a DBA, a developer, or a systems programmer, there is a plethora of learning opportunities. And there are sessions that run the gamut from DB2 for z/OS to DB2 LUW to data warehousing to Big Data to new version information and more! You can even try your hand at certification at the conference.

Even more importantly, IDUG offers a great chance to build and strengthen your DB2 network. You will meet everyone who is anyone at the IDUG conference. You can network with IBMers, consultants, and peers throughout the week at technical networking sessions and the products and services expo. And don’t forget the Dine Around where you can sign up to eat with your favorite DB2 experts.

One of my favorite stops every year is the IDUG book store, where you can browse the latest publications on DB2 and related technologies. I’m super excited about this year’s book store because the latest edition of my book, DB2 Developer’s Guide, covering through DB2 V10 will be available at the book store this year.

There is truly no other place where you can cram so much DB2 stuff into a week. So if you rely on DB2 in your professional life, the IDUG DB2 Technical Conference is the place to be.

Thursday, March 22, 2012

Working With Variable Data


One of the key design issues that is addressed in most DB2 applications is how to implement character data that varies in size from row to row. The basic trade-off is “ease-of-use and performance” versus “storage requirements.” It is possible to save storage space by using variable columns instead of placing small amounts of data in a large fixed space. Each variable column carries a 2-byte overhead, however, for storing the length of the data. Additionally, variable columns tend to increase CPU usage and can cause the update process to become inefficient. When a variable column is updated with a larger value, the row becomes larger; if not enough space is available to store the row, it must be moved to another page. This makes the update and any subsequent retrieval slower.

Follow these rules when defining variable character columns:
  • Avoid variable columns if sufficient disk space is available to store the data using fixed columns. The general rule is: disk is cheap! Trading the cost of disk for more efficient development and subsequent performance is often an easy decision to make.
  • Though this may be obvious, do not define a variable column when the data does not vary from row to row.
  • Do not define a variable column if its maximum length is less than 30 bytes. Furthermore, do not define a variable column if its maximum length is within 10 bytes of the average length of the column. Adhering to these two rules will cause you to choose VARCHAR data types only when they can potentially provide enough DASD savings to offset other costs.
  • Consider redefining variable columns by placing multiple rows of fixed length columns in another table or by shortening the columns and placing the overflow in another table.

If, after following these guidelines, VARCHAR columns need to be implemented, go ahead and do so. However, it is wise to continuously re-assess the decision to use variable character data.

So how can you monitor the effectiveness of variable columns? Well, it is possible to query the DB2 Catalog to determine the effectiveness of using VARCHAR for a column instead of CHAR. Consider, for example, the PROJNAME column of the DSN81010.PROJ table. It is defined as VARCHAR(24).

To gauge whether VARCHAR is appropriate for this particular column, consider issuing the following SQL query:

     SELECT    COL_LGTH, COUNT(*)
     FROM      (SELECT LENGTH(PROJNAME) AS COL_LGTH
                FROM   DSN81010.PROJ
               ) AS T
     GROUP BY  COL_LGTH
     ORDER BY  COL_LGTH;


This query will produce a report listing the lengths (in this case, from 1 to 24, excluding those lengths which do not occur) and the number of times that each length occurs in the table.  These results can be analyzed to determine the range of lengths stored within the variable column. 

If you are not concerned about this level of detail, the following query can be used instead to summarize the space characteristics of the variable column in question:

    SELECT  24*COUNT(*),
            SUM(2+LENGTH(PROJNAME)),
            24*COUNT(*)-SUM(2+LENGTH(PROJNAME)),
            24,
            AVG(2+LENGTH(PROJNAME)),
            24-AVG(2+LENGTH(PROJNAME))
    FROM    DSN81010.PROJ;



This query produces a report similar to the one shown below:

SPACE      SPACE        TOTAL   AVG.       AVG.         AVG.
USED AS    USED AS      SPACE   SPACE AS   SPACE AS     SPACE
CHAR(24)   VARCHAR(24)  SAVED   CHAR(24)   VARCHAR(24)  SAVED
--------   -----------  ------  ---------  -----------  -----
158058        96515     61543     24          16          8



The following list itemizes the definition for each of the individual columns calculated by this query:

  • Space Used As CHAR(24): 24*COUNT(*)
  • Space Used As VARCHAR(24): SUM(2+LENGTH(PROJNAME))
  • Total Space Saved Using VARCHAR: 24*COUNT(*)-SUM(2+LENGTH(PROJNAME))
  • Avg. Space Used As CHAR(24): 24
  • Avg. Space Used As VARCHAR(24): AVG(2+LENGTH(PROJNAME))
  • Avg. Space Saved Using VARCHAR: 24-AVG(2+LENGTH(PROJNAME))

The query can be modified to be used for any VARCHAR-defined column. The constant 24 can be changed to indicate the maximum length of the variable column as defined in the DDL. Using these tools, you can better judge the actual disk space savings accruing as a result of VARCHAR usage.

Synopsis

There are sound reasons for using variable data types within DB2 databases. Yet, it is important to remember that business conditions change and what may have been a sound reason for using VARCHAR data in the past, may no longer be sound. Use the queries in this article to determine if it is still appropriate to use VARCHAR data. In general, do not use VARCHAR for small columns or for columns whose length does not vary considerably.

Sunday, March 04, 2012

Fetching Multiple Rows

When you need to retrieve multiple rows, consider deploying a multi-row fetch to transfer more than one row using a single FETCH statement. This capability was added as of DB2 Version 8.

A multi-row FETCH retrieves multiple rows at one time into column arrays in your application program. By fetching multiple rows at once, your request can become more efficient, especially for distributed requests. The performance improvement using multi-row FETCH in general depends on several factors, such as whether the request is distributed, the number of rows to be fetched, the complexity of the SELECT statement, and the number of columns being fetched.

Nevertheless, using multi-row FETCH (even in local environments) can improve performance, with a significant reduction in CPU time possible. Tests conducted by IBM have shown performance gains of between 25% and 40% when processing 10 rows per SQL statement in programs that process a considerable number of rows. With such significant gains possible, why hasn’t everyone moved to multi-row FETCH? Well, perhaps because it requires programming changes. A multi-row FETCH requires a cursor defined with rowset positioning. A rowset is a group of rows that are operated on as a set; such a cursor enables your program to retrieve more than one row using a single FETCH statement.

To use this feature, you must DECLARE your cursor using the WITH ROWSET POSITIONING parameter. For example:

EXEC SQL
  DECLARE SAMPCURS CURSOR
  WITH ROWSET POSITIONING
  FOR
  SELECT DEPTNO
  FROM   DSN81010.DEPT
END-EXEC.

Furthermore, to use a multi-row fetch you must have defined the appropriate structures to receive multi-row data. This means you must have defined an array of host variables into which the fetched rows can be placed. Each column fetched requires its own host variable array into which its values will be placed. Be sure to match the array size to the rowset size. With the appropriate setup coded, FETCH statements can be written to retrieve more than a single row from the result set. For example

FETCH NEXT ROWSET FROM SAMPCURS
  FOR 5 ROWS
  INTO :HOSTVAR-ARRAY;

As you can see, the multiple-row fetch is nearly identical to a single-row fetch, except for two additions: the rowset fetch orientation (NEXT ROWSET in this example) and the FOR n ROWS clause. The rowset orientation specifies that the fetch operates on a rowset (instead of a single row). The FOR n ROWS clause specifies the size of the rowset to be returned. The maximum rowset size is 32,767.
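
One related tip: the last rowset fetched may contain fewer rows than you asked for, so your program should check how many rows were actually returned. A minimal sketch follows; the ROW-CNT host variable is just an illustrative name for an integer host variable you would declare yourself.

EXEC SQL
  GET DIAGNOSTICS :ROW-CNT = ROW_COUNT
END-EXEC.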

Rowset cursors are very useful when you need to retrieve many rows or large amounts of data in distributed systems. By retrieving multiple rows with a single FETCH, multiple trips between the application and the database can be eliminated, thereby improving network performance.

To learn more about multi-row FETCH consider attending my upcoming webinar on the topic. This presentation will introduce and define multi-row FETCH, how to use it, and the performance implications of doing so. The presentation will also touch upon multi-row UPDATE. And it will introduce the new SoftBase Attach Facility MRF Feature, which allows you to implement multi-row FETCH without coding changes. To attend, sign up at this link: https://www1.gotomeeting.com/register/754473201

Tuesday, February 28, 2012

Identifying Unused Indexes

Did you know that DB2 V9 added a new column to the Real Time Statistics to help identify unused indexes?

The LASTUSED column in the SYSINDEXSPACESTATS table contains a date indicating the last time the index was used. Any time the index is used to satisfy a SELECT, FETCH, searched UPDATE, searched DELETE, or to enforce a referential constraint, the date is updated.

This helps to solve the problem of determining whether or not an index is being used. Standard operating advice is to DROP or delete anything that is not going to be used. But trying to determine whether something is actually used or not can be difficult at times.

You could always query your PLAN_TABLEs or the plan and package dependency tables in the DB2 Catalog for static SQL. But what about dynamic SQL? That is more difficult. Now, as of DB2 V9, you can simply query the LASTUSED column to see when the index was last used. The LASTUSED date is by partition. So, for a partitioned index, the last used date of each partition in the index should be checked.
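
As a starting point, a query along the following lines can surface candidate indexes. This is only a sketch; I am assuming the CREATOR, NAME, and PARTITION columns here, and the 180-day threshold is arbitrary and should be adjusted to your own processing cycles.

SELECT CREATOR, NAME, PARTITION, LASTUSED
FROM   SYSIBM.SYSINDEXSPACESTATS
WHERE  LASTUSED < CURRENT DATE - 180 DAYS
ORDER BY LASTUSED;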

Of course, you will have to give it some time because you might have an index supporting a rarely used query. Most shops have queries and programs that run only quarterly, or even annually, but are nevertheless very important. You wouldn't want to drop the indexes supporting those queries just because they do not run frequently; when they do run, they matter.

Examine the LASTUSED column over time to determine which indexes are truly not being used, and then DROP the unused indexes.

Thursday, February 16, 2012

Update on DB2 Developer's Guide, 6th edition

I know a lot of my readers are waiting on the updated edition of my book, DB2 Developer's Guide, so I thought I'd post a short update on the progress. The technical edits are over and production will be starting soon. The book is now scheduled for publication in early May 2012 and can be pre-ordered on Amazon.com.



The book has been completely updated and is now up-to-date with DB2 10 for z/OS. Just think of the things that have been added to DB2 since the last time the book was updated: Universal table spaces, pureXML, SECADM, hashes, new data types, INSTEAD OF triggers, temporal support, and much, much more.

Consider pre-ordering a copy today so you'll get it as soon as it comes off the presses!

Thursday, January 26, 2012

A Forced Tour of Duty


Mainframe developers are well aware of the security, scalability, and reliability of mainframe computer systems and applications. Unfortunately, though, the bulk of new programmers and IT personnel are not mainframe literate. This should change. But maybe not for the reasons you are thinking.

Yes, I am a mainframe bigot. I readily admit that. In my humble opinion there is no finer platform for mission critical software development than the good ol’ mainframe. And that is why every new programmer should have to work a tour of duty on mainframe systems and applications after graduating from college.

Why would I recommend such a thing? Well, it is because of the robust system management processes and procedures which are in place and working extremely well within every mainframe shop in the world. This is simply not the case for Windows, Unix, and other platforms. By working on mainframe systems newbies will learn the correct IT discipline for managing mission critical software.

What do I mean by that? How about a couple of examples: It should not be an acceptable practice to just insert a CD and indiscriminately install software onto a production machine. Mainframe systems have well-documented and enforced change management procedures that need to be followed before any software is installed into a production environment.

Nor should it be acceptable to just flip the switch and reboot the server. Mainframe systems have safeguards against such practices. And mainframes rarely, if ever, need to be restarted because the system is hung or because of a software glitch. Or put in words PC dudes can understand: there is no mainframe “blue screen of death.” Indeed, months, sometimes years, can go by without having to power down and re-IPL the mainframe.

And don’t even think about trying to get around security protocols. In mainframe shops there is an entire group of people in the operations department responsible for protecting and securing mainframe systems, applications, and data. Security should not be the afterthought that it is in the Windows world.

Ever wonder why there are no mainframe viruses? A properly secured operating system and environment make such a beast extremely unlikely. And with much of the world’s most important and sensitive data residing on mainframes, don’t you think the hackers out there would just love to crack into those mainframes more frequently?

Project planning, configuration management, capacity planning, job scheduling and automation, storage management, database administration, operations management, and so on – all are managed and required in every mainframe site I’ve ever been involved with. When no mainframe is involved many of these things are afterthoughts, if they’re even thought of at all.

Growing up in a PC world is a big part of the problem. Although there may be many things to snark about with regard to personal computers, one of the biggest is that they were never designed to be used the way that mainframes are used. Yet we call a sufficiently “pumped-up” PC a server – and then try to treat it like we treat mainframes. Oh, we may turn it on its side and tape a piece of paper on it bearing a phrase like “Do Not Shut Off – This is the Production Server”… but that is a far cry from the glass house that we’ve built to nourish and feed the mainframe environment.

Now to be fair, strides are being made to improve the infrastructure and best practices for managing distributed systems. Some organizations have built an infrastructure around their distributed applications that rivals the mainframe glass house. But this is more the exception than the rule. With time, of course, the policies, practices, and procedures for managing distributed systems will improve to mainframe levels.

But the bottom line is that today’s distributed systems – that is, Linux, Unix, and Windows-based systems – typically do not deliver the stability, availability, security, or performance of mainframe systems. As such, a forced tour of duty supporting or developing applications for a mainframe would do every IT professional a whole world of good.

Tuesday, January 17, 2012

Row and Column Access Control in DB2 Version 10


Row and column access control enables you to manage access to a table at the level of a row, a column, or both. It enables you to build policies for the particulars of which data can be accessed by specific users, groups, or roles. Row access can be controlled using row permissions and column access control can be accomplished using column masks.

Row and column access control differs from multilevel security in that it is integrated into the database system. All applications and tools that use SQL to access the database are automatically subject to the same control. Sensitive data need not be filtered at the application level when row and column access control is in place.

Prior to row permissions and column masks, row and column level security was implemented in DB2 using views or stored procedures. Using views and stored procedures is a viable approach for simple requirements, but it breaks down as a solution for more complex requirements. When a large number of views are built to support your security needs, it can be difficult to administer as the views need to be updated and maintained.

Let’s see how row permissions and column masks can be used to improve upon row- and column-level security.

Row Permissions: Row Access Control

A row permission must be created and activated to be enforced. The structure of a permission will be familiar to anyone who is used to coding SQL statements. The CREATE PERMISSION statement is used to create a row permission.

Let’s consider an example using a banking system. Assume that bank tellers should only be able to access customers from their local branch, but customer service representatives (CSRs) should be allowed to access all customer data. Assume further that secondary authids are set up such that tellers have a secondary authid of TELLER and CSRs have a secondary authid of CSR. Given this scenario, the following row permissions can be created to institute these policies:

CREATE PERMISSION TELLER_ROW_ACCESS
ON     CUST
FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'TELLER') = 1
AND
BRANCH = (SELECT HOME_BRANCH
          FROM   INTERNAL_INFO
          WHERE  EMP_ID = SESSION_USER)
ENFORCED FOR ALL ACCESS
ENABLE;

COMMIT;

CREATE PERMISSION CSR_ROW_ACCESS
ON     CUST
FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'CSR') = 1
ENFORCED FOR ALL ACCESS
ENABLE;

COMMIT;



These row permissions will not be enforced, however, until they are activated by altering the table, for example:
ALTER TABLE CUST
 ACTIVATE ROW ACCESS CONTROL;

COMMIT;




With the row permissions in force, when tellers SELECT from the CUST table they will only be able to “see” customer data for their branch, whereas customer service representatives can see all customer data.

These row permission definitions use the VERIFY_GROUP_FOR_USER built-in function. This function returns 1 if any of the primary or secondary authids associated with its first argument (here, SESSION_USER) appear in the list of authorization names specified as its second argument.


Data Masking: Column Access Control

Column access control allows you to manage access to a table with filtering and data masking. As with a row permission, a column mask must be created and activated before it can be enforced. The column mask defines the rules to be used for masking values returned for a specified column.

You use the CREATE MASK statement to create a column mask. Multiple column masks can be created for a table, but each column can have only one mask. The table and column must exist before the mask can be created.

For example, you can create a mask for employee social security numbers (assuming the table name is EMP and the column name is SSN) as follows:


CREATE MASK SSNMASK
ON     EMP
FOR COLUMN SSN RETURN
  CASE
    WHEN (VERIFY_GROUP_FOR_USER(SESSION_USER, 'PAYROLL') = 1)
    THEN SSN
    WHEN (VERIFY_GROUP_FOR_USER(SESSION_USER, 'HR') = 1)
    THEN 'XXX-XX-' || SUBSTR(SSN,8,4)
    ELSE NULL
  END
ENABLE;

COMMIT;

This mask will return the actual data when accessed by a user in the PAYROLL group, a version with the first five digits masked when accessed by a user in the HR group, and NULL for anyone else. Of course, column access control must be activated for the table before any mask will be enforced:


ALTER TABLE EMP
 ACTIVATE COLUMN ACCESS CONTROL;
 
COMMIT;


Summary

Using row and column access control a security administrator can enforce detailed security policies for the databases under their control. 

Tuesday, December 20, 2011

Seasons Greetings


Here's wishing all of my readers a very happy holiday season... Be safe and enjoy the holidays and we'll see you again in 2012!

Saturday, November 19, 2011

Tune That SQL to Improve DB2 Performance!


Structured Query Language, better known as SQL, is a powerful tool for manipulating data. It is used in virtually every relational database management system on the market today, not just by DB2, but also by Oracle, Sybase, MySQL, and Microsoft SQL Server.

SQL is a high-level language that provides a greater degree of abstraction than do procedural languages. Most programming languages require that the programmer navigate data structures. The navigation information is encoded in the program and is difficult to change after it has been programmed.

SQL is different. It is designed so that the programmer can specify what data is needed, and not how to retrieve it. A DB2 application programmer will use SQL to define data selection criteria. DB2 analyzes SQL and formulates data-navigational instructions “behind the scenes.” These data-navigational instructions are called access paths. By having the DBMS determine the optimal access path to the data, a heavy burden is removed from the programmer. The database has a better understanding of the state of the data it stores, and thereby can produce a more efficient and dynamic access path to the data. The result is that SQL, used properly, can provide for quicker application development.

Quick application development is a double-edged sword. While it can mean reduced application development time and lowered costs, it can also mean that testing and performance tuning are not thoroughly done. The task of tuning the database as well as optimizing the SQL typically falls to the database administrator (DBA).

The DB2 environment and its host system can be tuned to achieve a certain level of performance improvement, but the greatest potential for performance improvement comes from analyzing the SQL code itself and making changes to improve speed and efficiency. The consensus among SQL performance experts is that 80% or more of database performance problems are caused by improperly written and un-tuned SQL.

SQL Query Tuning

SQL is not merely a query language. It can also define data structures, control access to the data, and insert, modify, and delete data. Consolidating these functions into a single language eases communication between different types of users.

SQL is, by nature, quite flexible. It uses a free-form structure that gives the user the ability to develop SQL statements in a way best suited to the given user. Each SQL request is parsed by the DBMS before execution to check for proper syntax and to optimize the request. Therefore, SQL statements do not need to start in any given column and can be strung together on one line or broken apart on several lines. Any SQL request could be formulated in a number of different but functionally equivalent ways.

SQL’s flexibility makes it intrinsically simple, but flexibility is not always a good thing when it comes to performance. Different but equivalent SQL formulations can result in extremely variable performance. In this section, we’ll talk about some of the tools within DB2 to help optimize performance and we’ll get into some of the things to watch for in the code itself.

Queries Built for Speed

When you are writing your SQL statements to access DB2 data, keep in mind the three fundamental guidelines listed in this section. These are simple, yet important rules to follow when writing your SQL statements. Of course, SQL performance is a complex topic and to understand every nuance of how SQL performs can take a lifetime. That said, adhering to the following simple rules puts you on the right track to achieving high-performing DB2 applications.

  1. Always provide only the exact columns that you need to retrieve in the SELECT-list of each SQL SELECT statement.
    Another way of stating this is “do not use SELECT *”. The shorthand SELECT * means retrieve all columns from the table(s) being accessed. This is fine for quick and dirty queries but is bad practice for inclusion in application programs because:
    • DB2 tables may need to be changed in the future to include additional columns. SELECT * will retrieve those new columns, too, and your program may not be capable of handling the additional data without requiring time-consuming changes.
    • DB2 will consume additional resources for every column that is requested to be returned. If the program does not need the data, it should not ask for it. Even if the program needs every column, it is better to explicitly ask for each column by name in the SQL statement for clarity and to avoid the previous pitfall.
Of course, simply avoiding SELECT * is not sufficient. You also have to avoid returning certain columns…
 
  2. Do not ask for what you already know.
    This may sound simplistic, but most programmers violate this rule at one time or another. For example, consider what is wrong with this simple query:

    SELECT   LASTNAME, FIRST_NAME, JOB_CODE, DEPTNO
    FROM      EMP
    WHERE   JOB_CODE = 'A'
    AND         DEPTNO =  'D01';

Look at the SELECT-list. There are four columns specified but only two of them are needed. We know that JOB_CODE will be A and DEPTNO will be D01 because we told DB2 to only return those rows using the WHERE clauses. Every column that DB2 has to access and return to our program adds overhead. Yes, it is a small amount of overhead here, but this statement may be run hundreds, or even thousands, of times a day. And that small overhead adds up to significant overhead.

  3. Use the WHERE clause to filter data in the SQL instead of bringing it all into your program to filter.
    This too is a common rookie mistake. It is much better for DB2 to filter the data before returning it to your program. This is so because DB2 uses additional I/O and CPU resources to obtain each row of data. The fewer rows passed to your program, the more efficient your SQL will be.

    Look for IF-THEN-ELSE logic or CASE statements immediately following the FETCH statements in your application programs. If the conditional logic is analyzing columns that you just retrieved from DB2, try to remove those tests from the host language code and instead build them into the WHERE clauses of your SQL statements, as shown in the sketch below. Doing so will improve performance.
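
    As a simple illustration (a sketch reusing the EMP columns from the earlier example), instead of retrieving every row and testing the job code in the program:

    SELECT   EMPNO, LASTNAME, SALARY, JOB_CODE
    FROM     EMP;

    ...and coding IF JOB-CODE = 'A' after each FETCH, let DB2 do the filtering:

    SELECT   EMPNO, LASTNAME, SALARY
    FROM     EMP
    WHERE    JOB_CODE = 'A';

    Only the rows (and columns) the program actually needs are returned, which reduces both I/O and CPU.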

Follow good SQL coding practices (like these three guidelines), and you’ll start seeing a performance improvement in your DB2 applications. To further tune the code, you’ll need to understand how to leverage the optimizer, update statistics, and manage indexes.

Leveraging the Optimizer

The optimizer is the heart and soul of DB2. It analyzes SQL statements and determines the most efficient access path available for satisfying each statement. It accomplishes this by parsing the SQL statement to determine which tables and columns must be accessed. It then queries system information and statistics stored in the DB2 system catalog to determine the best method of accomplishing the tasks necessary to satisfy the SQL request.

The optimizer is essentially an expert system for accessing DB2 data. An expert system is a set of standard rules that when combined with situational data can return an expert opinion. For example, a medical expert system takes the set of rules determining which medication is useful for which illness, combines it with data describing the symptoms of ailments, and applies that knowledge base to a list of input symptoms. The DB2 optimizer renders expert opinions on data retrieval methods based on the situational data housed in DB2’s system catalog and a query input in SQL format.

The notion of optimizing data access in the DBMS is one of the most powerful capabilities of DB2. Remember, access to DB2 data is achieved by telling DB2 what to retrieve, not how to retrieve it. Regardless of how the data is physically stored and manipulated, DB2 and SQL can still access that data. This separation of access criteria from physical storage characteristics is called physical data independence. DB2’s optimizer is the component that accomplishes this physical data independence.

If indexes are removed, DB2 can still access the data (albeit less efficiently). If a column is added to the table being accessed, the data can still be manipulated by DB2 without changing the program code. This is all possible because the physical access paths to DB2 data are not coded by programmers in application programs, but are generated by DB2.
Compare this with non-DBMS systems in which the programmer must know the physical structure of the data. If there is an index, the programmer must write appropriate code so that the index is used. If the index is removed, the program will not work unless changes are made. Not so with DB2 and SQL. All this flexibility is attributable to DB2’s capability to optimize data manipulation requests automatically.

The optimizer performs complex calculations based on a host of information. To simplify the functionality of the optimizer, you can picture it as performing a four-step process:

  1. Receive and verify the syntax of the SQL statement.
  2. Analyze the environment and optimize the method of satisfying the SQL statement.
  3. Create machine-readable instructions to execute the optimized SQL.
  4. Execute the instructions or store them for future execution.

The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way?

The optimizer has many types of strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the actual, in-depth details of how the optimizer determines the best access path, but the optimizer is a cost-based optimizer. This means that the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer applies query cost formulas that evaluate and weigh four factors for each potential access path: the CPU cost, the I/O cost, statistical information in the DB2 system catalog, and the actual SQL statement.
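
You can see the result of this process for any given statement by using EXPLAIN and then examining the PLAN_TABLE. A minimal sketch follows; the QUERYNO value is arbitrary and the PLAN_TABLE is assumed to already exist under your authid.

EXPLAIN PLAN SET QUERYNO = 100 FOR
  SELECT LASTNAME, SALARY
  FROM   EMP
  WHERE  EMPNO = '000010';

SELECT QUERYNO, METHOD, ACCESSTYPE, MATCHCOLS, ACCESSNAME
FROM   PLAN_TABLE
WHERE  QUERYNO = 100
ORDER BY QBLOCKNO, PLANNO;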

The Importance of Statistics

Without the statistics stored in DB2’s system catalog, the optimizer will have a difficult time optimizing anything. These statistics provide the optimizer with information about the state of the tables that will be accessed by the SQL statement that is being optimized. The types of statistical information stored in the system catalog include:

  • Information about tables including the total number of rows, information about compression, and total number of pages.
  • Information about columns including number of discrete values for the column and the distribution range of values stored in the column.
  • Information about table spaces including the number of active pages.
  • Current status of the index including whether an index exists or not, the organization of the index (number of leaf pages and number of levels), the number of discrete values for the index key, and whether the index is clustered.
  • Information about the table space and partitions.

Statistics are gathered and stored in DB2’s system catalog when the RUNSTATS utility is executed. Be sure to work with your DBA to ensure that statistics are accumulated at the appropriate time, especially in a production environment.
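
For reference, a basic RUNSTATS control statement looks something like the following sketch. The table space name is from the DB2 sample database; your DBA will likely add further options (for example, column group or histogram statistics).

RUNSTATS TABLESPACE DSN8D10A.DSN8S10E
  TABLE(ALL) INDEX(ALL)
  SHRLEVEL REFERENCE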

Index for Performance

Perhaps the single most important thing that can be done to assure optimal DB2 application performance is creating correct indexes for your tables based on the queries used by your applications. Of course, this is easier said than done. But we can start with some basics. For example, consider the following SQL statement:

    SELECT   LASTNAME, SALARY
    FROM      EMP
    WHERE   EMPNO = '000010'
    AND         DEPTNO =  'D01';

What index or indexes would make sense for this simple query? The short answer is “it depends.” Let’s discuss what it depends upon! First, think about all of the possible indexes that could be created. Your first short list probably looks something like this:
  • Index1 on EMPNO
  • Index2 on DEPTNO
  • Index3 on EMPNO and DEPTNO
This is a good start, and Index3 is probably the best of the lot. It allows DB2 to use the index to immediately look up the row or rows that satisfy the two simple predicates in the WHERE clause. Of course, if you already have a lot of indexes on the EMP table you might want to examine the impact of creating yet another index on the table. Factors to consider include:
  • Modification impact
  • Columns in the existing indexes
  • Importance of a particular query

Modification Impact
DB2 will automatically maintain every index that you create. This means that every INSERT and every DELETE to this table will cause data to be inserted and deleted not just from the table, but also from its indexes. And if you UPDATE the value of a column that is in an index, the index will also be updated. So, indexes speed the process of retrieval but slow down modification.

Columns in the Existing Indexes
If an index already exists on EMPNO or DEPTNO it might not be wise to create another index on the combination. However, it might make sense to change the other index to add the missing column. But not always, because the order of the columns in the index can make a big difference in access path selection and performance, depending on the query. Furthermore, if indexes already exist for both columns, DB2 potentially can use them both to satisfy this query, so creating another index may not be necessary.


Importance of this Particular Query
The more important the query, the more you may want to tune by index creation. For example, if you are coding a query that will be run every day by the CIO, you will want to make sure that it performs optimally. Who wants to risk a call from the CIO complaining about performance? So building indexes for that particular query is very important. On the other hand, a query for a low-level clerk may not necessarily be weighted as highly, so that query may have to make do with the indexes that already exist. Of course, the decision will depend on the importance of the application to the business – not just on the importance of the user of the application. An additional criterion to factor into your decision is how often the query is run. The more frequently the query needs to be executed during the day, the more beneficial it becomes to create an index to optimize it.

There is much more to index design than we have covered so far. For example, you might consider index overloading to achieve index-only access. If all of the data that a SQL query asks for is contained in the index, DB2 may be able to satisfy the request using only the index. Consider our previous SQL statement. We asked for LASTNAME and SALARY given information about EMPNO and DEPTNO. And we also started by creating an index on the EMPNO and DEPTNO columns. If we include LASTNAME and SALARY in the index as well, then we never need to access the EMP table because all of the data we need exists in the index. This technique can significantly improve performance because it cuts down on the number of I/O requests.
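
To make that concrete, here is a sketch of the two variations just discussed. The index names are made up, and I am assuming the sample DSN81010.EMP table; adjust both to your own environment. First, a matching index on just the predicate columns:

CREATE INDEX XEMP_ED
  ON DSN81010.EMP (EMPNO, DEPTNO);

And an overloaded version that also carries the selected columns, enabling index-only access for the earlier query:

CREATE INDEX XEMP_ED_IO
  ON DSN81010.EMP (EMPNO, DEPTNO, LASTNAME, SALARY);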

Keep in mind, though, that it is not prudent (or even possible) to make every query an index-only access. This technique should be saved for particularly troublesome or important SQL statements. And you should always examine the impact on other queries and programs when deciding whether to add columns to any index.

Summary

Properly tuned SQL and a well-tuned DB2 environment can yield noticeable performance improvements. These can mean faster response time for DB2 applications, a better user experience, and faster throughput. The key is a combination of programming practice, system optimization, and effective use of software tools to automate simulation and code analysis.  

Wednesday, October 26, 2011

IBM Information on Demand 2011: Day Four (#IODGC)

The highlight of my fourth day at the IOD conference in Las Vegas was the general session with Michael Lewis and Billy Beane.

Billy Beane is the general manager (as well as a minority owner) of the Oakland Athletics. Michael Lewis is the author of the book Moneyball, which outlines how Beane revolutionized baseball analytics by focusing on different statistics than the traditional RBI and batting average. Indeed, the A's analysis showed that on-base percentage and slugging percentage were better predictors of offensive success, and therefore translated into more wins. Additionally, because other teams were not focusing on those stats, it would be easier for a small market team like the A's to acquire talent based on them and compete with the "big boys" like the New York Yankees and Boston Red Sox.

Lewis and Beane were informative and entertaining. Lewis started with a funny tale about waiting to talk to the A's players and seeing them as they walked naked from the showers. He said if you just lined these guys up naked against a wall, you'd never think they were professional athletes. When Lewis mentioned this to Beane, Beane replied that that was basically the point. He told him "We're in the market for players whose value the market doesn't grasp..."


After this conversation Lewis continued to observe the team operations for a while. And the light bulb came on. Lewis told Beane "Aha... I see what you are doing here. You are arbitraging the mispricing of baseball players." He recognized it because he had covered Wall Street in the past.

When asked if it took courage to rely on the statistics like he did, Beane countered that it really didn't. With a small market team he had no money to compete against the major market teams using traditional measurement analytics. So, it made sense to use the new statistics that were backed up by rigorous analytics and compete in a non-traditional way.

Beane also discussed how baseball tends to get the 8 best teams in the playoffs each year because they play 162 games and the better teams tend to win over longer periods of time. But in the post season, with best of 5 or best of 7 series, anybody can win. The nugget of wisdom passed on by Beane in this story: "Never make decisions based on short term results." To make his point, Beane said that this year most people would agree that the Philadelphia Phillies were the best team in baseball... but they lost to the St. Louis Cardinals in a best of 5 series in the National League Division Series.

The interview with Lewis and Beane tied in well to the overall theme of the IOD conference, which focused on gaining insight from information through analytics. And that is what Beane achieved and Lewis documented in Moneyball (which is now a major motion picture showing at a theatre near you).

Speaking of motion pictures, not major ones by Hollywood standards but perhaps by DB2 users standards, be sure to keep checking in on the daily IOD video blog that I am hosting at http://www.softbase.com/blog. Today's video blog offers up an interview with Suresh Sane, Database Architect at DST Systems in Kansas City and three-time best user speaker at IDUG.