Friday, May 09, 2014

IDUG DB2 Tech Conference in Phoenix, Arizona

Today's blog post is just a quick one to remind everybody about the 2014 IDUG DB2 Tech Conference coming up next week in Phoenix, Arizona. 

As usual, there will be a plethora of useful information shared at the conference - as well as the opportunity to meet with developers, gold consultants, and DBAs from all over the country... and even the world!

There are complimentary workshop sessions scheduled for Friday on a variety of topics and you will probably also want to sign up for one of the educational seminars on Monday. And don't forget to take advantage of the free IBM certification exams offered up at every IDUG conference.

This year's keynote presentation will be delivered by noted Gartner analyst Donald Feinberg, who is responsible for Gartner's research on database management systems, data warehousing infrastructure and Big Data.

And just check out the impressive agenda of DB2 presentations that will be available at this year's IDUG! 

Add to all of the above an array of Special Interest Groups, the DB2 z/OS and LUW panel sessions (where you can ask questions of the developers), and let's not forget the vendor exhibition where you can check out the latest products and services for DB2... and it all adds up to IDUG being the place to be next week if you have anything to do with DB2. I know I'll be there...

...and I hope to see you there, too.

Monday, May 05, 2014

Using Check Constraints to Simulate Domains

Check constraints are a very useful, though somewhat underutilized feature of DB2. Check constraints enable enhanced data integrity without requiring procedural logic (such as in stored procedures and triggers). Let’s examine the basics of table check constraints.
A constraint is basically a restriction placed upon the data values that can be stored in a column or columns of a table. Of course, most RDBMS products provide several different types of constraints, such as referential constraints (to define primary and foreign keys) and unique constraints (to prohibit duplicates).
Check constraints place specific data value restrictions on the contents of a column through the specification of a Boolean expression. The expression is explicitly defined in the table DDL and is formulated in much the same way that SQL WHERE clauses are formulated. Any attempt to modify the column data (i.e. during INSERT and UPDATE processing) will cause the expression to be evaluated. If the modification conforms to the Boolean expression, the modification is permitted to continue.  If not, the statement will fail with a constraint violation.

This functionality is great for simulating the relational concept of a domain. A domain is basically the set of valid values that a column or data type can take on. Check constraints only simulate domains, though, because domains provide features that check constraints do not. One such feature is that columns drawn from different domains cannot be compared to one another, or operated on by expressions that require both operands to be of the same type. For domains to truly be supported, the DBMS should provide both check constraints and user-defined data types with strong type checking. This prevents nonsensical operations, such as comparing IQ to shoe size or adding Australian dollars to euros.
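As a quick sketch of how the two halves fit together, consider DB2 distinct types (the type, table, and column names below are invented purely for illustration):

CREATE DISTINCT TYPE AUSSIE_DOLLAR AS DECIMAL(11,2);
CREATE DISTINCT TYPE EURO          AS DECIMAL(11,2);

CREATE TABLE SALES_LEDGER
      (SALE_ID    INTEGER NOT NULL,
       AMT_AUD    AUSSIE_DOLLAR,
       AMT_EUR    EURO);

Because the two amount columns are strongly typed, an expression such as AMT_AUD + AMT_EUR is rejected unless the values are explicitly cast to a common type; a check constraint on each column can then restrict the range of values permitted within each "domain."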

Forming Check Constraints

Check constraints are written using recognizable SQL syntax. This makes them easy to implement for anyone who has even a passing familiarity with SQL. The check constraint consists of two components: a constraint name and a check condition. 

The constraint name is an SQL identifier and is used to reference or identify the constraint. The same constraint name cannot be specified more than once for the same table. If a constraint name is not explicitly coded, DB2 will create a unique name automatically for the constraint. 

The check condition defines the actual constraint logic.  The check condition can be defined using any of the basic predicates (>, <, =, <>, <=, >=), as well as BETWEEN, IN, LIKE, and NULL.  Furthermore, AND and OR can be used to string conditions together.

There are, however, restrictions on how check constraints are formulated. Some of these restrictions include:
·         Limitations on the entire length of the check condition.
·         Other tables may not be accessed in the check condition.
·         Only a limited subset of SQL operations are permitted (for example subselects and column functions are prohibited in a check constraint).
·         One of the operands (usually the first) of the check constraint must be the name of a column contained in the table for which the constraint is defined.
·         The other operand (usually the second) must be either another column name in the same table or a constant value.
·         If the second operand is a constant, it must be compatible with the data type of the first operand.  If the second operand is a column, it must be the same data type as the first column specified.

Check Constraint Examples

Check constraints enable the DBA or database designer to specify more robust data integrity rules directly into the database.  Consider the following example:

CREATE TABLE EMP
      (EMPNO        INTEGER
           CONSTRAINT CHECK_EMPNO
           CHECK (EMPNO BETWEEN 100 AND 25000),
       EMP_ADDRESS  VARCHAR(70),
       EMP_TYPE     CHAR(8)
           CHECK (EMP_TYPE IN ('TEMP', 'FULLTIME', 'CONTRACT')),
       EMP_DEPT     CHAR(3)       NOT NULL WITH DEFAULT,
       SALARY       DECIMAL(7,2)  NOT NULL
           CONSTRAINT CHECK_SALARY
           CHECK (SALARY < 50000.00),
       COMMISSION   DECIMAL(7,2),
       BONUS        DECIMAL(7,2)
      );

The CREATE statement for the EMP table contains three different check constraints:

1.      The name of the first check constraint for the EMP table is CHECK_EMPNO.  It is defined on the EMPNO column.  The constraint ensures that the EMPNO column can contain values that range from 100 to 25000 (instead of the domain of all valid integers).
2.      The second check constraint for this table is on the EMP_TYPE column.   This is an example of an unnamed constraint.  Though this is possible, it is not recommended.  It is best to always provide an explicit constraint name in order to ease identification and administration.  This specific constraint restricts the values that can be placed into EMP_TYPE as: 'TEMP', 'FULLTIME', and 'CONTRACT';  no other values would be accepted.
3.      The last check constraint on this table is named CHECK_SALARY.  It effectively ensures that no employee can be entered with a salary of more than $50,000. (Now who would want to work there?)

Column vs. Table Level Constraints

The first check constraint example we reviewed showed a column-level check constraint. However, check constraints also may be coded at the table-level. A column-level check constraint is defined in the DDL immediately after the column. Appropriately enough, a table-level check constraint is defined after all of the columns of the table have already been defined.

It is quite common for business rules to require access to multiple columns within a single table.  When this situation occurs, it is wise to code the business rule into a check constraint at the table-level, instead of at the column level.  Of course, any column-level check constraint can also be defined at the table-level, as well.  In terms of functionality, there is no difference between an integrity constraint defined at the table-level and the same constraint defined at the column-level.

Let’s augment our sample table DDL to add two table-level check constraints:

CREATE TABLE EMP
      (EMPNO        INTEGER
           CONSTRAINT CHECK_EMPNO
           CHECK (EMPNO BETWEEN 100 AND 25000),
       EMP_ADDRESS  VARCHAR(70),
       EMP_TYPE     CHAR(8)
           CHECK (EMP_TYPE IN ('TEMP', 'FULLTIME', 'CONTRACT')),
       EMP_DEPT     CHAR(3)       NOT NULL WITH DEFAULT,
       SALARY       DECIMAL(7,2)  NOT NULL
           CONSTRAINT CHECK_SALARY
           CHECK (SALARY < 50000.00),
       COMMISSION   DECIMAL(7,2),
       BONUS        DECIMAL(7,2),

           CONSTRAINT COMM_VS_SALARY
           CHECK (SALARY > COMMISSION),

           CONSTRAINT COMM_BONUS
           CHECK (COMMISSION > 0 OR BONUS > 0)
      );



The CREATE statement for the EMP table has been modified to contain two table-level check constraints having the following ramifications:

1.      The name of the first table-level check constraint for the EMP table is COMM_VS_SALARY.  This constraint will ensure that no employee can earn more commission than salary.
2.      The second table-level check constraint is named COMM_BONUS.   This constraint will ensure that every employee either earns a commission or a bonus (or possibly, both).

Check Constraint Benefits

So what are the benefits of check constraints? The primary benefit is the ability to enforce business rules directly in each database without requiring additional application logic. Once defined, the business rule is physically implemented and cannot be bypassed. Check constraints also provide the following benefits:
·         Because there is no additional programming required, DBAs can implement check constraints without involving the application programming staff. This effectively minimizes the amount of code that must be written by the programming staff.  With the significant application backlog within most organizations, this can be the most crucial reason to utilize check constraints.
·         Check constraints provide better data integrity. Because check constraints are always executed whenever data in the column on which they are defined is modified, the business rule is not bypassed during ad hoc processing and dynamic SQL. When business rules are enforced using application programming logic instead, the rules cannot be checked during ad hoc processes.
·         Check constraints promote consistency. Because they are implemented once, in the table DDL, each constraint is always enforced. Constraints written in application logic, on the other hand, must be executed within each program that modifies any data to which the constraint applies. This can cause code duplication and inconsistent maintenance resulting in inaccurate business rule support.
·         Typically check constraints coded in DDL will outperform the corresponding application code.

The overall impact of check constraints will be to increase application development productivity while at the same time promoting higher data integrity.

Check Constraints, NULLs, and Defaults

An additional consideration for check constraints is the relational NULL. Any nullable column also defined with a check constraint can be set to null. When the column is set to null, the check constraint evaluates to unknown.  Because null indicates the lack of a value, the presence of a null will not violate the check constraint.
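For example, given the sample EMP table defined earlier (the column values here are invented purely for illustration), the following insert succeeds:

INSERT INTO EMP
      (EMPNO, EMP_TYPE, EMP_DEPT, SALARY, COMMISSION, BONUS)
VALUES
      (1000, 'FULLTIME', 'A00', 40000.00, NULL, 500.00);

The COMM_BONUS condition, CHECK (COMMISSION > 0 OR BONUS > 0), evaluates to true because BONUS is positive. Had both COMMISSION and BONUS been NULL, the condition would have evaluated to unknown, and the insert would still have succeeded because unknown does not violate the constraint.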

Additionally, DB2 provides the ability to specify defaults for table columns – both system-defined defaults (pre-defined and automatically set by the DBMS) and user-defined defaults. When a row is inserted or loaded into the table and no value is specified for the column, the column will be set to the value that has been identified in the column default specification.  For example, we could define a default for the EMP_TYPE column of our sample EMP table as follows:

        EMP_TYPE     CHAR(8)      DEFAULT 'FULLTIME'
             CHECK (EMP_TYPE IN ('TEMP', 'FULLTIME', 'CONTRACT')),

If a row is inserted without specifying an EMP_TYPE, the column will default to the value, ‘FULLTIME’.
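For example, given that default, an insert that omits EMP_TYPE altogether (again, the values are invented for illustration) works as follows:

INSERT INTO EMP
      (EMPNO, EMP_DEPT, SALARY)
VALUES
      (2000, 'B01', 35000.00);

DB2 supplies 'FULLTIME' for EMP_TYPE, which also satisfies the check constraint defined on that column.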

A problem can arise when using defaults with check constraints. Most DBMS products do not perform semantic checking on constraints and defaults. The DBMS, therefore, will allow the DBA to define defaults that contradict check constraints.  Furthermore, it is possible to define check constraints that contradict one another. Care must be taken to avoid creating this type of problem. 

Examples of contradictory constraints are depicted below:

CHECK (EMPNO > 10 AND EMPNO < 9)

In this case, no value is both greater than 10 and less than 9, so nothing could ever be inserted.

EMP_TYPE     CHAR(8)      DEFAULT 'NEW'
     CHECK (EMP_TYPE IN ('TEMP', 'FULLTIME', 'CONTRACT')),

In this case, the default value is not one of the permitted EMP_TYPE values according to the defined constraint, so any insert that relies on the default will fail with a constraint violation.

CHECK (EMPNO > 10)
CHECK (EMPNO >= 11)

In this case, the constraints are redundant.  No logical harm is done, but both constraints will be checked, thereby impacting the performance of applications that modify the table in which the constraints exist.

Other potential semantic problems can occur if the check constraint contradicts a referential integrity DELETE or UPDATE rule, if two constraints are defined on the same column with contradictory conditions, or if the constraint requires that the column be NULL but the column is defined as NOT NULL.

Other Potential Hazards

Take care when using the LOAD utility on a table with check constraints defined on it. By specifying the ENFORCE NO parameter you can permit DB2 to load data that does not conform to the check constraints (as well as the referential constraints). Although this eases the load process by enabling DB2 to bypass constraint checking, it will place the table space into a check pending state. You can run CHECK DATA to clear this state (or force the check pending status off by using START with the FORCE option or the REPAIR utility). If you do not run CHECK DATA, constraint violations may go undetected, leaving dirty data in the table.
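As a rough sketch of the sequence (the data set and table space names are placeholders, and the exact utility options vary by release), the flow looks something like this:

LOAD DATA INDDN SYSREC RESUME YES
     ENFORCE NO
     INTO TABLE EMP

CHECK DATA TABLESPACE MYDB.MYTS
     SCOPE PENDING

The LOAD bypasses constraint checking and leaves the table space in check pending status; CHECK DATA then validates the rows and resets the status (or reports the violations).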

Summary


Check constraints provide a very powerful vehicle for supporting business rules in the database. They can be used to simulate relational domains. Because check constraints are non-bypassable, they provide better data integrity than corresponding logic programmed into the application. It is a wise course of action to use check constraints in your database designs to support data integrity, domains, and business rules in all of your relational database applications. 

Monday, April 21, 2014

A Little Bit About LOBs

In today's blog post we will take a brief look at LOBs, or Large OBjects, in DB2. I have been prepping for my webinar later this week, titled Bringing Big Data to DB2 for z/OS with LOBs: Understanding, Using, and Managing DB2 LOBs. Be sure to click on the link for that and join me on April 24, 2014 for the webinar!

But back to the topic du jour... LOBs. Let's start with a bit of history. LOBs were added to relational products like DB2, back in the 1990s, ostensibly to compete more capably against the Object-Oriented databases of the time. Remember them? Back then it seemed that everybody thought OO DBMS products would supplant relational technology. Seems somewhat quaint now, doesn't it?

At any rate, the idea was to enable relational tables to house unstructured data like images and audio and video data. DB2 for z/OS users were slow to adopt LOBs in their mainframe databases. I think that is due to several different reasons, not the least of which is that it took several new versions of DB2 for LOBs to mature into capable, usable things for enterprise applications. Early implementations of LOBs in DB2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of the deficiencies over time and there are now tools that can help us to effectively manage DB2 LOBs, too.

The other new force driving LOB usage is the whole Big Data movement. Big Data is a force that is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. The most common definition of Big Data was coined by Forrester Research defining big data in terms of “The 4 V’s” -- volume, velocity, variety, variability. But that is somewhat limiting. However, I do not want to turn this posting into a definition of big data, so... let's just think of Big Data as MORE DATA, MORE TYPES OF DATA, and FASTER GENERATION OF DATA. One example of Big Data driving the usage of LOBs in DB2 is the JSON support that has been added to DB2. JSON objects are stored in DB2 as BLOBs. 

So LOBs have been around for awhile now, but it has taken some time for them to gain significant levels of usage "out there!" 

Why LOBs, you may ask? Why not just store the data in regular old DB2 data types like VARCHAR or VARGRAPHIC? The basic answer is that DB2's data types were not large enough to hold this amount of data, because of their limit of 32 KB. Some multimedia data can get VERY large. For example, a high resolution video requires about 3 GB for each hour. And high-def TV video requires 720 GB/hour!

But an additional part of the answer is that you do not always want to access the LOB data with the traditional data. For example, if you are querying employee information you probably don't always want their photo or their entire resume... so the data can be stored in a LOB, external from the rest of the data (so as not to impede performance) but connected to the data (when you need to access it).

There are three types of LOBs supported by DB2:
1. BLOB – Binary Large Object – for binary data
2. CLOB – Character Large Object – for text data
3. DBCLOB – Double Byte Character Large Object – for graphic character data

Furthermore, there are two options for specifying LOBs in DB2 tables:

  1. The traditional way to define LOBs where the LOB data is stored separately from the rest of the data. This requires defining additional database objects.
  2. A newer method, for smaller LOBs, introduced with DB2 10, called Inline LOBs.

Using the traditional method, LOBs are defined such that they are stored external to the base table. The column is specified as a BLOB, CLOB or DBCLOB in the base table, but DB2 stores the actual LOB data in an auxiliary table in a LOB table space. You must define a LOB table space per LOB per partition. So, let’s say we are creating a table with 2 LOBs in a table space with ten (10) partitions. That means we will need to define 20 LOB table spaces – 10 for the first LOB (one per partition) and 10 for the second LOB (one for each partition). An auxiliary table is defined for each LOB table space to hold the LOB data. And an auxiliary index is required on each auxiliary table. 

Now back to the base table. Remember that we have defined the LOB columns in the base table. But we also need to include a ROWID column in the base table. Only one ROWID column is needed per base table no matter how many LOBs you have defined in the table.
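Putting the traditional pieces together, the DDL winds up looking something like the sketch below (a single-partition example with invented object names):

CREATE TABLE EMP_PHOTO_RESUME
      (EMPNO       CHAR(6) NOT NULL,
       EMP_ROWID   ROWID NOT NULL GENERATED ALWAYS,
       RESUME      CLOB(1M))
   IN MYDB.MYBASETS;

CREATE LOB TABLESPACE MYLOBTS
   IN MYDB;

CREATE AUXILIARY TABLE EMP_RESUME_AUX
   IN MYDB.MYLOBTS
   STORES EMP_PHOTO_RESUME
   COLUMN RESUME;

CREATE UNIQUE INDEX XEMP_RESUME_AUX
   ON EMP_RESUME_AUX;

The base table carries the CLOB column and the ROWID; the LOB table space, auxiliary table, and auxiliary index hold and locate the actual LOB data.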


Don’t confuse the ROWID with other concepts that may seem to be similar. A ROWID is not an identity column nor is it a SEQUENCE. The ROWID is used by DB2 behind the scenes to connect the base table row to the auxiliary table row that contains the LOB data. You generally will not access the ROWID, although it is possible to do so just like any other column using SQL. With the ROWID you can perform direct row access. This is so because the row ID value implicitly contains the location of the row. 

Finally, each LOB column also has a 2 byte version number associated with it. This is used by DB2 when LOB data is modified. You cannot access the version number.

Now as of DB2 10 for z/OS, you can create inline LOBs in your DB2 tables. But what is an inline LOB? Well, at a high level, the name is somewhat self-explanatory. The LOB data for inline LOBs is stored with the rest of the base table.

But it is really a bit more nuanced than that. You can store part of the LOB inline with the rest of the data and the remainder externally, if you so choose. So an inline LOB can have all of the LOB data stored inline with the rest of the data (if it is small enough) or it can store only a portion with the rest of the data. This can be a great technique to use if you have some applications that require perhaps only the first 500 bytes of a larger LOB. Those 500 bytes can be stored inline, with the rest of the data, while the remainder is stored externally and accessed only when needed. You can create an inline LOB by specifying the INLINE LENGTH clause on your CREATE TABLE statement.

Inline LOBs can improve performance depending upon the type of access required. Consider the example where RESUME data is stored in a CLOB. The first 200 bytes of the resume are accessed most of the time, with the entire resume being accessed only during rare instances (e.g. interview, performance reviews, etc.) By storing those first 200 bytes inline with the rest of the data we can eliminate the I/O to the LOB in the auxiliary table, thereby improving performance. External LOB data is not buffered, but inline LOB data is – this too can impact performance.
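For example, a hedged sketch of DDL for that scenario (the object names are invented) might look like:

CREATE TABLE EMP_RESUME
      (EMPNO       CHAR(6) NOT NULL,
       EMP_ROWID   ROWID NOT NULL GENERATED ALWAYS,
       RESUME      CLOB(1M) INLINE LENGTH 200)
   IN MYDB.MYUTS;

With this definition the first 200 bytes of each resume are stored (and buffered) with the base row; only access beyond those 200 bytes requires a trip to the auxiliary LOB objects. Keep in mind that inline LOBs also come with prerequisites, such as the base table residing in a universal table space.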

This blog entry is getting a little longer than I was expecting, so I am going to cut it off here. We've reviewed what LOBs are, a little bit of their history, and discussed a bit about both traditional and inline LOBs. To learn more, be sure to join me for the webinar on April 24th (or watch it later - it will be recorded).

Monday, April 14, 2014

Aggregating Aggregates Using Nested Table Expressions


Sometimes when you are writing your SQL to access data you come across the need to work with aggregates. Fortunately, SQL offers many simple ways of aggregating data. But what happens when you uncover the need to perform aggregations of aggregates?

What does that mean? Well, consider an example. Let's assume that you  want to compute the average of a sum. This is a reasonably common requirement that comes up frequently in applications that are built around sales amounts. Let's say that we have a table containing sales information, where each sales amount has additional information indicating the salesman, region, district, product, date, etc. 

A common requirement is to produce a report of the average sales by region for a particular period, say the first quarter of 2014. But the data in the table is at a detail level, meaning we have a row for each specific sale.

A novice SQL coder might try to write a query with a function inside a function, like AVG(SUM(SALE_AMT)). Of course, this is invalid SQL syntax. DB2 will not permit the nesting of aggregate functions. But we can use nested table expressions and our knowledge of SQL functions to build the correct query.

Let’s start by creating a query to return the sum of all sales by region for the time period in question. That query should look something like this:

SELECT REGION, SUM(SALE_AMT) AS TOTAL_SALES
FROM   SALES
WHERE  SALE_DATE BETWEEN DATE('2014-01-01')
                 AND     DATE('2014-03-31')
GROUP BY REGION;


Now that we have the total sales by region for the period in question, we can embed this query into a nested table expression in another query like so:

SELECT NTE.REGION, AVG(NTE.TOTAL_SALES)
FROM  (SELECT REGION, SUM(SALE_AMT) AS TOTAL_SALES
       FROM   SALES
       WHERE  SALE_DATE BETWEEN DATE('2014-01-01')
                        AND     DATE('2014-03-31')
       GROUP BY REGION) AS NTE
GROUP BY NTE.REGION;


And voila! We have aggregated an aggregate (averaged a sum)...

Sunday, April 06, 2014

DB2 Buffer Pool Monitoring

After setting up your buffer pools, you will want to regularly monitor your configuration for performance. The most rudimentary way of doing this is using the -DISPLAY BUFFERPOOL command. There are many options of the DISPLAY command that can be used to show different characteristics of your buffer pool environment; the simplest is the summary report, requested as follows:

-DISPLAY BUFFERPOOL(BP0) LIST(*) DBNAME(DSN8*)

And a truncated version of the results will look something like this:

DSNB401I - BUFFERPOOL NAME BP0, BUFFERPOOL ID 0, USE COUNT 20
DSNB402I - VIRTUAL BUFFERPOOL SIZE = 500 BUFFERS 736
             ALLOCATED = 500 TO BE DELETED = 0
             IN-USE/UPDATED = 0
DSNB404I - THRESHOLDS - 739
             VP SEQUENTIAL        = 80   HP SEQUENTIAL = 75
             DEFERRED WRITE       = 85   VERTICAL DEFERRED WRT = 80,0
             PARALLEL SEQUENTIAL  = 50   ASSISTING PARALLEL SEQT = 0

Of course, you can request much more information to be displayed using the DISPLAY BUFFERPOOL command by using the DETAIL parameter. Additionally, you can request that DB2 return either incremental statistics (since the last DISPLAY) or cumulative statistics (since DB2 was started). The statistics in a detail report are grouped in the following categories: GETPAGE information, Sequential Prefetch information, List Prefetch information, Dynamic Prefetch information, Page Update statistics, Page Write statistics, and Parallel Processing Activity details.
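For example, either of the following forms can be issued (BP0 is simply the buffer pool from the earlier example):

-DISPLAY BUFFERPOOL(BP0) DETAIL(INTERVAL)
-DISPLAY BUFFERPOOL(BP0) DETAIL(*)

The first returns the incremental statistics and the second returns the cumulative statistics.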

A lot of interesting and useful details can be gathered simply using the DISPLAY BUFFERPOOL command. For example, you can review GETPAGE requests for random and sequential activity, number of prefetch requests (whether static or dynamic, or for sequential or list prefetch), number of times each of the thresholds were tripped, and much more. Refer to the DB2 Command Reference manual (SC19-4054-02 for DB2 11) for a definition of each of the actual statistics returned by DISPLAY BUFFERPOOL.

Many organizations also have a performance monitor (such as IBM’s Omegamon) that simplifies the gathering and display of buffer pool statistics. Such tools are highly recommended for in-depth buffer pool monitoring and tuning. More sophisticated tools also exist that offer guidance on how to tune your buffer pools — or that automatically adjust your buffer pool parameters according to your workload. Most monitors also provide more in-depth statistics, such as buffer pool hit ratio calculations.

The buffer pool hit ratio is an important tool for monitoring and tuning your DB2 buffer pools. It is calculated as follows:

Hit ratio = (GETPAGEs - pages_read_from_disk) / GETPAGEs

“Pages read from disk” is a calculated field that is the sum of all random and sequential reads.
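For example, if an application drove 1,000,000 GETPAGEs against a buffer pool and 150,000 pages were read in from disk (random plus sequential), the hit ratio would be (1,000,000 - 150,000) / 1,000,000, or 0.85.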

The highest possible buffer pool hit ratio is 1.0. This value is achieved when each requested page is always in the buffer pool. When requested pages are not in the buffer pool, the hit ratio will be lower. You can have a negative hit ratio — this just means that prefetch was requested to bring pages into the buffer pool that were never actually referenced.


In general, the higher the hit ratio the better because it indicates that pages are being referenced from memory in the buffer pool more often. Of course, a low hit ratio is not always bad. The larger the amount of data that must be accessed by the application, the lower the hit ratio will tend to be. Hit ratios should be monitored in the context of the applications assigned to the buffer pool and should be compared against hit ratios from prior processing periods. Fluctuation can indicate problems.