Monday, April 21, 2014

A Little Bit About LOBs

In today's blog post we will take a brief look at LOBs, or Large OBjects, in DB2. I have been prepping for my webinar later this week, titled Bringing Big Data to DB2 for z/OS with LOBs: Understanding, Using, andManaging DB2 LOBsBe sure to click on the link for that and join me on April 24, 2014 for that webinar!

But back to the topic du jour... LOBs. Let's start with a bit of history. LOBs were added to relational products like DB2, back in the 1990s, ostensibly to compete more capably against the Object-Oriented databases of the time. Remember them? Back then it seemed that everybody thought OO DBMS products would supplant relational technology. Seems somewhat quaint now, doesn't it?

At any rate, the idea was to enable relational tables to house unstructured data like images and audio and video data. DB2 for z/OS users were slow to adopt LOBs in their mainframe databases. I think that is due to several different reasons, not the least of which is that it took several new versions of DB2 for LOBs to mature into capable, usable things for enterprise applications. Early implementations of LOBs in DB2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of the deficiencies over time and there are now tools that can help us to effectively manage DB2 LOBs, too.

The other new force driving LOB usage is the whole Big Data movement. Big Data is a force that is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. The most common definition of Big Data was coined by Forrester Research defining big data in terms of “The 4 V’s” -- volume, velocity, variety, variability. But that is somewhat limiting. However, I do not want to turn this posting into a definition of big data, so... let's just think of Big Data as MORE DATA, MORE TYPES OF DATA, and FASTER GENERATION OF DATA. One example of Big Data driving the usage of LOBs in DB2 is the JSON support that has been added to DB2. JSON objects are stored in DB2 as BLOBs. 

So LOBs have been around for awhile now, but it has taken some time for them to gain significant levels of usage "out there!" 

Why LOBs, you may ask? Why not just store the data in regular old DB2 data types like VARCHAR or VARGRAPHIC? The basic answer is that DB2's data types were not large enough to hold this amount of data, because of their limit of 32 KB. Some multimedia data can get VERY large. For example, a high resolution video requires about 3 GB for each hour. And high-def TV video requires 720 GB/hour!

But an additional part of the answer is that you do not always want to access the LOB data with the traditional data. For example, if you are querying employee information you probably don't always want their photo or their entire resume... so the data can be stored in a LOB, external from the rest of the data (so as not to impede performance) but connected to the data (when you need to access it).

There are three types of LOBs supported by DB2:
1. BLOB – Binary Large Object – for binary data
2. CLOB – Character Large Object – for text data
3. DBCLOB – Double Byte Character Large Object – for graphic character data

Furthermore, there are two options for specifying LOBs in DB2 tables:

  1. The traditional way to define LOBs where the LOB data is stored separately from the rest of the data. This requires defining additional database objects.
  2. A newer method, for smaller LOBs, introduced with DB2 10, called Inline LOBs.

Using the traditional method, LOBs are defined such that they are stored external to the base table. The column is specified as a BLOB, CLOB or DBCLOB in the base table, but DB2 stores the actual LOB data in an auxiliary table in a LOB table space. You must define a LOB table space per LOB per partition. So, let’s say we are creating a table with 2 LOBs in a table space with ten (10) partitions. That means we will need to define 20 LOB table spaces – 10 for the first LOB (one per partition) and 10 for the second LOB (one for each partition). An auxiliary table is defined for each LOB table space to hold the LOB data. And an auxiliary index is required on each auxiliary table. 

Now back to the base table. Remember that we have defined the LOB columns in the base table. But we also need to include a ROWID column in the base table. Only one ROWID column is needed per base table no matter how many LOBs you have defined in the table.


Don’t confuse the ROWID with other concepts that may seem to be similar. A ROWID is not an identity column nor is it a SEQUENCE. The ROWID is used by DB2 behind the scenes to connect the base table row to the auxiliary table row that contains the LOB data. You generally will not access the ROWID, although it is possible to do so just like any other column using SQL. With the ROWID you can perform direct row access. This is so because the row ID value implicitly contains the location of the row. 

Finally, each LOB column also has a 2 byte version number associated with it. This is used by DB2 when LOB data is modified. You cannot access the version number.

Now as of DB2 10 for z/OS, you can create inline LOBs in your DB2 tables. But what is an inline LOB? Well, at a high level, the name is somewhat self-explanatory. The LOB data for inline LOBs is stored with the rest of the base table.

But it is really a bit more nuanced that that. You can store part of the LOB inline with the rest of the data and the rest of the LOB externally, if you so choose. So an inline LOB can have all of the LOB data stored inline with the rest of the data (if it is small enough) or it can store only a portion with the rest of the data. This can be a great technique to use if you have some applications that require perhaps only the first 500 bytes of a larger LOB. Those 500 can be stored inline – with the rest of the data – while the rest is stored externally and accessed only when needed. You can create an inline LOB by specifying the INLINE LENGTH clause on your CREATE TABLE statement.

Inline LOBs can improve performance depending upon the type of access required. Consider the example where RESUME data is stored in a CLOB. The first 200 bytes of the resume are accessed most of the time, with the entire resume being accessed only during rare instances (e.g. interview, performance reviews, etc.) By storing those first 200 bytes inline with the rest of the data we can eliminate the I/O to the LOB in the auxiliary table, thereby improving performance. External LOB data is not buffered, but inline LOB data is – this too can impact performance.

This blog entry is getting a little longer than I was expecting, so I am going to cut it off here. We've reviewed what LOBs are, a little but of their history, and discussed a but about both traditional and in-line LOBs. To learn more, be sure to join me for the webinar on April 24th (or watch it later - it will be recorded).

Labels: ,

Monday, April 14, 2014

Aggregating Aggregates Using Nested Table Expressions


Sometimes when you are writing your SQL to access data you come across the need to work with aggregates. Fortunately, SQL offers many simple ways of aggregating data. But what happens when you uncover then need to perform aggregations of aggregates?

What does that mean? Well, consider an example. Let's assume that you  want to compute the average of a sum. This is a reasonably common requirement that comes up frequently in applications that are built around sales amounts. Let's say that we have a table containing sales information, where each sales amount has additional information indicating the salesman, region, district, product, date, etc. 

A common requirement is to produce a report of the average sales by region for a particular period, say the first quarter of 2014. But the data in the table is at a detail level, meaning we have a row for each specific sale.

A novice SQL coder might try to write a query with a function inside a function, like AVG(SUM(SALE_AMT)). Of course, this is invalid SQL syntax. DB2 will not permit the nesting of aggregate functions. But we can use nested table expressions and our knowledge of SQL functions to build the correct query.

Let’s start by creating a query to return the sum of all sales by region for the time period in question. That query should look something like this:

SELECT REGION, SUM(SALE_AMT)
FROM   SALES
WHERE SALE_DATE BETWEEN DATE(‘2014-01-01’)
                AND     DATE(‘2014-03-31’)
GROUP BY REGION;


Now that we have the total sales by region for the period in question, we can embed this query into a nested table expression in another query like so:

SELECT NTE.REGION, AVG(NTE.TOTAL_SALES)
FROM (SELECT REGION, SUM(SALE_AMT)
      FROM   SALES
      WHERE SALE_DATE BETWEEN DATE(‘2014-01-01’)
                      AND     DATE(‘2014-03-31’)
      GROUP BY REGION) AS NTE

GROUP BY NTE.REGION;


And voila! We have aggregated an aggregate, (averaged a sum)...

Labels:

Sunday, April 06, 2014

DB2 Buffer Pool Monitoring

After setting up your buffer pools, you will want to regularly monitor your configuration for performance. The most rudimentary way of doing this is using the -DISPLAY BUFFERPOOL command. There are many options of the DISPLAY command that can be used to show different characteristics of your buffer pool environment; the simplest is the summary report, requested as follows:

-DISPLAY BUFFERPOOL(BP0) LIST(*) DBNAME(DSN8*)

And a truncated version of the results will look something like this:

DSNB401I - BUFFERPOOL NAME BP0, BUFFERPOOL ID 0, USE COUNT 20
DSNB402I - VIRTUAL BUFFERPOOL SIZE = 500 BUFFERS 736
             ALLOCATED = 500 TO BE DELETED = 0
             IN-USE/UPDATED = 0
DSNB404I - THRESHOLDS - 739
             VP SEQUENTIAL        = 80   HP SEQUENTIAL = 75
             DEFERRED WRITE       = 85   VERTICAL DEFERRED WRT = 80,0
             PARALLEL SEQUENTIAL  = 50   ASSISTING PARALLEL SEQT = 0

Of course, you can request much more information to be displayed using the DISPLAY BUFFERPOOL command by using the DETAIL parameter. Additionally, you can request that DB2 return either incremental statistics (since the last DISPLAY) or cumulative statistics (since DB2 was started). The statistics in a detail report are grouped in the following categories: GETPAGE information, Sequential Prefetch information, List Prefetch information, Dynamic Prefetch information, Page Update statistics, Page Write statistics, and Parallel Processing Activity details.

A lot of interesting and useful details can be gathered simply using the DISPLAY BUFFERPOOL command. For example, you can review GETPAGE requests for random and sequential activity, number of prefetch requests (whether static or dynamic, or for sequential or list prefetch), number of times each of the thresholds were tripped, and much more. Refer to the DB2 Command Reference manual (SC19-4054-02 for DB2 11) for a definition of each of the actual statistics returned by DISPLAY BUFFERPOOL.

Many organizations also have a performance monitor (such as IBM’s Omegamon) that simplifies the gathering and display of buffer pool statistics. Such tools are highly recommended for in-depth buffer pool monitoring and tuning. More sophisticated tools also exist that offer guidance on how to tune your buffer pools — or that automatically adjust your buffer pool parameters according to your workload. Most monitors also provide more in-depth statistics, such as buffer pool hit ratio calculations.

The buffer pool hit ratio is an important tool for monitoring and tuning your DB2 buffer pools. It is calculated as follows:

Hit ratio = GETPAGES - pages_read_from_disk / GETPAGEs

“Pages read from disk” is a calculated field that is the sum of all random and sequential reads.

The highest possible buffer pool hit ratio is 1.0. This value is achieved when each requested page is always in the buffer pool. When requested pages are not in the buffer pool, the hit ratio will be lower. You can have a negative hit ratio — this just means that prefetch was requested to bring pages into the buffer pool that were never actually referenced.


In general, the higher the hit ratio the better because it indicates that pages are being referenced from memory in the buffer pool more often. Of course, a low hit ratio is not always bad. The larger the amount of data that must be accessed by the application, the lower the hit ratio will tend to be. Hit ratios should be monitored in the context of the applications assigned to the buffer pool and should be compared against hit ratios from prior processing periods. Fluctuation can indicate problems.

Labels: ,

Sunday, March 30, 2014

DB2 Buffer Pool Sizing

Sizing DB2 buffer pools can be a time-consuming and arduous process. With that in mind, here are a few thoughts on the matter...
  
DB2 very efficiently manages data in large buffer pools. To search for data in a large buffer pool does not consume any more resources than searching in a smaller buffer pool. With this in mind, do not skimp on allocating memory to your DB2 buffer pools.

So, just how big should you make each buffer pool?  One rule of thumb is to make the buffer pool large enough to hold five minutes worth of randomly read pages during peak processing time. To clarify, the goal is that once a page has been read from disk into the buffer pool, the goal is to maintain it in memory for at least five minutes. The idea is that a page, once read, is likely to be needed again by another process during peak periods. Of course, your particular environment may differ. 

If you have metrics on how frequently a piece of data once read, will need to be read again, as well as how soon it will need to be read, you can use that to tinker with your buffer pool size and parameters to optimize buffer pool data residency.

But it can be difficult to know how much data is read at peak processing time, as well as even knowing when peak processing time will be, before an application is coded. You can gather estimates from the subject matter experts, end users, and application designers, but it will just be estimates.  Over time you will have to examine buffer pool usage and perhaps re-size your buffer pools, or move objects from one buffer pool to another to improve performance.

Of course, you will want to make sure that you have not allocated so much memory to DB2 buffer pools that the system starts paging. When there is not enough real storage to back the buffer pool storage, paging activity will cause performance to degrade.

What is paging? Paging occurs when the virtual storage requirements for a buffer pool exceeds the real storage capacity for the z/OS image. When this happens DB2 migrates the least recently used pages in the buffer pool to auxiliary storage. If the data that was migrated is then accessed those pages must be brought back into real storage from auxiliary storage. When you detect that DB2 is paging you should either increase the amount of real storage or decrease the size of your buffer pools.


Only DBAs or systems programmers should be allowed to add or modify the size of DB2 buffer pools.  And these qualified professionals should be able to analyze the system to know the amount of memory available (real and virtual), as well as the amount being used by other system software and applications. 

For a good overview of mainframe virtual storage consult the following link from IBM

Labels: ,

Friday, March 21, 2014

DB2 Tool Requirements

The last blog post here at the DB2 Portal offered up a brief overview of the types of tools that you might want to consider to help you use, manage, and administer your DB2 applications and databases. But it did not really look into the capabilities and requirements for modern DB2 tools and solutions.
Today’s DB2 management and administration tools should provide intelligent automation to reduce the problems inherent in the tedious day-to-day tasks of database administration. Simple automation is no longer sufficient. Modern data management software must be able to intelligently monitor, analyze, and optimize applications using past, present, and future analysis of collected data. Simply stated, the software should work the way a consultant works--fulfilling the role of a trusted advisor. The end result should be software that functions like a consultant, enabling your precious human resources to spend time on research, strategy, planning, and implementing new and advanced features and technologies, instead of rote day-to-day tasks.
Furthermore, modern database tools should provide cross-platform, heterogeneous management. For most medium-to-large IT organization it is not enough to manage just DB2 for z/OS systems, for example. The ability to offer administrative and development assistance across multiple DBMS platforms (for example, DB2 for LUW, Oracle, SQL Server, MySQL, and so on). Most companies have multiple DBMSs that need to be managed -- not just one... and DBAs and developers are precious resources that increasingly are being asked to work on more than just a single DBMS. When the tools can manage cross-platform, the learning curve is reduced and productivity can be enhanced.
And while it is true that today’s DBMS products are becoming more self-managing, they do not yet provide out-of-the-box, lights-out operation, nor do they offer all of the speed, usability, and ease of use features of ISV admin, management, and development tools. An organization looking to provide 24/7 data availability coupled with efficient performance will have to augment the capabilities of their DBMS software with data management and DBA tools to get the job done.
As data management tasks get more complex and DBAs become harder to find and retain, more and more database maintenance duties should be automated using intelligent management software. Using intelligent, automated DB2 tools will help to reduce the amount of time, effort, and human error associated with implementing and managing efficient database applications.

Labels: ,