Let's take a look at the burgeoning role of "unstructured" data and the impact of using LOBs to store that data in our Db2 databases.
The Growth of "Unstructured" Data
Although structured
data remains the bedrock of the information infrastructure in most
organizations, unstructured data is
growing in importance. And there is much more "unstructured" data “out
there” than structured. Indeed, analysts at IDC estimate that unstructured
data accounts for as much as 90 percent of all digital information.
The rise of "unstructured" data is often attributed to the growing amount of multimedia data being adopted by organizations as they store more audio, video, and images. But that is only part of the story. Text documents, in the form of business forms, letters, and documents, and most importantly, e-mail, greatly contribute to the growing importance of unstructured data.
Now I'm sure you've noticed the quotes I've been putting around the word "unstructured," and they are there for a reason. You see, the word unstructured is a horrible adjective to use in conjunction with data to refer to this type of data. I refer you to this article I wrote called The Big Unstructured Data Lie for why I think this way. Nevertheless, the term "unstructured" will continue to be used to describe this type of data, so from now on I will stop using the quotes...
Db2 and Unstructured
Data
Db2 for z/OS can be used to store unstructured data using
BLOB, CLOB, and DBCLOB data types – collectively known as LOBs. Historically,
LOBs were added to relational products like DB2 to compete more capably against
the Object-Oriented databases of the time – this was back in the late 1990s.
The idea was to enable relational tables to house unstructured data like images
and audio and video data.
But Db2 for z/OS users were slow to adopt LOBs in their
mainframe databases. This was due to several different reasons, not the least
of which is that it took several versions of Db2 for LOBs to mature into
capable, useful things for enterprise applications. Early implementations of
LOBs in Db2 were somewhat kludgy and difficult to administer and use. But IBM
has corrected many of those deficiencies over time and there are now tools that
can help organizations to exploit and effectively manage Db2 LOBs, too.
The other new force driving LOB usage is the whole Big Data
movement. Big Data is driving organizations to accumulate and analyze more
data, and more varied types of data, to gain business insight. A specific example
of Big Data driving the usage of LOBs in Db2 is the JSON support that has been
added to Db2. JSON objects are stored in Db2 as BLOBs.
So, more and more organizations are adopting LOB data in their Db2 databases – to support unstructured data, for their big data projects, and to store documents and multimedia data.
Nevertheless, sometimes I hear DBAs say things like “Oh, well, I don’t use LOBs so I don’t really need to worry about them.” That can cause trouble, because you have been using LOBs for some time now, whether you know it or not. LOB data has been part of the Db2 Catalog since Version 7 and the number of LOB columns used in the Db2 Catalog has been increasing over the past couple of Db2 releases.
As you can
see in table below, the Db2 Catalog now has over 40 LOB columns. So, even if
you have not created any user Db2 tables with LOBs, you have system Db2 tables
with LOBs. For example, SYSIBM.SYSVIEWS contains a CLOB containing the source
text used to create the VIEW.
LOBs are also used in the Db2 Directory, in the DBD01
“table” a BLOB column is used to store the DBD data (2GB). And in the SPT01
“table” there are two BLOB columns for storing data and explain information.
Using LOBs in Db2
I will not attempt to train you on the usage of LOBs in Db2
in this blog post. Suffice it to say that LOBs require different management and
administration tactics to ensure their accuracy and usability. But I do want to
bring up some of the administration concerns that using LOBs can create.
The first thing to keep in mind is that most LOBs are larger
than your typical column data. I mean, it’s right there in the name LOB = Large OBject. As the size of data increases, so do the management concerns,
such as lengthy elapsed times to run utilities on the data, slower performance of
accessing the data, and so on. And you’ll also need to decide whether or
not to log changes to your LOB data. If you do
log LOB changes, then you can stress your logs because of the size of the LOB
data. If you do not log LOB changes, then you will need to make sure that you
have sufficient methods to recover the LOB data because the changes between
image copies won’t be there in the log. Generally speaking, most usually choose to avoid the logging of LOBs. You can turn off LOB logging by specifying
NOT LOGGED in the LOB table space DDL.
There are also many restrictions on how LOB data can be used
with SQL. LOB data is not like traditional, structured database data, so Db2 imposes some limitations, for example:
- Cannot use a LOB in a GROUP BY or ORDER BY clause
- Cannot specify SELECT DISTINCT on a LOB
- LOBs cannot be used in the context of an INCLUDE(column-name) clause in a MERGE statement
- Cannot define check constraints, primary key, unique, or foreign keys on LOBs
- LOBs cannot be used in any predicate except EXISTS, LIKE and NULL
These are for illustrative purposes only. There are other
restrictions, all of which can be found in the IBM SQL Reference manual.
Unless it is an inline LOB where the entire LOB is stored in
the base table, a LOB will require a LOB table space, auxiliary table, and LOB
index. When building auxiliary tables and indexes, you do not specify columns
the same way that you do for normal tables and indexes. For the Auxiliary Table
you specify the LOB column and base table and Db2 automatically generates the
columns needed. For the Auxiliary Index you just specify the auxiliary table
and Db2 implicitly generates the needed index keys.
Each table can have 0, 1, or many LOB columns and each LOB instance
can be up to 2GB in size. Each table with at least 1 LOB must have a ROWID; the
ROWID is a varying-length 17 byte field. One page of a LOB table space will never
contain more than one LOB, but one LOB can span multiple LOB table space pages.
An auxiliary table, which resides in the LOB table space, can store only one
LOB column of a base table; there must be one and only one index on this column.
All of these things can alter the way in which you manage and administer your
Db2 tables and table spaces.
Let’s backtrack and review the size of LOBs in a little more
depth. Each LOB instance can be up to 2G – and that is per row! Each LOB table space
can have as many as 254 different data sets with a DSSIZE from 2G to 64G each
for a total of about 16 terabytes (TB). This is per partition, so if there are
4096 partitions, then the total size for a single LOB is
over 66,000 TB. That is big!
Unless all of your LOB data is static – meaning it never
changes – the size of your LOB data sets will continue to grow. Are you
prepared for running utilities on such large table spaces?
When you drop a LOB column from a base table, DB2 will not
automatically clean up LOB table spaces. After removing the LOB column you can
either drop the LOB table space yourself explicitly, or perhaps reuse it for
another LOB.
And finally, LOB columns are not really updated. The old
version of the LOB is de-allocated, and a new LOB is allocated. So LOBs are a bit
different than the traditional data we are used to managing.
What can go wrong
with LOBs?
Errors with LOBS
occur when there are inconsistencies between the components of the LOB. We
all know that “normal” Db2 indexes can be inconsistent with their associated
table, but the issues are multiplied for LOB indexes:
- The ROWID-Version number in the base table row may not be found in the LOB index.
- There may be entries in the LOB index that are not referenced by any row in the base table.
- The LOB data itself may not be where the LOB index points to.
- There may be LOBs in the LOB table space that are not referenced by the LOB index.
CHECK DATA can be used to find errors 1 and 2 (from the list
above); CHECK LOB can be used to find errors 3 and 4. But it is possible that CHECK
LOB will convert a type 4 error into a type 2 error, so proceed with caution.
Then there is the issue of LOB index consistency. If the LOB
index is inconsistent with the base table data, the LOB data cannot be
accessed. There is no direct access to the LOB table space except through the
LOB index. If the LOB index is inconsistent with the LOB table space, Db2 will
get errors trying to access the LOB data for that row.
Maintaining LOBs
Db2 maintains LOB data in the LOB table space using a
hierarchical structure.
LOB data in the LOB table space can be distributed over many
different pages of the LOB table space. Remember, this LOB data is very large. Db2 uses a structure of map pages to point to data pages. At the top is the first
map page and it is this page number that is stored in the LOB index. This first
map page contains a list of pages, which can be other map pages and data pages.
It also contains the total size of the LOB data. If all the data pages are not
referenced by map pages or if the map pages are not properly referenced by a
higher level map page, LOB data will be
lost.
With all of these pointers and structures to maintain, there are a variety of things that can go wrong. To verify that your LOBs are structurally sound you must run a series of DB2 utilities, in the following order:
- Run CHECK DATA to verify that the ID fields specified in the base table are also found in the LOB index.
- Run CHECK INDEX to verify that the LOB index is valid.
- Run CHECK LOB to verify that the internal structure of the LOB table space is sound.
The Bottom Line
Business and industry trends dictate that unstructured data,
in the form of LOBs, is increasingly being stored in our Db2 databases. This
type of data is different than traditional, structured data and must be managed
with these differences in mind. To do so requires in-depth knowledge and
planning to avoid inconsistencies and errors.