The Db2 Portal Blog

Thursday, January 17, 2019

Making Plans for IBM Think 2019

I'm looking forward to once again attend IBM Think, IBM's flagship technology conference. I attended the inaugural Think conference and it was one of the highlights of the year (2017). This year IBM Think is being held in San Francisco February 12 thru 15 at the Moscone Center and surrounding hotels. San Francisco is a wonderful location because it will give the conference more room to accommodate the large crowds more comfortably than the Las Vegas venue of 2017.

One of the great things about Think is the breadth and scope of pertinent technical content that it covers. So whether you are a developer, a DBA, a data scientist, a manager, or any flavor of IT specialist, there will be a plethora of useful sessions and activities to educate and make you “think.”

Now you all know that my primary background is database administration and Db2, but I also work with and have interest in many other technologies, including data governance, security and data protection, DevOps, machine learning, AI, blockchain, quantum computing, and cloud computing. And the great thing about the IBM Think conference is that it provides in-depth coverage of all of these areas, and more.

A big struggle for such a large event (expected attendance in excess of 30,000) is finding what you need. Well, IBM Think makes it a bit easier because it is broken down into campuses that focus on a specific areas. This year’s campuses include:

· Smarter Business Showcase
· Data & AI Campus
· Cloud & Infrastructure Campus
· Security & Resiliency Campus

There will be more than 2,000 business strategy sessions and technical deep dives over the course of the week, along with professional development opportunities from 100s of hands-on labs and certification exams.

One of the big highlights of IBM Think is always the great speakers, and this year is no exception. From IBM speakers like CEO Ginni Rometty and Sr. VP Hybrid Cloud Arvind Krishna, to industry speakers like Founder & CEO of Mogul Tiffany Pham and AT&T CEO John Donovan, to researchers like MIT Media Lab and Harvard research specialist Dr. Kate Darling, to entertainers like Super Bowl MVP Joe Montana and skateboarding legend Tony Hawk, there will be a lot of knowledge imparted. I’m particularly looking forward to hearing Paul Cormier, EVP and President of Products and Technologies at Red Hat to hear how the IBM / Red Hat combination is working.

Another advantage of attending IBM Think is the access to exclusive information about IBM products, technologies, strategies, and services that are sure to be shared during the event. IBM always unveils a ton of great stories and technologies at Think.

I’ll be live-tweeting at IBM Think 2019, so be sure to follow me at twitter.com/craigmullins so you can experience Think right along with me, as it happens. Some of the sessions I plan on attending include topics on governed data science, using machine learning to prioritize business issues, and Db2 on cloud... but those are just the tip of the tech iceberg.

And finally, it is not too late. Click here if you want to attend IBM Think 2019… If you do, maybe I’ll see you there amongst 30,000 of our IT friends!

Monday, December 24, 2018

Happy Holidays 2018

It is hard to believe that yet another year has come and gone (well, almost) and that the holiday season is once again upon us. And that means it is time to reflect on the past year -- including all that we have accomplished and what is yet to be done.

And importantly, it is also time to wind down and relax with friends, family and loved ones. A time to put down the work that consumes us most of the year and to celebrate and enjoy...

So whatever holiday tradition you celebrate, be sure to celebrate well, wave goodbye to 2018 and ring in the New Year with happiness and anticipation...

...and I'll see you back here on the blog in the New Year, 2019!

Monday, December 17, 2018

Dirty Reads... Done Dirt Cheap

Let's talk about dirty reads (with apologies to the AC/DC pun in the title of this blog post).

Application programmers must understand how concurrency problems impact the access and modification of Db2 data. When one program attempts to read data that’s in the process of being changed by another, the DBMS must forbid access until the modification is complete to ensure data integrity. Most DBMS products, including Db2, use a locking mechanism for all data items being changed. Therefore, when one task is updating data on a page, another task can’t access data (i.e., read or update) on that same page until the data modification is complete and committed.

If you are interested, I wrote a 17-part series of blog post on Db2 locking back in 2013... that last part, found here, contains an index to all 17 posts. But back to today's topic... the dirty read.

Before discussing what a “dirty read” is, we should first talk a bit about transactions and the importance of ACID. With the advent of NoSQL database systems that do not always support ACID, it is important that developers and DBAs understand what ACID is and why it is important to the integrity of your data.

Transactions and ACID

A transaction is an atomic unit of work with respect to recovery and consistency. A logical transaction performs a complete business process typically on behalf of an online user. It may consist of several steps and may comprise more than one physical transaction. The results of running a transaction will record the effects of a business process—a complete business process. The data in the database must be correct and proper after the transaction executes.

When all the steps that make up a specific transaction have been accomplished, a COMMIT is issued. The COMMIT signals that all work since the last COMMIT is correct and should be externalized to the database. At any point within the transaction, the decision can be made to stop and roll back the effects of all changes since the last COMMIT. When a transaction is rolled back, the data in the database will be restored to the original state before the transaction was started. The DBMS maintains a transaction log (or journal) to track database changes.

In other words, transactions exhibit ACID properties. ACID is an acronym for atomicity, consistency, isolation, and durability. Each of these four qualities is necessary for a transaction to be designed correctly.

· Atomicity means that a transaction must exhibit “all or nothing” behavior. Either all of the instructions within the transaction happen, or none of them happen. Atomicity preserves the “completeness” of the business process.
· Consistency refers to the state of the data both before and after the transaction is executed. A transaction maintains the consistency of the state of the data. In other words, after running a transaction, all data in the database is “correct.”
· Isolation means that transactions can run at the same time. Any transactions running in parallel have the illusion that there is no concurrency. In other words, it appears that the system is running only a single transaction at a time. No other concurrent transaction has visibility to the uncommitted database modifications made by any other transactions. To achieve isolation, a locking mechanism is required.
· Durability refers to the impact of an outage or failure on a running transaction. A durable transaction will not impact the state of data if the transaction ends abnormally. The data will survive any failures.

Let’s use an example to better understand the importance of transactions to database applications. Consider a banking application. Assume that you wish to withdraw $50 from your account with Mega Bank. This “business process” requires a transaction to be executed. You request the money either in person by handing a slip to a bank teller or by using an ATM (Automated Teller Machine). When the bank receives the request, it performs the following tasks, which make up the complete business process. The bank will:

Check your account to make sure you have the necessary funds to withdraw the requested amount.
If you do not, deny the request and stop; otherwise continue processing.
Debit the requested amount from your checking account.
Produce a receipt for the transaction.
Deliver the requested amount and the receipt to you.

The transaction performing the withdrawal must complete all of these steps, or none of these steps, or else one of the parties in the transaction will be dissatisfied. If the bank debits your account but does not give you your money, then you will not be satisfied. If the bank gives you the money but does not debit the account, the bank will be unhappy. Only the completion of every one of these steps results in a “complete business process.” Database developers must understand the requisite business processes and design transactions that ensure ACID properties.

To summarize, a transaction—when executed alone, on a consistent database—will either complete, producing correct results, or terminate, with no effect. In either case the resulting condition of the database will be a consistent state.

Now Let’s Get Back to Dirty Reads

Programs that read Db2 data typically access numerous rows during their execution and are susceptible to concurrency problems. But when writing your application programs you can use read-through locks, also known as “dirty read” or “uncommitted read,” to help overcome concurrency problems. When using uncommitted reads, an application program can read data that has been changed, but not yet committed.

Dirty read capability is implemented using the UR isolation level (for uncommitted read). If the application program is using the UR isolation level, it will read data without taking locks. This lets the application program read data contained in the table as it’s being manipulated. Consider the following sequence of events:

1. At 9 a.m., a transaction containing the following SQL to change a specific value is executed:

UPDATE EMP

SET FIRST_NAME = “MICHELLE”

WHERE EMPNO = 10020;

2. The transaction is long-running and continues to execute without issuing a COMMIT.

3. At 9:01 a.m., a second transaction attempts to SELECT the data that was changed, but not committed.

If the UR isolation level was specified for the second transaction, it would read the changed data even though it had yet to be committed. Because the program simply reads the data in whatever state it happens to be at that moment, it can execute faster than if it had to wait for locks to be taken and resources to be freed before processing.

However, the implications of reading uncommitted data must be carefully examined before being implemented, as several problems can occur. A dirty read can cause duplicate rows to be returned where none exist. Alternately, a dirty read can cause no rows to be returned when one (or more) actually exists.

Some Practical Advice

So, when is it a good idea to implement dirty reads using the UR isolation level? If the data is read only, a dirty read is fine because there are no changes being made to the data. In "real life," though, true read only data is rare.

A general rule of thumb is to avoid dirty reads whenever the results of your queries must be 100 percent accurate. For example, avoid UR if calculations must balance, data is being retrieved from one source to modify another, or for any production, mission-critical work that can’t tolerate data integrity problems.

In other words: If my bank deployed dirty reads on its core banking applications I would definitely find myself another bank!

One of the more concerning things that I’ve witnessed as a Db2 consultant out “in the real world” is a tendency for dirty read to be used as a quick and dirty way to improve performance. By appending a WITH UR to a statement a developer can remove the overhead of locking and improve performance. But often this is done without a thorough investigation of the possible implications. Even worse, some organizations have implemented a standard that says SELECT statements should always be coded using WITH UR. That can wreak havoc on data integrity... and it goes against my core mantra - almost never say always or never.

Most Db2 applications aren’t viable candidates for dirty reads, but there are a few situations where dirty reads can be beneficial. Examples include access to a reference, code, or look-up table (where the data is non-volatile), statistical processing on large amounts of data, analytical queries in data warehousing and Business Intelligence (BI) applications, or when a table (or set of tables) is used by a single user only (which is rare). Additionally, if the data being accessed is already questionable, little harm can be done using a dirty read to access the information.

Because of the data integrity issues associated with dirty reads, DBAs should keep track of the programs that specify an isolation level of UR. This information can be found in the Db2 Catalog. The following two queries can be used to find the applications using uncommitted reads.

Issue the following SQL for a listing of plans that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:

SELECT DISTINCT S.PLNAME

FROM SYSIBM.SYSPLAN P,

SYSIBM.SYSSTMT S

WHERE (P.NAME = S.PLNAME AND

P.ISOLATION = ˈUˈ

)

OR S.ISOLATION = ˈUˈ

ORDER BY S.PLNAME;

Issue the following SQL for a listing of packages that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:

SELECT DISTINCT P.COLLID, P.NAME, P.VERSION

FROM SYSIBM.SYSPACKAGE P,

SYSIBM.SYSPACKSTMT S

WHERE (P.LOCATION = S.LOCATION AND

P.LOCATION = ˈ ˈ AND

P.COLLID = S.COLLID AND

P.NAME = S.NAME AND

P.VERSION = S.VERSION AND

P.ISOLATION = ˈUˈ

)

OR S.ISOLATION = ˈUˈ

ORDER BY S.COLLID, S.NAME, S.VERSION;

The dirty read capability can provide relief to concurrency problems and deliver faster performance in specific situations. Understand the implications of the UR isolation level and the “problems” it can cause before diving headlong into implementing it in your production applications.

Thursday, November 22, 2018

Happy Thanksgiving 2018

Just a quick post today to wish all of my readers in the US (and everywhere, really) a very Happy Thanksgiving.

Historically, Thanksgiving has been observed in the United States on various dates. From the earliest days of the country until Lincoln, the date Thanksgiving was observed differed from state to state. But as of the 19th Century, the final Thursday in November has been the customary celebration date. Our modern idea of Thanksgiving was first officially called for in all states in 1863 by a presidential proclamation made by Abraham Lincoln.

With all that history aside, I am just looking forward to celebrating with family and eating a nice, juicy turkey!

Oh, and I'll probably watch some football, too...

Here's wishing you and yours a healthy, happy, relaxing Thanksgiving day!

Monday, November 12, 2018

Data Masking: An Imperative for Compliance and Governance

For those who do not know, data masking is a process that creates structurally similar data that is not the same as the values used in production. Masked data does not expose sensitive data to those using it for tasks like software testing and user training. Such a capability is important to be in compliance with regulations like GDPR and PCI-DSS, which place restrictions on how personally identifiable information (PII) can be used.

The general idea is to create reasonable test data that can be used like the production data, but without using, and therefore exposing the sensitive information. Data masking protects the actual data but provides a functional substitute for tasks that do not require actual data values.

What type of data should be masked? Personal information like name, address, social security number, payment card details; financial data like account numbers, revenue, salary, transactions; confidential company information like blueprints, product roadmaps, acquisition plans. Really, it makes sense to mask anything that should not be public information.

Data masking is an important component of building any test bed of data – especially when data is copied from production. To be in compliance, all PII must be masked or changed, and if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:

Referential constraints must be maintained. If primary or foreign keys change – and they may have to if you can figure out the original data using the key – the data must be changed the same way in both the parent, and child tables.

Do not forget about unique constraints. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.

The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check. The same is true of the social insurance number in Canada and the social security number in US, too (although both have different rules).

And do not forget about related data. For example, City, State, and Zip Code values are correlated, meaning that a specific Zip Code aligns with a specific City and State. As such, the masked values should conform to the rules,

A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed. And this is where UBS Hainer’s BCV5 comes in.

BCV5 and Data Masking

Now anybody who has ever worked on creating a test bed of data for their Db2 environment knows how much work that can be. Earlier this year I wrote about BCV5 and its ability to quickly and effectively copy and move Db2 data. However, I did not discuss BCV5’s ability to perform data masking, which will be covered in this blog post.

A component of BCV5, known appropriately enough as The Masking Tool, provides a comprehensive set of data masking capabilities. The tool offers dozens of masking algorithms implemented as Db2 user-defined functions (UDFs), written in PL SQL so they are easy to understand and customize if you so desire.

These functions can be used to generate names, addresses, credit card numbers, social security numbers, and so on. All of the generated data is plausible, but not the real data. For example, credit card numbers pass validity checks, addresses have matching street names, zip codes, cities, and states, and so on...

BCV5 uses hash functions that map an input value to a single numeric value (see Figure 1). The input can be any string or a number. So the hashing algorithm takes the input value and hashes it to a specific number that serves as a seed for a generator. The number is calculated using the hashing algorithm, it is not a random number.

Figure 1. The input value is hashed to a number that is used as a seed for a generator

Some data types, such as social security numbers or credit card numbers, can be generated directly from the seed value through mathematical operations. Other types of data, like names or addresses, are picked from a set of lookup tables. The Masking Tool comes with several pre-defined lookup tables that contain thousands of names and millions of addresses in many different languages.

Similar input values result in totally different generated values so the results are not predictable and the hashing function is designed to be non-invertible, so you cannot infer information about the original value from the generated value.

The functions are repeatable – the same source value always yields the same masked target value. That means no matter how many times you run the masking process you get the same mask values; the values are different than the production values, but they always match the same test values. This is desirable for several reasons:

Because the hashing algorithm will always generate the same number for the same input value you can be sure that referential constraints are taken care of. For example, if the primary key is X598, any foreign key referring to that PK would also contain the value X598… and X598 always hashes to the same number, so the generated value would be the same for the PK and all FKs.
It is also good for enforcing uniqueness. If a unique constraint is defined on the data different input values will result in different hashed values… and likewise, repeated input values will result in the same hashed output values (in other words, duplicates).
Additionally, this repeatability is good for testing code where the program contains processes for checking that values match.

Data masking is applied using a set of rules that indicate which columns of which tables should be masked. Wild carding of the rules is allowed, so you can apply a rule to all tables that match a pattern. At run time, these rules are evaluated and the Masking Tool automatically identifies the involved data types and performs the required masking.

You can have a separate set of rules for each Db2 subsystem that you work with. Depending on your requirements, you can either mask data while making a copy of your tables, or you can mask data in-place (see Figure 2).

Figure 2. Mask data when copying or mask-in-place.

Masking while copying data is generally most useful when copying data from a production environment into a test or QA system. Or you can mask data in-place enabling you to mask the contents of an existing set of tables without making another copy. For example, you may use this option to mask data in a pre-production environment that was created by making a 1:1 copy of a productive system.

What About Native Masking in Db2 for z/OS?

At this point, some of you are probably asking “Why do I need a product to mask data? Doesn’t Db2 provide a built-in ability to create a mask?” And the answer is “yes,” Db2 offers a basic data masking capability, but without all of the intricate capabilities of a product like BCV5.

Why is this so? Well, Db2’s built-in data masking is essentially just a way of displaying a different value based on a rule for a specific column. A mask is an object created using CREATE MASK and it specifies a CASE expression to be evaluated to determine the value to return for a specific column. The result of the CASE expression is returned in place of the column value in a row. So, it can be used to specify a value (like XXXX or ###) for an entire column value, or a portion thereof using SUBSTR.

So native Db2 for z/OS data masking can be used for basic masking of data at execution time. However, it lacks the robust, repeatable nature for generating masked data that a tool like BCV5 can provide.

This overview of Db2 for z/OS data masking has been brief, but I encourage you to examine Db2’s built-in capabilities and compare them to other tools like BCV5.

Poor Masking versus Good Masking

The goal should be to mask your data such that it works like the actual data, but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data).

There are many methods of masking data, some better than others. You should look to avoid setting up poor data masking rules.

One example of bad masking is just setting everything to NULL, blank, or XXXXXX. This will break keys and constraints and it does not allow applications to test everything appropriately because the data won’t match up to the rules – it is just “blanked out.”

Another bad approach is shifting the data, for example A – B, B – C, etc. Shifting is easy to reverse engineer making it easy to re-create the original data. Furthermore, the data likely won’t match up to business rules, such as check digits and correlation.

You can avoid all of the problems and hassles of data masking by using a product like BCV5 to mask your data effectively and accurately. Take a look at the data masking capabilities of BCV5 and decide for yourself what you need to protect your valuable data and comply with the industry and governmental regulations on that data.