Monday, November 12, 2018

Data Masking: An Imperative for Compliance and Governance



For those who do not know, data masking is a process that creates structurally similar data that is not the same as the values used in production. Masked data does not expose sensitive data to those using it for tasks like software testing and user training. This capability is important for complying with regulations like GDPR and PCI-DSS, which place restrictions on how personally identifiable information (PII) can be used.

The general idea is to create reasonable test data that can be used like the production data, but without using, and therefore exposing, the sensitive information. Data masking protects the actual data but provides a functional substitute for tasks that do not require actual data values.

What type of data should be masked? Personal information like name, address, social security number, payment card details; financial data like account numbers, revenue, salary, transactions; confidential company information like blueprints, product roadmaps, acquisition plans. Really, it makes sense to mask anything that should not be public information.

Data masking is an important component of building any test bed of data – especially when data is copied from production. To be in compliance, all PII must be masked or changed, and if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:

  • Referential constraints must be maintained. If primary or foreign keys change – and they may have to if you can figure out the original data using the key – the data must be changed the same way in both the parent and child tables.
  • Do not forget about unique constraints. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.
  • The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check. The same is true of the social insurance number in Canada and the social security number in the US (although the two have different rules).
  • And do not forget about related data. For example, City, State, and Zip Code values are correlated, meaning that a specific Zip Code aligns with a specific City and State. As such, the masked values should conform to these rules.
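
To make the validity-check point concrete, here is a minimal Python sketch of the Luhn checksum, the check commonly applied to payment card numbers. The function names are illustrative assumptions and have nothing to do with how any particular masking product implements this; the sketch only shows why an arbitrary digit string usually fails and how a generated number can be made to pass.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def with_check_digit(partial: str) -> str:
    """Append the one check digit that makes the number Luhn-valid."""
    for check in "0123456789":
        if luhn_is_valid(partial + check):
            return partial + check
    raise ValueError("input must contain digits only")


print(luhn_is_valid("4111111111111111"))  # a well-known test card number
print(luhn_is_valid("4111111111111112"))  # last digit changed
print(with_check_digit("411111111111111"))
```

A masking routine that generates the leading digits and then appends the computed check digit produces numbers that application validation will accept, which a purely random number generally will not.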

A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed. And this is where UBS Hainer’s BCV5 comes in.

BCV5 and Data Masking

Now anybody who has ever worked on creating a test bed of data for their Db2 environment knows how much work that can be. Earlier this year I wrote about BCV5 and its ability to quickly and effectively copy and move Db2 data. However, I did not discuss BCV5’s ability to perform data masking, which will be covered in this blog post.

A component of BCV5, known appropriately enough as The Masking Tool, provides a comprehensive set of data masking capabilities. The tool offers dozens of masking algorithms implemented as Db2 user-defined functions (UDFs), written in SQL PL so they are easy to understand and customize if you so desire.

These functions can be used to generate names, addresses, credit card numbers, social security numbers, and so on. All of the generated data is plausible, but it is not the real data. For example, generated credit card numbers pass validity checks, and generated addresses have matching street names, zip codes, cities, and states.

BCV5 uses hash functions that map an input value to a single numeric value (see Figure 1). The input can be any string or a number. The hashing algorithm takes the input value and hashes it to a specific number that serves as a seed for a generator. The number is calculated using the hashing algorithm; it is not a random number.


Figure 1. The input value is hashed to a number that is used as a seed for a generator

Some data types, such as social security numbers or credit card numbers, can be generated directly from the seed value through mathematical operations. Other types of data, like names or addresses, are picked from a set of lookup tables. The Masking Tool comes with several pre-defined lookup tables that contain thousands of names and millions of addresses in many different languages.
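
The hash-to-seed-to-generator flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 stands in for whatever hash the tool actually uses (the post does not say), the tiny FIRST_NAMES list stands in for the real lookup tables, and the function names are hypothetical.

```python
import hashlib
import random

# Hypothetical stand-in for a lookup table of replacement names.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dalia", "Emeka"]


def seed_from(value: str) -> int:
    """Hash the input to a number; the same input always gives the same seed."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def mask_name(value: str) -> str:
    """Lookup-table case: the seed decides which table entry is picked."""
    rng = random.Random(seed_from(value))
    return rng.choice(FIRST_NAMES)


def mask_id(value: str) -> str:
    """Computed case: derive a value in NNN-NN-NNNN layout from the seed.

    Only the layout is reproduced; real issuing rules are not enforced.
    """
    rng = random.Random(seed_from(value))
    return f"{rng.randint(1, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


print(mask_name("Mullins"), mask_id("123-45-6789"))
```

Because the generator is seeded from the hash rather than from the clock, running the script twice prints the same masked values.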

Similar input values result in totally different generated values, so the results are not predictable. The hashing function is also designed to be non-invertible, so you cannot infer information about the original value from the generated value.

The functions are repeatable – the same source value always yields the same masked target value. That means no matter how many times you run the masking process, you get the same masked values: they differ from the production values, but a given production value always maps to the same masked value. This is desirable for several reasons:

  • Because the hashing algorithm will always generate the same number for the same input value you can be sure that referential constraints are taken care of. For example, if the primary key is X598, any foreign key referring to that PK would also contain the value X598… and X598 always hashes to the same number, so the generated value would be the same for the PK and all FKs. 
  • It is also good for enforcing uniqueness. If a unique constraint is defined on the data, different input values will result in different hashed values… and likewise, repeated input values will result in the same hashed output values (in other words, duplicates).
  • Additionally, this repeatability is good for testing code where the program contains processes for checking that values match.
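
The effect of repeatability on keys can be seen in a small Python sketch. The tables, column names, and mask_key function are hypothetical, and SHA-256 is again an assumed stand-in for the tool's hash. The parent and child tables are masked independently, yet the foreign keys still line up.

```python
import hashlib


def mask_key(value: str) -> str:
    """Deterministically replace a key value (hypothetical masking function)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "K" + digest[:12].upper()


# Hypothetical parent and child tables sharing the key column cust_id.
customers = [{"cust_id": "X598"}, {"cust_id": "X599"}]
orders = [
    {"order_id": 1, "cust_id": "X598"},
    {"order_id": 2, "cust_id": "X598"},
    {"order_id": 3, "cust_id": "X599"},
]

# Each table is masked on its own, with no knowledge of the other.
masked_customers = [{**row, "cust_id": mask_key(row["cust_id"])} for row in customers]
masked_orders = [{**row, "cust_id": mask_key(row["cust_id"])} for row in orders]

parent_keys = {row["cust_id"] for row in masked_customers}
orphans = [row for row in masked_orders if row["cust_id"] not in parent_keys]

print("orphaned child rows:", len(orphans))
print("distinct parent keys:", len(parent_keys))
```

One caveat about this sketch: a truncated hash can collide in principle, so the code checks the number of distinct parent keys rather than assuming uniqueness.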

Data masking is applied using a set of rules that indicate which columns of which tables should be masked. Wildcarding of the rules is allowed, so you can apply a rule to all tables that match a pattern. At run time, these rules are evaluated and the Masking Tool automatically identifies the involved data types and performs the required masking.

You can have a separate set of rules for each Db2 subsystem that you work with. Depending on your requirements, you can either mask data while making a copy of your tables, or you can mask data in-place (see Figure 2).
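
As an illustration of how wildcarded rules might be evaluated, here is a minimal Python sketch. The rule format, table names, and function names are all assumptions made for the example; this is not the Masking Tool's actual rule syntax, which the post does not show.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (table pattern, column pattern, masking function name).
RULES = [
    ("PROD.CUST*", "LAST_NAME", "mask_name"),
    ("PROD.*", "SSN", "mask_ssn"),
    ("PROD.ORDER_*", "CARD_*", "mask_card"),
]


def rules_for(table: str, column: str) -> list[str]:
    """Return the masking functions whose patterns match this table and column."""
    return [
        func
        for table_pattern, column_pattern, func in RULES
        if fnmatchcase(table, table_pattern) and fnmatchcase(column, column_pattern)
    ]


print(rules_for("PROD.CUSTOMER", "LAST_NAME"))
print(rules_for("PROD.ORDER_HDR", "CARD_NO"))
print(rules_for("TEST.CUSTOMER", "SSN"))
```

Keeping one such rule list per subsystem would mirror the idea of a separate rule set for each Db2 subsystem.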


Figure 2. Mask data when copying or mask-in-place.


Masking while copying data is generally most useful when copying data from a production environment into a test or QA system. Alternatively, you can mask data in place, which lets you mask the contents of an existing set of tables without making another copy. For example, you may use this option to mask data in a pre-production environment that was created by making a 1:1 copy of a production system.

What About Native Masking in Db2 for z/OS?

At this point, some of you are probably asking “Why do I need a product to mask data? Doesn’t Db2 provide a built-in ability to create a mask?” And the answer is “yes,” Db2 offers a basic data masking capability, but without all of the intricate capabilities of a product like BCV5.

Why is this so? Well, Db2’s built-in data masking is essentially just a way of displaying a different value based on a rule for a specific column. A mask is an object created using CREATE MASK and it specifies a CASE expression to be evaluated to determine the value to return for a specific column. The result of the CASE expression is returned in place of the column value in a row. So, it can be used to specify a value (like XXXX or ###) for an entire column value, or a portion thereof using SUBSTR.

So native Db2 for z/OS data masking can be used for basic masking of data at execution time. However, it lacks the robust, repeatable nature for generating masked data that a tool like BCV5 can provide.

This overview of Db2 for z/OS data masking has been brief, but I encourage you to examine Db2’s built-in capabilities and compare them to other tools like BCV5.

Poor Masking versus Good Masking

The goal should be to mask your data such that it works like the actual data, but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data).

There are many methods of masking data, some better than others. You should look to avoid setting up poor data masking rules.

One example of bad masking is just setting everything to NULL, blank, or XXXXXX. This will break keys and constraints, and it does not allow applications to test everything appropriately because the data won’t match up to the rules – it is just “blanked out.”

Another bad approach is shifting the data, for example replacing A with B, B with C, etc. Shifting is easy to reverse engineer, which makes it easy to re-create the original data. Furthermore, the shifted data likely won’t match up to business rules, such as check digits and correlation.
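
A short Python sketch shows how little effort it takes to undo a shift. The shift function is a hypothetical example of this kind of poor masking, not something any product is claimed to use.

```python
def shift(text: str, offset: int) -> str:
    """Shift the letters A-Z by offset positions, wrapping around; keep other characters."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


masked = shift("SMITH", 1)      # the "masked" value stored in test
recovered = shift(masked, -1)   # anyone who guesses the offset gets the original

# Even without knowing the offset, only 25 candidates need to be tried.
candidates = [shift(masked, -k) for k in range(1, 26)]

print(masked, recovered, "SMITH" in candidates)
```

Contrast this with the hash-based approach described earlier, where the generated value is designed to reveal nothing about its input.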

You can avoid all of the problems and hassles of data masking by using a product like BCV5 to mask your data effectively and accurately. Take a look at the data masking capabilities of BCV5 and decide for yourself what you need to protect your valuable data and comply with the industry and governmental regulations on that data.

Thursday, November 01, 2018

30th Anniversary of the Platinum Db2 Tip of the Month


If you have worked with Db2 as long as I have you probably have fond memories of the Platinum Db2 Tip of the Month... but I know there are a lot of you out there who have no idea what I'm talking about. So let me explain.

First of all, there used to be a software company called Platinum Technology, Inc. They were headquartered in Oakbrook Terrace, Illinois and made some of the earliest Db2 for z/OS management products. Platinum was acquired by CA in 1999 and most of those good old Platinum Db2 tools are still available from CA today (albeit updated and modified, of course).

Well, back in the day, Platinum was one of the most innovative marketers in the world of Db2, and they used to mail out a monthly tip about how to use Db2 more efficiently. Even though they sold and marketed their tools, they were promoting Db2 itself (which made sense, because if Db2 thrived, so would their tools). 

And yes, I said mailed. With a stamp. In a mailbox and delivered by a postal worker. This was well before the days of email and the Internet. So each month, Db2 DBAs would eagerly anticipate receiving the latest tip of the month from Platinum... I know I did... until I joined Platinum and started writing the tips!

So the point of this blog post is just to commemorate the occasion, as this month, November 2018, marks the 30th anniversary of the first tip, which was mailed out to Db2 users in November 1988.

And here is what that tip was:



This is the type of thing that the tips covered, among many other tricks and techniques. 

And no, I do not still have this first tip in its original version (although I do still have a stack of original tips). This image comes from the 50th Monthly Tip book that Platinum published compiling the first fifty tips.

Here is the cover of that book:




Thanks for taking this trip down Db2 memory lane with me... hope you enjoyed it! How many of you "out there" still have copies of the Platinum Monthly Db2 Tips?

Friday, October 19, 2018

Unboxing My Book: A Guide to Db2 Performance for Application Developers

Just a quick blog post today to show everybody that my latest book, A Guide to Db2 Performance for Application Developers, is published and ready for shipping!  I just got my author's copies as you can see in this video:



Hope you all out there in Db2-land find the book useful.

If you've bought a copy and have any comments, please feel free to share them here on the blog.

Monday, October 15, 2018

Published and Available to be Ordered: A Guide to Db2 Performance for Application Developers

The print version of my new book, A Guide to Db2 Performance for Application Developers, can now be ordered directly from the publisher. (If you want the ebook, it can be ordered from the same link below.)

Just click on the book cover below and you can order it right now! The link provides more details on the book as well as options for buying the book.

 



Quick information about the book: The purpose of A Guide to Db2 Performance for Application Developers is to give advice and direction to Db2 application developers and programmers to help you code efficient, well-performing programs. If you write code and access data in a Db2 database, then this book is for you. Read the book and apply the advice it gives and your DBAs will love you!

The book was written based on the latest and greatest versions of Db2 for z/OS and Db2 for LUW... and, yes, the book covers both.

If you buy the book and have any thoughts for me, drop me a comment here on the blog!

Friday, October 05, 2018

What Do You Think of the New Design?

Regular readers should have noticed that the logo and basic design of the blog have been "spiffed" up a bit. I did this because the blog has been around for a long time now... my first post was in October 2005! So it was time for a bit of freshening.

Generally speaking, I think blogs are mostly for conveying information, so perhaps I haven't paid as much attention as I should have to the look and feel of this blog. But hopefully I fixed that (at least somewhat) for now.

Also, please note that I have not removed any old content. Everything that was here before stays here, even the posts from over a decade ago. I am a big believer in keeping stuff available... some might call that being a packrat, but I wear that label proudly.




So I tend to err on the side of not removing content... I figure, you should note the date of everything you read on the Internet anyway... right?

Let me know what you think of the new look!