For those who do not know, data masking is a process that creates structurally similar data that is not the same as the values used in production. Masked data does not expose sensitive information to those using it for tasks like software testing and user training. Such a capability is important for complying with regulations like GDPR and PCI DSS, which place restrictions on how personally identifiable information (PII) can be used.
The general idea is to create reasonable test data that can be used like the production data, but without using, and therefore exposing, the sensitive information. Data masking protects the actual data while providing a functional substitute for tasks that do not require actual data values.
What type of data should be masked? Personal information such as names, addresses, social security numbers, and payment card details; financial data such as account numbers, revenue, salaries, and transactions; and confidential company information such as blueprints, product roadmaps, and acquisition plans. Really, it makes sense to mask anything that should not be public information.
Data masking is an important component of building any test bed of data, especially when data is copied from production. To be in compliance, all PII must be masked or changed, and if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:
- Referential constraints must be maintained. If primary or foreign keys change (and they may have to if the original data can be figured out from the key), the data must be changed the same way in both the parent and child tables.
- Do not forget about unique constraints. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.
- The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check (see the sketch after this list). The same is true of the social insurance number in Canada and the social security number in the US, although each has its own rules.
- And do not forget about related data. For example, City, State, and Zip Code values are correlated, meaning that a specific Zip Code aligns with a specific City and State. As such, the masked values must conform to those same correlation rules.
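To illustrate the validity-check point, here is a minimal sketch of the Luhn algorithm that credit card numbers must satisfy, written as a Db2 SQL UDF. The function name and details are illustrative assumptions, not BCV5 code, and it assumes the card number is passed as digits only.

    -- Illustrative Luhn check as a Db2 SQL UDF (assumes digits only, no spaces).
    -- Run from a script with a non-default statement terminator, since the
    -- body itself contains semicolons.
    CREATE FUNCTION LUHN_OK(CARDNO VARCHAR(19))
      RETURNS SMALLINT
      LANGUAGE SQL
      DETERMINISTIC
      NO EXTERNAL ACTION
      CONTAINS SQL
    BEGIN
      DECLARE TOTAL INTEGER DEFAULT 0;
      DECLARE DGT INTEGER;
      DECLARE DBL SMALLINT DEFAULT 0;
      DECLARE POS INTEGER;
      SET POS = LENGTH(CARDNO);
      -- Walk the digits right to left, doubling every second digit
      WHILE POS >= 1 DO
        SET DGT = INT(SUBSTR(CARDNO, POS, 1));
        IF DBL = 1 THEN
          SET DGT = DGT * 2;
          IF DGT > 9 THEN SET DGT = DGT - 9; END IF;
        END IF;
        SET TOTAL = TOTAL + DGT;
        SET DBL = 1 - DBL;
        SET POS = POS - 1;
      END WHILE;
      -- Valid card numbers sum to a multiple of 10
      RETURN CASE WHEN MOD(TOTAL, 10) = 0 THEN 1 ELSE 0 END;
    END

A masked card number must still pass this kind of check, which is why purely random digits will not do.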
A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed. And this is where UBS Hainer’s BCV5 comes in.
BCV5 and Data Masking
Now, anybody who has ever worked on creating a test bed of data for their Db2 environment knows how much work that can be. Earlier this year I wrote about BCV5 and its ability to quickly and effectively copy and move Db2 data. However, I did not discuss BCV5’s ability to perform data masking, which is covered in this blog post.
A component of BCV5, known appropriately enough as The Masking Tool, provides a comprehensive set of data masking capabilities. The tool offers dozens of masking algorithms implemented as Db2 user-defined functions (UDFs), written in PL SQL so they are easy to understand and customize if you so desire.
These functions can be used to generate names, addresses, credit card numbers, social security numbers, and so on. All of the generated data is plausible, but it is not the real data. For example, credit card numbers pass validity checks, addresses have matching street names, zip codes, cities, and states, and so on.
BCV5 uses hash functions that map an input value to a single numeric value (see Figure 1). The input can be any string or number. The hashing algorithm takes the input value and hashes it to a specific number that serves as a seed for a generator. The number is calculated deterministically by the hashing algorithm; it is not a random number. (A sketch of this idea follows Figure 1.)
Figure 1. The input value is hashed to a number that is used as a seed for a generator.
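To make the idea concrete, here is a minimal sketch of a deterministic hash-to-seed function as a Db2 SQL UDF. The function name, the folding constant, and the modulus are assumptions for illustration; BCV5’s actual UDFs will differ.

    -- Illustrative only: fold the input string into a deterministic seed.
    CREATE FUNCTION MASK_SEED(INVAL VARCHAR(254))
      RETURNS BIGINT
      LANGUAGE SQL
      DETERMINISTIC
      NO EXTERNAL ACTION
      CONTAINS SQL
    BEGIN
      DECLARE SEED BIGINT DEFAULT 0;
      DECLARE I INTEGER DEFAULT 1;
      -- Same input, same seed: the calculation involves no randomness
      WHILE I <= LENGTH(INVAL) DO
        SET SEED = MOD(SEED * 131 + ASCII(SUBSTR(INVAL, I, 1)), 2147483629);
        SET I = I + 1;
      END WHILE;
      RETURN SEED;
    END

The seed can then drive whatever generator is appropriate for the column being masked.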
Some data types, such as social security numbers or credit card numbers, can be generated directly from the seed value through mathematical operations. Other types of data, like names or addresses, are picked from a set of lookup tables. The Masking Tool comes with several pre-defined lookup tables that contain thousands of names and millions of addresses in many different languages.
Similar input values result in totally different generated values, so the results are not predictable. Furthermore, the hashing function is designed to be non-invertible, so you cannot infer information about the original value from the generated value.
The functions are repeatable: the same source value always yields the same masked target value. That means that no matter how many times you run the masking process you get the same masked values; the values are different from the production values, but a given source value always maps to the same test value. This is desirable for several reasons:
- Because the hashing algorithm will always generate the same number for the same input value, you can be sure that referential constraints are taken care of (see the example after this list). For example, if the primary key is X598, any foreign key referring to that PK would also contain the value X598… and X598 always hashes to the same number, so the generated value would be the same for the PK and all FKs.
- It is also good for enforcing uniqueness. If a unique constraint is defined on the data, different input values will result in different hashed values… and likewise, repeated input values will result in the same hashed output values (in other words, duplicates).
- Additionally, this repeatability is good for testing code where the program contains processes for checking that values match.
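A quick way to see the repeatability property, using the hypothetical MASK_SEED sketch from above and the X598 example:

    -- Returns the same number on every run, so the masked PK and all
    -- masked FKs stay consistent with each other.
    SELECT MASK_SEED('X598') AS MASKED_KEY
      FROM SYSIBM.SYSDUMMY1;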
Data masking is applied using a set of rules that indicate which columns of which tables should be masked. Wildcarding of the rules is allowed, so you can apply a rule to all tables that match a pattern. At run time, these rules are evaluated and the Masking Tool automatically identifies the involved data types and performs the required masking.
You can have a separate set of rules for each Db2 subsystem that you work with. Depending on your requirements, you can either mask data while making a copy of your tables, or you can mask data in-place (see Figure 2).
Figure 2. Mask data when copying, or mask data in-place.
Masking while copying data is generally most useful when copying data from a production environment into a test or QA system. Alternatively, masking data in-place enables you to mask the contents of an existing set of tables without making another copy. For example, you may use this option to mask data in a pre-production environment that was created by making a 1:1 copy of a production system.
What About Native Masking in Db2 for z/OS?
At this point, some of you are probably asking “Why do I need a product to mask data? Doesn’t Db2 provide a built-in ability to create a mask?” And the answer is “yes”: Db2 offers a basic data masking capability, but without all of the intricate capabilities of a product like BCV5.
Why is this so? Well, Db2’s built-in data masking is essentially just a way of displaying a different value based on a rule for a specific column. A mask is an object created using CREATE MASK, and it specifies a CASE expression that is evaluated to determine the value to return for a specific column. The result of the CASE expression is returned in place of the column value in a row. So it can be used to return a replacement value (like XXXX or ###) for an entire column value, or for a portion of it using SUBSTR.
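For example, a basic Db2 column mask might look like the following. The table, column, and group names here are hypothetical, and it assumes the SSN column is stored in nnn-nn-nnnn form:

    -- Users in the PAYROLL group see the real SSN; everyone else sees
    -- only the last four digits.
    CREATE MASK SSN_MASK ON EMP
      FOR COLUMN SSN RETURN
        CASE WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'PAYROLL') = 1
             THEN SSN
             ELSE 'XXX-XX-' || SUBSTR(SSN, 8, 4)
        END
      ENABLE;

    -- The mask has no effect until column access control is activated
    ALTER TABLE EMP ACTIVATE COLUMN ACCESS CONTROL;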
So native Db2 for z/OS data masking can be used for basic masking of data at execution time. However, it lacks the robust, repeatable nature for generating masked data that a tool like BCV5 can provide.
This overview of Db2 for z/OS data masking has been brief, but I encourage you to examine Db2’s built-in capabilities and compare them to other tools like BCV5.
Poor Masking versus Good Masking
The goal should be to mask your data such that it works like the actual data, but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data).
There are many methods of masking data, some better than others. You should avoid setting up poor data masking rules.
One example of bad masking is simply setting everything to NULL, blank, or XXXXXX. This will break keys and constraints, and it does not allow applications to test everything appropriately because the data will not match up to the rules; it is just “blanked out.”
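To make this concrete, here is what that kind of poor masking looks like (table and column names are hypothetical):

    -- Poor masking: every SSN becomes the same literal value. Any unique
    -- constraint on SSN is now violated, joins on SSN match every row to
    -- every row, and format or check-digit validations fail.
    UPDATE EMP SET SSN = 'XXXXXXXXX';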
Another bad approach is shifting the data, for example A to B, B to C, and so on. Shifting is easy to reverse engineer, making it easy to re-create the original data. Furthermore, the data likely will not match up to business rules, such as check digits and correlation.
You can avoid all of the problems and hassles of data masking by using a product like BCV5 to mask your data effectively and accurately. Take a look at the data masking capabilities of BCV5 and decide for yourself what you need to protect your valuable data and comply with the industry and governmental regulations on that data.