
Thursday, June 01, 2023

Manage Your Test Data Before It Becomes a Bottleneck

Test data management is an important aspect of application development. Anybody who has ever been involved in software development knows that creating applications requires test data. Without the data to test your programs against, there is really no way to ensure that the programs are doing what you want them to do.


But test data management can quickly consume a lot of time and effort. Supporting multiple programmers, multiple database environments, and making sure you have sufficient test cases is easier said than done. And we are not just talking about developing new applications, but also modifying existing ones!

So, what can be done? Well, I'm going to point you to a couple of resources. First, I am delivering a webinar on this topic on June 8, 2023, titled Manage Your Test Data Before It Becomes a Bottleneck -- the same title as this blog post!

In this webinar, I discuss test data management and the trends impacting application development and testing. Some of the issues I examine include automation, shift-left testing, agile development, and DevOps. And I'll uncover the impact of these trends and how they drive up the speed and complexity of creating and testing applications. I will also define the components and requirements of test data management and examine the pitfalls that can lead test data management to become a bottleneck, slowing down your developers.

Another resource you can take a look at is the article I wrote for ELNION, titled The Test Data Management Struggles of Modern App Development.

In this article, I explore the issues impacting application developers regarding test data management, and look at how they can overcome these challenges to ensure a successful testing process.

So if test data management is a challenge for you and your organization, I urge you to take advantage of these resources by registering for and attending the webinar, and also reading the article.

Friday, July 16, 2021

Keeping Track of Data Movement in Db2 for z/OS

Creating and managing test data for Db2 application development and testing requirements can be a significant challenge. To enable not only the development of new programs, but to be able to maintain existing ones, organizations must ensure that there is an adequate amount of accurate test data always available. Without relevant, useful data, there is no way to test applications to make sure they are operating correctly. 

Although this duty must be a shared one between the application developers and the DBAs, managing and controlling all of the data movement tasks typically falls on the DBAs. And keeping track of what data moved where, when it moved, and why can at times be as much of a challenge as moving the data itself.

Fortunately, there are test data management tools available to not only move the data, but to keep track of it. I’ve written about one of the better Db2 for z/OS data movement tools here in the blog before: Fast and Effective Db2 for z/OS Test Data Management with BCV5. I hope you'll take a moment to click and read that post.

Now BCV5 has been improved with a new reporting feature that enables users to track the movement of data across their Db2 subsystems. This is a significant new capability that DBAs, storage administrators, and even data stewards can use to glean useful information for data governance.

There are six tables of metadata that BCV5 populates to track the data movement and the copy tasks it performs. These tables are:
  • BCV531.TASK_EXECUTIONS
  • BCV531.TASK_EXECUTIONS_OBJECTS
  • BCV531.TASK_EXECUTIONS_PARAMETERS
  • BCV531.TASK_EXECUTIONS_JOBS
  • BCV531.TASK_EXECUTIONS_RULES
  • BCV531.TASK_EXECUTIONS_MASKING
The information in these tables is updated whenever BCV5 runs a task to copy Db2 data. You can query these tables just like any other Db2 tables to monitor the details of the BCV5 tasks you have run. This information may be useful for many different IT and business professionals, but let’s take a look at three specific use cases: 
  1. database administration (DBA), 
  2. storage administration, and 
  3. data governance.
DBA tracking
DBAs tasked with moving and refreshing data from one Db2 environment to another are the typical users of BCV5, and therefore they will be one of the primary users of the new reporting tables. Most sites that use BCV5 use it to refresh test data, for example, copying production data to test, or copying unit test data to an integration test set of tables.  

Regardless of the type of data movement being undertaken, it is usually being done for multiple tables, table spaces, and databases. Usually, there will be regularly scheduled processes that copy some of the data, but this is rarely sufficient, as there will be one-off requests, special situations, and emergency data refreshes happening all the time. Keeping track of such a hectic morass of copying and refreshing data can be difficult.

Fortunately, if you are using BCV5 the new Usage Tracker tables can simplify keeping track of data refreshes for DBAs. For example, a DBA looking to find out which BCV5 copy tasks were run during the month of May could code a query like:

 SELECT T.ROWDATE, T.USER, T.TASKNAME 
 FROM   BCV531.TASK_EXECUTIONS T 
 WHERE  T.ROWDATE BETWEEN '2021-05-01' AND '2021-05-31' 
 ORDER BY T.ROWDATE ;

This will show all the BCV5 copy tasks that ran during that timeframe, and will look something like this:

ROWDATE      USER     TASKNAME 
---------+---------+---------+---------+--------- 
2021-05-12   USERID1  TSK0001 
2021-05-12   USERID1  TSK0002 
2021-05-20   USERID5  TSKPROD1 
2021-05-21   USERID9  TSKPROD4 
 

The results shown here are just a sample and will likely be a subset of the actual results of running such a query. 

Of course, this is rudimentary information and it is likely that the DBA will want to know more, such as which objects were impacted by these tasks. A query such as the following will come in handy:

SELECT SUBSTR(SRCSCHEMA,1,8) AS SS, 
       SUBSTR(SRCNAME,1,12) AS SN, 
       SUBSTR(TGTSCHEMA,1,8) AS TS, 
       SUBSTR(TGTNAME,1,12) AS TN, 
       SUBSTR(OBJTYPE,1,1) AS OT, 
       ROWDATE AS DATE_COPIED, 
       SIZEKB 
FROM   BCV531.TASK_EXECUTIONS_OBJECTS 
ORDER BY DATE_COPIED ;

The results show all of the Db2 objects copied by BCV5, listing the source and target names as well as the object type, the date copied, and the amount of data (in KB) copied:

SS      SN       TS     TN       OT DATE_COPIED   SIZEKB 
---------+---------+---------+---------+---------+-----
DB500XA TS500X01 DBA001 TS500XA1 S  2021-05-17  1462480 
QUALID  XCL59011 TESTID XCL59011 X  2021-05-17   325040 
QUALID  XCL59012 TESTID XCL59012 X  2021-05-17   125200 
QUALID  XCL59013 TESTID XCL59013 X  2021-05-17   301460 
QUALID  XCL5901C TESTID XCL5901C X  2021-05-17    98400 
QUALID  TEST_TBL TESTID TEST_TBL T  2021-05-17       20 

Again, the results have been truncated as this is intended as an example.

A DBA looking to track down the results of a specific copy task that ran on a specific date might want to run a query like this, to verify which objects were copied. Simply plug in the name of your task and the date it ran:

SELECT T.TASKNAME, O.OBJTYPE, O.SRCSCHEMA, O.SRCNAME, 
       O.TGTSCHEMA, O.TGTNAME, O.PARTITIONS 
FROM   BCV531.TASK_EXECUTIONS         T, 
       BCV531.TASK_EXECUTIONS_OBJECTS O 
WHERE  T.ID = O.EXECID 
AND    T.TASKNAME = ? 
AND    T.ROWDATE = ? ;

Storage Administration Tracking 
Another type of user who might find the Usage Tracker capabilities of BCV5 useful is the storage administrator. Storage administrators are responsible for managing an organization’s disk and tape systems. Additionally, they are also responsible for monitoring storage usage and capacity to ensure that sufficient storage is available for the organization’s IT requirements. 

As such, the storage administrator will likely want to keep an eye on the data movement activities of BCV5. For example, a query such as this one can be used to report on the total amount of data (TS and IX) copied by date:

SELECT ROWDATE AS DATE_COPIED, 
       SUM(SIZEKB) AS TOTAL_KB_COPIED 
FROM   BCV531.TASK_EXECUTIONS_OBJECTS 
WHERE  OBJTYPE IN ('S', 'X') 
GROUP BY ROWDATE ;

Which will return data similar to this:

DATE_COPIED  TOTAL_KB_COPIED 
---------+---------+---------+---------+------ 
2021-06-12         106231270 
2021-06-19         106231270 
2021-06-21        4451457810 
2021-06-26         106231270 

Another potentially useful query, not only for storage administrators and DBAs, but also for application managers, is tracking the amount of actual (table space) data copied by date and application. Finding the application name or identifier can be tricky, but if we assume that an application identifier is embedded in the second and third characters of the database name, then a query like this can be run:

WITH SIZEBYAPP AS ( 
  SELECT ROWDATE AS DATE_COPIED, 
         SUBSTR(TGTSCHEMA,2,2) AS APPL_NAME, 
         SIZEKB AS SIZE_IN_KB 
  FROM   BCV531.TASK_EXECUTIONS_OBJECTS 
  WHERE OBJTYPE = 'S' 
                  ) 
SELECT DATE_COPIED, APPL_NAME, 
       SUM(SIZE_IN_KB) AS TOTAL_KB_COPIED 
FROM   SIZEBYAPP 
GROUP BY DATE_COPIED, APPL_NAME ;

Which might return a report looking something like this:

DATE_COPIED APPL_NAME TOTAL_KB_COPIED 
---------+------------------+---------+------- 
2021-05-11  EN               46805760 
2021-05-22  BA              242791056 
2021-05-22  BX                4094640 
2021-05-22  CM                1008720 
2021-05-22  DA              270390816 
2021-05-22  OR                  90528 
2021-05-26  PR               55737376 
2021-05-26  XX              537647328 

You can adjust this query if you want to know the amount of index data copied by date and application, like so (under the same assumption as above for the application identifier):

WITH SIZEBYAPP AS ( 
  SELECT B.ROWDATE AS DATE_COPIED, 
         SUBSTR(T.DBNAME,2,2) AS APPL_NAME, 
         B.SIZEKB AS SIZE_IN_KB 
  FROM   BCV531.TASK_EXECUTIONS_OBJECTS B, 
         SYSIBM.SYSINDEXES              X, 
         SYSIBM.SYSTABLES               T 
 WHERE   B.OBJTYPE = 'X' 
 AND     B.TGTNAME = X.NAME 
 AND     B.TGTSCHEMA = X.CREATOR 
 AND     X.TBNAME = T.NAME 
 AND     X.TBCREATOR = T.CREATOR 
                 ) 
SELECT DATE_COPIED, APPL_NAME, 
       SUM(SIZE_IN_KB) AS TOTAL_KB_COPIED 
FROM   SIZEBYAPP 
GROUP BY DATE_COPIED, APPL_NAME ;

Data Governance Tracking 
Although tracking data movement activities is useful for DBAs, it is also important for data governance reporting. Data governance refers to the processes and standards for ensuring access to high-quality data throughout an organization. Data governance encompasses all aspects of data quality including its accuracy, availability, consistency, integrity, security, and usability. The role of data governance has grown as data privacy rules and regulations have expanded in response to an increasing number of data breaches and hacker attacks. For example, in early July 2021, Colorado passed the Colorado Privacy Act, meaning that Colorado, Virginia, and California have now passed data privacy legislation that impacts how personal data must be governed. More states are certain to follow their lead… and let’s not forget the European Union’s GDPR!

These types of regulations provide rights for access, deletion, correction, portability, and protection of personally identifiable information, or PII. Note that “portability” is one aspect of data protection covered under the auspices of such regulations… and BCV5 is a mover of data, so you need to be able to track what data moved where, especially when the data that moved contains any PII.

So, what types of queries can be run using the new BCV5 reporting tables to help satisfy the needs of data governance? 

Well, if you have identified specific tables that contain personally identifiable information, and therefore require specific policies to ensure privacy and protection, a data steward might want to run a query that shows all of the times that a specific protected table was copied:

SELECT SUBSTR(SRCSCHEMA,1,8) AS SS, 
       SUBSTR(SRCNAME,1,12) AS SN, 
       SUBSTR(TGTSCHEMA,1,8) AS TS, 
       SUBSTR(TGTNAME,1,12) AS TN, 
       ROWDATE AS DATE_COPIED, 
       SIZEKB 
FROM   BCV531.TASK_EXECUTIONS_OBJECTS 
WHERE  OBJTYPE = 'T' 
AND    SRCSCHEMA = ? 
AND    SRCNAME = ? 
ORDER BY DATE_COPIED ;

Simply code the appropriate schema (SRCSCHEMA) and table name (SRCNAME) and this query will show all the times that the particular table (say PROD.CUSTOMERS) was copied. A data governance professional with a list of tables that contain PII could alter this query to accept that list as an IN clause instead of the simple equality clause shown here.
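
For example, here is a rough sketch of that variation. The schema and the table names in the IN list (PROD, CUSTOMERS, ACCOUNTS, PAYMENTS) are hypothetical placeholders; substitute the protected tables you have identified:

SELECT SUBSTR(SRCSCHEMA,1,8) AS SS, 
       SUBSTR(SRCNAME,1,12) AS SN, 
       SUBSTR(TGTSCHEMA,1,8) AS TS, 
       SUBSTR(TGTNAME,1,12) AS TN, 
       ROWDATE AS DATE_COPIED, 
       SIZEKB 
FROM   BCV531.TASK_EXECUTIONS_OBJECTS 
WHERE  OBJTYPE = 'T' 
AND    SRCSCHEMA = 'PROD' 
AND    SRCNAME IN ('CUSTOMERS', 'ACCOUNTS', 'PAYMENTS') 
ORDER BY DATE_COPIED ;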

Additionally, BCV5 can de-identify sensitive data using masking. Whenever a task that requires data to be masked is run, information is captured in the TASK_EXECUTIONS_MASKING table. So, a data governance professional might want to run a query like this one to report on all the masking of sensitive data.

SELECT SUBSTR(TBCREATOR,1,8) AS TBCREATOR,
       SUBSTR(TBNAME,1,12) AS TBNAME, 
       SUBSTR(COLNAME,1,18) AS COLNAME, 
       METHOD AS MASKING_METHOD, 
       SQLEXPR 
FROM   BCV531.TASK_EXECUTIONS_MASKING 
ORDER BY ROWDATE ;

This can always be modified to join it to the TASK_EXECUTIONS table to obtain the task name if that is important. And with a little manipulation of the query it is possible to look for tables that contain sensitive data that have been copied using BCV5, but have not had masking applied. 
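
As a starting point, the sketch below outlines one way to find copy activity with no corresponding masking activity. It assumes (this is an assumption, not documented BCV5 behavior) that TBCREATOR and TBNAME in the masking table line up with the source creator and name in the objects table, so verify the join columns against your own BCV5 metadata before relying on it:

-- Hedged sketch: copies of a protected table with no masking recorded;
-- plug in the schema and table name, or adapt to an IN list as above
SELECT O.ROWDATE AS DATE_COPIED,
       SUBSTR(O.SRCSCHEMA,1,8) AS SRC_CREATOR,
       SUBSTR(O.SRCNAME,1,12)  AS SRC_TABLE
FROM   BCV531.TASK_EXECUTIONS_OBJECTS O
WHERE  O.OBJTYPE = 'T'
AND    O.SRCSCHEMA = ?
AND    O.SRCNAME   = ?
AND    NOT EXISTS
       (SELECT 1
        FROM   BCV531.TASK_EXECUTIONS_MASKING M
        WHERE  M.TBCREATOR = O.SRCSCHEMA
        AND    M.TBNAME    = O.SRCNAME
        AND    M.ROWDATE   = O.ROWDATE)
ORDER BY DATE_COPIED ;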

Summary 
BCV5 has offered powerful data movement and masking capabilities for Db2 data for a long time, but now it also offers the ability to track and report on your organization’s Db2 data movement and copy tasks. This new functionality opens up a wealth of useful information for BCV5 users.

Monday, April 22, 2019

Db2 Application Testing Considerations


Testing application programs is a discipline unto itself and there are many considerations and nuances to be mastered to be able to test software appropriately. This blog post will not go into a lot of depth regarding testing practices, other than to highlight the important things to keep in mind to assure optimal performance of your Db2 applications.

Statistics

The data that you use in your test environment will not be the same as your production data. Typically, you will have less test data than you do in production. So, if you run the RUNSTATS utility on your test data you will get different statistics than in production. 

Instead of running RUNSTATS, you can test your SQL access paths by updating the test catalog statistics to be the same as your production system. Do not attempt to modify test statistics on your own. You should work with your DBA group to set up a process for updating test statistics. This can be accomplished in various ways. Your organization may have a tool that makes it easy to copy statistics from production to test, or your DBA team may use a script of catalog queries and UPDATE statements to populate test statistics from production.
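
To make the idea concrete, here is a minimal sketch of the kind of catalog updates such a script might contain. The creator, table, column, and index names (TESTID.CUSTOMERS, LASTNAME, XCUST01) and the statistics values are illustrative placeholders; a real script would derive the values from the production catalog and cover many more statistics columns:

-- Illustrative only: apply production-derived statistics to the test catalog
UPDATE SYSIBM.SYSTABLES
SET    CARDF = 1000000, NPAGESF = 25000
WHERE  CREATOR = 'TESTID' AND NAME = 'CUSTOMERS';

UPDATE SYSIBM.SYSCOLUMNS
SET    COLCARDF = 50000
WHERE  TBCREATOR = 'TESTID' AND TBNAME = 'CUSTOMERS' AND NAME = 'LASTNAME';

UPDATE SYSIBM.SYSINDEXES
SET    FIRSTKEYCARDF = 50000, FULLKEYCARDF = 1000000,
       NLEAF = 6000, NLEVELS = 3
WHERE  CREATOR = 'TESTID' AND NAME = 'XCUST01';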

If your table definitions differ between test and production, you will need to take this into account in the script. For example, things like creator, name, indexes, number of partitions, and even columns can differ between environments. Furthermore, you may have new tables and columns for which there are no current production statistics, meaning that you will need to create estimated statistics based on your knowledge of the business and application.

Some organizations make the mistake of copying production statistics to test once, and never (or rarely) populating test again. This is a mistake because most production databases change over time, sometimes dramatically. When you run RUNSTATS for your production applications it is a good idea to also update your test statistics from the new production statistics.

Modeling a Production Environment in Test

Another tactic you can use to improve the accuracy of access path testing is to model the configuration and settings of your production environment in your test system. Remember that the Db2 optimizer does not just use statistics, but also information about your computing environment.

Db2 test systems typically vary from the production system. Application testing is often accomplished on test systems that have different parameters and configurations than the production systems that run the applications. Test systems usually get set up on a less powerful processor (or LPAR), and use less memory for buffering, sorting, and other system processes. This can result in different access paths for test and production systems, which can cause performance problems that only show up after you move to production.

However, it is possible to model the configuration and parameters of your production environment in your test system. You can specify configuration details for Db2 to use for access path selection in your test system using profile tables.
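
As a very rough sketch of the mechanism only (the PROFILEID value and the keyword shown are placeholders; check the Db2 for z/OS documentation on modeling a production environment for the exact keywords supported by your Db2 version), a row is inserted into the profile table, attribute rows describe the production values to model, and the profile is then started:

-- Hedged sketch: keyword names are illustrative, not verified for your release
INSERT INTO SYSIBM.DSN_PROFILE_TABLE
       (PROFILEID, PROFILE_ENABLED, REMARKS)
VALUES (777, 'Y', 'Model production environment in test');

INSERT INTO SYSIBM.DSN_PROFILE_ATTRIBUTES
       (PROFILEID, KEYWORDS, ATTRIBUTE2, REMARKS)
VALUES (777, 'SORT_POOL_SIZE', 20000, 'Production sort pool size');

-- The profile is then activated with the Db2 command: -START PROFILE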

Work with your DBA group to configure proper profile settings for testing your applications.

Test Cases for Skewed Data

Db2 assumes that data values are mostly uniformly distributed throughout the data. However, not all data is uniformly distributed. Db2 RUNSTATS can capture information about non-uniformly distributed and skewed data.

When data is non-uniformly distributed a subset of the values occur much more frequently than others. A special case of non-uniformly distributed data is skewed data. When data is skewed, one value (or a very small number of values) occurs much more frequently than others.

Non-uniformly distributed and skewed data presents a performance testing challenge. The Db2 optimizer can formulate different access paths for non-uniformly distributed data based on the actual values supplied. This is particularly important for dynamic SQL applications, but you should be aware of non-uniformly distributed and skewed data even for static SQL applications.

For non-uniformly distributed data you can examine the Db2 catalog to obtain values for the most commonly occurring values. For Db2 for z/OS this information is in the SYSIBM.SYSCOLDIST table.

Be sure to test with several of the values that are stored in the COLVALUE column of this table, and some that are not. This will enable you to test the performance of the most common values and the less common values. The access paths may differ for each, and the performance can also vary.
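
For example, a query along these lines (PROD.CUSTOMERS is a placeholder table name) lists the most frequently occurring values that RUNSTATS has captured as frequency statistics:

-- Illustrative query: most common values captured for a table's columns
SELECT SUBSTR(NAME,1,18)     AS COLNAME,
       SUBSTR(COLVALUE,1,20) AS COLVALUE,
       FREQUENCYF
FROM   SYSIBM.SYSCOLDIST
WHERE  TBOWNER = 'PROD'
AND    TBNAME  = 'CUSTOMERS'
AND    TYPE    = 'F'
ORDER BY FREQUENCYF DESC ;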

An Example

Let’s discuss an example. Suppose you operate a bar with a cigar room and you want to keep track of customers with a table of cigar smokers. We gather information like name, address, phone number, sex, and favorite cigar in the table. Cigar smokers skew male, so there will likely be many more rows where Sex is M than there are where Sex is F. With that background, consider a query like this one:

  SELECT name, phoneno, fav_cigar
  FROM   cigar_smokers
  WHERE  sex = ?;

Because the data is skewed, it is possible that Db2 will choose a different access path for M than for F. If the vast majority of the rows are men, then a table scan might be chosen for Sex = ‘M’; whereas with only a few rows for women, an index might be chosen if one exists on the Sex column.

This is just a simple example. You need to understand your data and how it skews to make sure that you create and test sample test cases for all of the pertinent values.
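
One way to verify this during testing is to EXPLAIN the statement with each literal value and compare the access paths recorded in PLAN_TABLE. Here is a hedged sketch (the QUERYNO values are arbitrary, and your PLAN_TABLE qualifier may differ):

-- Explain the query once per representative value
EXPLAIN PLAN SET QUERYNO = 101 FOR
  SELECT name, phoneno, fav_cigar
  FROM   cigar_smokers
  WHERE  sex = 'M';

EXPLAIN PLAN SET QUERYNO = 102 FOR
  SELECT name, phoneno, fav_cigar
  FROM   cigar_smokers
  WHERE  sex = 'F';

-- Compare the chosen access paths
SELECT QUERYNO, TNAME, ACCESSTYPE, ACCESSNAME, MATCHCOLS
FROM   PLAN_TABLE
WHERE  QUERYNO IN (101, 102)
ORDER BY QUERYNO ;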

SQL Variations

A final performance testing consideration is to try multiple SQL variations, especially for queries that access a lot of data or have complex access paths. Do not just find one SQL formulation that works and stick with it. Remember that you can code multiple variations of SQL statements that return the same data, but that perform quite differently.
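
For instance, here is a hedged illustration (the CUSTOMERS and ORDERS tables are placeholders) of two formulations that return the same result but may be optimized differently, so both are worth testing:

-- Formulation 1: non-correlated IN subquery
SELECT C.CUSTNO, C.NAME
FROM   CUSTOMERS C
WHERE  C.CUSTNO IN
       (SELECT O.CUSTNO
        FROM   ORDERS O
        WHERE  O.ORDER_DATE > '2019-01-01');

-- Formulation 2: a join returning the same rows
SELECT DISTINCT C.CUSTNO, C.NAME
FROM   CUSTOMERS C,
       ORDERS    O
WHERE  C.CUSTNO = O.CUSTNO
AND    O.ORDER_DATE > '2019-01-01';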


Note
This blog post was adapted and excerpted from my latest book, A Guide to Db2 Performance for Application Developers. Click the link for more information or to buy a copy (both print and ebook available).



Monday, November 12, 2018

Data Masking: An Imperative for Compliance and Governance



For those who do not know, data masking is a process that creates structurally similar data that is not the same as the values used in production. Masked data does not expose sensitive data to those using it for tasks like software testing and user training. Such a capability is important to be in compliance with regulations like GDPR and PCI-DSS, which place restrictions on how personally identifiable information (PII) can be used.

The general idea is to create reasonable test data that can be used like the production data, but without using, and therefore exposing the sensitive information. Data masking protects the actual data but provides a functional substitute for tasks that do not require actual data values.

What type of data should be masked? Personal information like name, address, social security number, payment card details; financial data like account numbers, revenue, salary, transactions; confidential company information like blueprints, product roadmaps, acquisition plans. Really, it makes sense to mask anything that should not be public information.

Data masking is an important component of building any test bed of data – especially when data is copied from production. To be in compliance, all PII must be masked or changed, and if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:

  • Referential constraints must be maintained. If primary or foreign keys change – and they may have to if you can figure out the original data using the key – the data must be changed the same way in both the parent and child tables.
  • Do not forget about unique constraints. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.
  • The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check. The same is true of the social insurance number in Canada and the social security number in the US (although each has its own rules).
  • And do not forget about related data. For example, City, State, and Zip Code values are correlated, meaning that a specific Zip Code aligns with a specific City and State. As such, the masked values should conform to these correlations.

A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed. And this is where UBS Hainer’s BCV5 comes in.

BCV5 and Data Masking

Now anybody who has ever worked on creating a test bed of data for their Db2 environment knows how much work that can be. Earlier this year I wrote about BCV5 and its ability to quickly and effectively copy and move Db2 data. However, I did not discuss BCV5’s ability to perform data masking, which will be covered in this blog post.

A component of BCV5, known appropriately enough as The Masking Tool, provides a comprehensive set of data masking capabilities. The tool offers dozens of masking algorithms implemented as Db2 user-defined functions (UDFs), written in PL SQL so they are easy to understand and customize if you so desire.

These functions can be used to generate names, addresses, credit card numbers, social security numbers, and so on. All of the generated data is plausible, but not the real data. For example, credit card numbers pass validity checks, addresses have matching street names, zip codes, cities, and states, and so on...

BCV5 uses hash functions that map an input value to a single numeric value (see Figure 1). The input can be any string or a number. The hashing algorithm takes the input value and hashes it to a specific number that serves as a seed for a generator. The number is calculated using the hashing algorithm; it is not a random number.


Figure 1. The input value is hashed to a number that is used as a seed for a generator

Some data types, such as social security numbers or credit card numbers, can be generated directly from the seed value through mathematical operations. Other types of data, like names or addresses, are picked from a set of lookup tables. The Masking Tool comes with several pre-defined lookup tables that contain thousands of names and millions of addresses in many different languages.

Similar input values result in totally different generated values, so the results are not predictable. The hashing function is also designed to be non-invertible, so you cannot infer information about the original value from the generated value.

The functions are repeatable – the same source value always yields the same masked target value. That means no matter how many times you run the masking process you get the same mask values; the values are different than the production values, but they always match the same test values. This is desirable for several reasons:

  • Because the hashing algorithm will always generate the same number for the same input value, you can be sure that referential constraints are taken care of. For example, if the primary key is X598, any foreign key referring to that PK would also contain the value X598… and X598 always hashes to the same number, so the generated value would be the same for the PK and all FKs (see the sketch following this list). 
  • It is also good for enforcing uniqueness. If a unique constraint is defined on the data, different input values will result in different hashed values… and likewise, repeated input values will result in the same hashed output values (in other words, duplicates). 
  • Additionally, this repeatability is good for testing code where the program contains processes for checking that values match.
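
To make that repeatability concrete, here is a minimal sketch of masking while copying. This is not BCV5’s actual implementation: MASK_CUSTNO is a hypothetical deterministic masking function, and the PROD and TESTID tables are placeholders:

-- MASK_CUSTNO is an illustrative, deterministic masking UDF, not a BCV5 function
INSERT INTO TESTID.CUSTOMER (CUSTNO, NAME, CITY)
  SELECT MASK_CUSTNO(CUSTNO), NAME, CITY
  FROM   PROD.CUSTOMER;

INSERT INTO TESTID.ORDERS (ORDERNO, CUSTNO, ORDER_DATE)
  SELECT ORDERNO, MASK_CUSTNO(CUSTNO), ORDER_DATE
  FROM   PROD.ORDERS;

-- Because MASK_CUSTNO always returns the same output for the same input,
-- the masked CUSTNO values in ORDERS still match the masked CUSTNO values
-- in CUSTOMER, so the foreign key relationship is preserved.
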
Data masking is applied using a set of rules that indicate which columns of which tables should be masked. Wild carding of the rules is allowed, so you can apply a rule to all tables that match a pattern. At run time, these rules are evaluated and the Masking Tool automatically identifies the involved data types and performs the required masking.
You can have a separate set of rules for each Db2 subsystem that you work with. Depending on your requirements, you can either mask data while making a copy of your tables, or you can mask data in-place (see Figure 2).


Figure 2. Mask data when copying or mask-in-place.


Masking while copying data is generally most useful when copying data from a production environment into a test or QA system. Alternatively, you can mask data in-place, enabling you to mask the contents of an existing set of tables without making another copy. For example, you may use this option to mask data in a pre-production environment that was created by making a 1:1 copy of a production system.

What About Native Masking in Db2 for z/OS?

At this point, some of you are probably asking “Why do I need a product to mask data? Doesn’t Db2 provide a built-in ability to create a mask?” And the answer is “yes,” Db2 offers a basic data masking capability, but without all of the intricate capabilities of a product like BCV5.

Why is this so? Well, Db2’s built-in data masking is essentially just a way of displaying a different value based on a rule for a specific column. A mask is an object created using CREATE MASK and it specifies a CASE expression to be evaluated to determine the value to return for a specific column. The result of the CASE expression is returned in place of the column value in a row. So, it can be used to specify a value (like XXXX or ###) for an entire column value, or a portion thereof using SUBSTR.
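
For illustration, a native Db2 column mask might look something like the sketch below. The EMP table, SSN column, and PAYROLL group are hypothetical, and the exact formatting expression would depend on your data:

-- Hedged example: show SSN only to a hypothetical PAYROLL group;
-- everyone else sees only the last four digits
CREATE MASK EMP_SSN_MASK ON EMP
  FOR COLUMN SSN RETURN
    CASE WHEN VERIFY_GROUP_FOR_USER(SESSION_USER, 'PAYROLL') = 1
         THEN SSN
         ELSE 'XXX-XX-' CONCAT SUBSTR(SSN, 8, 4)
    END
  ENABLE;

-- Column access control must be activated before the mask takes effect
ALTER TABLE EMP ACTIVATE COLUMN ACCESS CONTROL;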

So native Db2 for z/OS data masking can be used for basic masking of data at execution time. However, it lacks the robust, repeatable nature for generating masked data that a tool like BCV5 can provide.

This overview of Db2 for z/OS data masking has been brief, but I encourage you to examine Db2’s built-in capabilities and compare them to other tools like BCV5.

Poor Masking versus Good Masking

The goal should be to mask your data such that it works like the actual data, but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data).

There are many methods of masking data, some better than others. You should look to avoid setting up poor data masking rules.

One example of bad masking is just setting everything to NULL, blank, or XXXXXX. This will break keys and constraints and it does not allow applications to test everything appropriately because the data won’t match up to the rules – it is just “blanked out.”

Another bad approach is shifting the data, for example changing A to B, B to C, and so on. Shifting is easy to reverse engineer, making it easy to re-create the original data. Furthermore, the data likely won’t match up to business rules, such as check digits and correlation.

You can avoid all of the problems and hassles of data masking by using a product like BCV5 to mask your data effectively and accurately. Take a look at the data masking capabilities of BCV5 and decide for yourself what you need to protect your valuable data and comply with the industry and governmental regulations on that data.

Wednesday, June 20, 2018

Fast and Effective Db2 for z/OS Test Data Management with BCV5


Perhaps the most significant requirement for coding successful Db2 application programs is having a reasonable set of test data to use during the development process. Without data, there is no way to test your code to make sure it runs. But there are many issues that must be overcome in order to have a useful test data management process. Today we will talk about this along with a key enabling component of such a process, BCV5 from UBS Hainer.
One of the first things that organizations try is to make a copy of the production data for testing. But this is easier said than done. You cannot just stop your production databases to make a copy of them for testing. But you still want a fast, consistent copy of the data, consistent in terms of units of work and referential integrity. And maybe you just want some of the data, not all of it. And we haven’t even talked about the potential regulatory concerns if you are copying personally identifiable information.
When you initially go to build your test data environment, the tools at your disposal are likely the utilities that came with Db2. This means that you will start with solutions like unloading and loading the data. But the LOAD and UNLOAD utilities are not known for their speed, so this can take a long time to accomplish – both for the initial creation and for any subsequent refreshing of the test data. This is important because test data must be refreshed on a regular basis as application testing is performed. Without the capability to refresh it is impossible to compare test runs and develop your programs consistently.
So, what should you do? Well, the first step is to create a consistent test bed either from scratch or, more likely, from production. And you want to do this efficiently and without interrupting production processing. This core bed of test data can be manipulated to reduce its size and even to satisfy regulatory requirements. With a core set of data you can then develop procedures to copy this data out to the various development and QA environments. To succeed, you need a fast method of populating multiple environments, on demand, from the approved test bed.
A key to achieving such an environment is an efficient Db2 data copying tool like BCV5, which can be used to copy and refresh Db2 data very rapidly. BCV5 copies Db2 table spaces and indexes within the same Db2 subsystem, or even between different Db2 subsystems, much faster than unloading and reloading the data. BCV5 can deliver speedy copies because it works directly at the VSAM level. And because it copies at the VSAM level, it can replace Db2-internal OBIDs with the correct target values. This is significantly more efficient than unloading and loading one row at a time, and it eliminates the need for the complicated, user-managed OBIDXLAT capability of DSN1COPY.
If you have used DSN1COPY in the past you know that it can be difficult to use; this is not the case with BCV5. With DSN1COPY you must specify a series of parameters that describe the input, such as the PIECESIZE, NUMPARTS, DSSIZE, whether it is a LOB table space or not, and more. BCV5 determines all required values automatically, making things a lot easier and less prone to failure.
And if you use LOB and XML data, and these days who doesn’t, BCV5 handles this data like any other, copying it at the same rate as regular table spaces.
BCV5 copies everything, not just the physical Db2 data, but also all of the associated structures including databases, table spaces, tables, indexes, and even views, triggers, aliases, synonyms, constraints, and so on! And you don’t need to worry if objects already exist; BCV5 will check for compatibility and keep the environment accurate. And all of the functionality you’d expect is there, such as the ability to rename objects between environments and to run the copy job either manually or via a job scheduler. Furthermore, you can interact with BCV5 using either an ISPF or a GUI interface.
Using BCV5, you can even use image copies as the source for your test data. BCV5 can use the most recent image copy, or an older image copy chosen by generation number, timestamp, or data set name pattern. BCV5 can automatically identify the correct image copy data sets and use them as the source for the data to be copied. You can even use BCV5 to refresh indexes using image copies of indexes if they exist.
Keeping Db2 statistics accurate can be another vexing test data issue. Generally speaking, you want to keep statistics up-to-date, but in test you probably want test statistics to mirror production. BCV5 can copy both RUNSTATS and RTS (Real Time Stats) directly from the source environment into the target. There is no need for a separate RUNSTATS job or to do a REORG in order to collect an RTS baseline.
And let’s not forget the most impressive aspect of BCV5, its speed and efficiency. BCV5 runs tasks in parallel with automatic workload balancing to further improve the performance of copying Db2 data. This efficiency comes in three forms: less CPU consumption, less elapsed run time, and a reduction in the management steps which can be automated instead of being done manually.
A case in point, a large automobile manufacturer uses BCV5 to manage its large Db2 test data environment consisting of over 11,000 table space partitions, another 11,000+ index partitions, and 20 LOBs. Before deploying BCV5 the company required hundreds of jobs that took almost 2 weeks to create, configure, and execute. After automating the process with BCV5, the entire process requires only 6 jobs that can refresh the test environments in 91 minutes. Impressive, no?
UBS Hainer markets other tools that augment and assist BCV5. For example, its In-Flight Copy add-on can enable BCV5 to get up-to-the-moment accurate data by gathering information from the Db2 log to make consistent copies of table spaces and indexes. It also offers a Reduction and Masking Data add-on to assist with enforcing privacy regulations in your test data. And BCV4 can be used to duplicate an entire Db2 subsystem.
The bottom line is that setting up test data can be difficult and time-consuming. Without a well-thought-out approach to gathering and refreshing test beds, application developers and quality assurance personnel will run into issues as they try to test Db2 code with corrupted or improper data. If your organization has issues with effectively managing test data for your Db2 for z/OS developers, take a look at UBS Hainer’s BCV5 solution for quickly copying and refreshing Db2 data.