
Thursday, July 06, 2023

Top 10 Db2 Performance Tips - No. 2: Optimize Database Design

A well-designed database schema forms the foundation of a high-performing and efficient IBM Db2 database... and therefore, also serves as the basic starting point for efficient Db2 applications. The importance of optimizing the database design cannot be overstated, as it directly impacts query performance, data integrity, and overall system efficiency.

The Logical Data Model

The first step toward a proper database design is the creation of a logical data model. Before implementing databases of any sort, it is imperative to first develop a sound model of the data to be used. Novice database developers frequently begin with the quick-and-dirty approach to database implementation. They approach database design from a programming perspective. Because novices often lack experience with databases and data requirements gathering, they attempt to design databases like the flat files they are accustomed to using. This is a major mistake. Indeed, most developers using this approach quickly discover problems after the databases and applications become operational in a production environment. At a minimum, performance will suffer and data may not be as readily available as required. At worst, data integrity problems and/or performance problems may arise, rendering the entire application unusable.

The goal of a data model is to record the data requirements of a business process. The scope of the data model for each line of business must be comprehensive. A data model serves as a lexicon for the data needs of the business... and as a blueprint for the physical implementation of the structures of the database.

A key component of building a sound data model is ensuring proper normalization.

Normalization

Normalization reduces data redundancy and inconsistencies by ensuring that the data elements are designed appropriately. A series of normalization rules are applied to the entities and data elements, each of which is called a “normal form.” If the data conforms to the first rule, the data model is said to be in “first normal form,” and so on.

A database design in First Normal Form (1NF) has no repeating groups, and each instance of an entity can be identified by a primary key. For Second Normal Form (2NF), every non-key data element must depend on the entire primary key, not just a portion of it. Third Normal Form (3NF) removes data elements that depend on other non-key elements rather than directly on the primary key. If the contents of a group of data elements can apply to more than a single entity instance, those data elements belong in a separate entity.

This is a quick and dirty introduction to normalization, and there are further levels of normalization not discussed here in order to keep the discussion moving along. For an introductory discussion of normalization visit http://wdvl.com/Authoring/DB/Normalization.

The bottom line is that normalization reduces data redundancy and improves data integrity by organizing data into logical entities and minimizing data duplication. By carefully analyzing the business requirements and applying normalization principles, database designers can create tables that are lean, efficient, and accurately represent the data model.
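
To make this a bit more concrete, here is a minimal sketch of what a small order-entry design might look like after normalization to 3NF (the table and column names are hypothetical, not from any real system):

-- Unnormalized: repeating item columns, customer data duplicated on every order
--   ORDERS (ORDER_NO, CUST_NAME, CUST_ADDR, ITEM1, QTY1, ITEM2, QTY2, ...)

-- Normalized to 3NF: each fact is stored once, in the entity it describes
CREATE TABLE CUSTOMER
  (CUST_NO    INTEGER      NOT NULL PRIMARY KEY,
   CUST_NAME  VARCHAR(40)  NOT NULL,
   CUST_ADDR  VARCHAR(100));

CREATE TABLE ORDERS
  (ORDER_NO   INTEGER      NOT NULL PRIMARY KEY,
   ORDER_DATE DATE         NOT NULL,
   CUST_NO    INTEGER      NOT NULL REFERENCES CUSTOMER);

CREATE TABLE ORDER_ITEM
  (ORDER_NO   INTEGER      NOT NULL REFERENCES ORDERS,
   ITEM_NO    INTEGER      NOT NULL,
   QTY        INTEGER      NOT NULL,
   PRIMARY KEY (ORDER_NO, ITEM_NO));

Customer data lives in one place, each order references its customer, and the repeating item data becomes rows in its own entity instead of columns.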

Relationships

Optimizing relationships between tables is another critical aspect of database design. Relationships, such as primary key-foreign key associations, define the logical connections between tables. This, too, should be evident in the logical data model, which is frequently depicted as an entity/relationship diagram.

Choosing appropriate indexing strategies, enforcing referential integrity, and carefully considering the cardinality and selectivity of relationships are crucial steps to ensure efficient query processing and join operations.

From Logical to Physical

Assuming you have a well-designed logical data model, the next task is transforming that logical model into an actual physical database. The first step is to create an initial physical data model by mapping the logical data model to a physical implementation based on an understanding of the DBMS being used for deployment. To successfully create a physical database design you will need a good working knowledge of the features of the DBMS, including:

  • In-depth knowledge of the database objects supported by the DBMS and the physical structures and files required to support those objects.

  • Details regarding the manner in which the DBMS supports indexing, referential integrity, constraints, data types, and other features that augment the functionality of database objects.

  • Detailed knowledge of new and obsolete features for particular versions or releases of the DBMS to be used.

  • Knowledge of the DBMS configuration parameters that are in place.

  • Data definition language (DDL) skills to translate the physical design into actual database objects.

Armed with the correct information, you can create an effective and efficient database from a logical data model. Begin by performing a simple translation from logical terms to physical objects. Of course, this translation alone will not result in a complete and correct physical database design; it is simply the starting point. The transformation consists of the following:

  • Transforming entities into tables

  • Transforming attributes into columns

  • Transforming domains into data types and constraints

Data Types

To support the mapping of attributes to table columns you will need to map each logical domain of the attribute to a physical data type and perhaps additional constraints. In a physical database, each column must be assigned a data type. 

Selecting appropriate data types is vital for optimizing database design. The right data types can have a significant impact on storage requirements, query performance, and overall system efficiency. Choose data types that accurately represent the data and minimize storage overhead: use integer types instead of character types for numeric values, make sure date and time data use the appropriate date/time data types, and choose wisely among the various text and character data types for each column. Doing so improves data integrity, optimizes storage utilization, and speeds query execution.
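
For example, here is a simple sketch of the idea (hypothetical table and column names); the second version stores more compactly and lets Db2 validate the values itself:

-- Less desirable: everything stored as character data
--   EMP (EMP_NO CHAR(6), LAST_NAME CHAR(40), HIRE_DATE CHAR(10),
--        SALARY CHAR(9), DEPT_NO CHAR(6))

-- Preferable: data types that match the domain of each column
CREATE TABLE EMP
  (EMP_NO     INTEGER       NOT NULL,
   LAST_NAME  VARCHAR(40)   NOT NULL,
   HIRE_DATE  DATE          NOT NULL,
   SALARY     DECIMAL(9,2)  NOT NULL,
   DEPT_NO    SMALLINT      NOT NULL);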

Constraints

You will also need to implement appropriate constraints, such as primary keys, unique constraints, and foreign keys, which enhance both data integrity and query performance. Additional constraints, such as check constraints and nullability, enable the DBMS to enforce data integrity directly instead of leaving it to application code written at a later time. Constraints enforce data consistency rules, ensure referential integrity, and provide the optimizer with valuable information for query optimization.
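
Continuing the sketch above, the constraints being described might be added along these lines (again, hypothetical names):

CREATE TABLE DEPT
  (DEPT_NO    SMALLINT     NOT NULL PRIMARY KEY,
   DEPT_NAME  VARCHAR(40)  NOT NULL);

-- (on Db2 for z/OS, a unique index on EMP_NO must exist before the primary key can be added)
ALTER TABLE EMP
  ADD CONSTRAINT EMP_PK PRIMARY KEY (EMP_NO);

ALTER TABLE EMP
  ADD CONSTRAINT EMP_DEPT_FK FOREIGN KEY (DEPT_NO) REFERENCES DEPT;

ALTER TABLE EMP
  ADD CONSTRAINT SALARY_CHK CHECK (SALARY >= 0);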

An Iterative Process

It is worth mentioning that database design optimization is an iterative process that should consider not only the current requirements but also the future growth and scalability of the system. Regularly reviewing and revisiting the database design as the application evolves can help identify areas for improvement and ensure that the database remains optimized over time.

Finally...

In conclusion, a well-designed database schema is fundamental to achieving optimal performance for your database applications. By focusing on strategies such as normalization, relationship optimization, appropriate data types, and constraints, database designers can create a robust and efficient database environment. Optimizing the database design not only enhances query performance and data integrity but also lays the groundwork for scalability and adaptability as the system evolves.

Tuesday, October 19, 2021

How Do You Store Boolean Data in Db2?

Recently, I was reading an article written by Ben Brumm on using Boolean data in SQL and it got me thinking about Db2 and Boolean data. You see, if you click over to the article, you'll see that Ben talks about several popular database systems, but not Db2!

So first of all, what is Boolean data? Sometimes all you want to do is to be able to store a Yes/No, True/False, 1/0 value in a column of a Db2 table. But that can be problematic. What data type should you choose? How should you model that column? Does it matter?

Well, up until recently, Db2 did not support a Boolean data type. But I guess I should be more clear: Db2 LUW offers Boolean support, but Db2 for z/OS does not have a built-in Boolean data type.

As of Db2 LUW V9.7, you can use a Boolean data type for local variables, global variables, parameters, or return types in compound SQL (compiled) statements, so at that point it was really only a viable option for SQL PL. As of V11.1, you can use the Boolean data type for a column in a Db2 table.

OK, so what should you do if the Db2 you are currently using does not offer Boolean support? In my opinion, the best approach is a numeric solution with 1 as TRUE and 0 as FALSE. You might consider using a character column with YES/NO or TRUE/FALSE values, but there are several issues with that approach. First, it will likely consume more storage. Second, without a constraint on the column, users might specify “YES” or “Yes” or “y” or “Y”… or other variations meaning the same thing. And finally, if English is not your users' first language, they may have the urge to specify “Si” or “Da” or… well, you get the idea.

A good choice for a Boolean data type could be BINARY(1) which only stores binary data, so no constraint is needed. Or you might want to specify SMALLINT with a check constraint only allowing 0 and 1 as options.

If you want to allow "Unknown" as an option, then you can make the column nullable.
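
Here is a rough sketch of both approaches (hypothetical table and column names); pick the one that matches the Db2 platform and version you are running:

-- Db2 LUW 11.1 and later: a native BOOLEAN column
CREATE TABLE CUST_PREF
  (CUST_NO   INTEGER  NOT NULL,
   OPTED_IN  BOOLEAN  NOT NULL);

-- Db2 for z/OS (or older Db2 LUW): SMALLINT constrained to 0 and 1;
-- leaving the column nullable lets NULL represent "unknown"
CREATE TABLE CUST_PREF2
  (CUST_NO   INTEGER   NOT NULL,
   OPTED_IN  SMALLINT,
   CONSTRAINT OPTED_IN_CHK CHECK (OPTED_IN IN (0, 1)));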

Summary

So, share your experiences with us here on the blog. Tell us, have you implemented Boolean data in your Db2 tables? How did you do it? And how is it working out for you? 

Friday, July 26, 2013

Top Ten Most Common DB2 Performance Problems


  1. PEBCAK 

    The number one cause of DB2 performance problems today, as always, is Problem Exists Between Chair And Keyboard!
     
  2. Poorly coded SQL

    Many performance problems can be traced back to inappropriately coded SQL. Code it correctly from the beginning and tune what is already out there.
         
  3. Improper indexing

    Optimize performance via indexing by workload, not by object.  
     
  4. Bad program design

    Coding DB2 SQL for Performance in your application programs from the outset can eliminate many future problems.


  5. Bachelor programming syndrome

    Yes, the dreaded "Fear of Committing" can cause performance problems due to concurrency issues.
         
  6. Improperly defined buffer pools

    Defining effective buffer pools for your DB2 workload is important. There are many things you can do to identify the proper settings and sizing of your DB2 buffer pools.
         
  7. Index / table space needs to be reorganized

    Reorganization of indexes and table spaces can improve performance. Follow the Five R's to assure optimal DB2 application performance.
     
  8. Improperly designed database structures

    Designing database structures correctly -- from the beginning -- is the way to go for efficiency and efficacy.
     
  9. Copied code syndrome

     
  10. RUNSTATS not up-to-date (or not even run)

    How can you expect the DB2 Optimizer to do its thing on your SQL without statistics about your data and environment? Again, The Five R's!

And if a Top Ten list does not provide enough detail for you, then splurge for a copy of my book: DB2 Developer's Guide, 6th edition. Recently updated for DB2 10 for z/OS, the book delivers over 1600 pages full of DB2 tips, tricks, guidelines, and details...  It rates 5 stars on Amazon!

Sunday, June 19, 2011

The PIECESIZE Clause

One of the more troubling aspects of DB2 database design and creation is the non-partitioning index (NPI). Creating NPIs on tables in a partitioned table space can pose management and performance issues. Partitioned table spaces tend to be large and by their very design will span multiple underlying data sets. Any partitioning indexes will also span multiple data sets. But what happens when you need to define non-partitioning indexes on a table in a partitioned table space?

The PIECESIZE clause of the CREATE INDEX statement can be used during index creation to break an NPI into several data sets (or "pieces"). More accurately, the PIECESIZE clause specifies the largest data set size for a non-partitioned index. PIECESIZE can be specified in kilobytes, megabytes, or gigabytes. For example, the following statement will limit the size of individual data sets for the XACT2 index to 256 megabytes:

CREATE UNIQUE INDEX DSN8910.XACT2
ON DSN8910.ACT (ACTKWD ASC)
USING STOGROUP DSN8G910
PRIQTY 65536
SECQTY 8192
ERASE NO
BUFFERPOOL BP0
CLOSE NO
PIECESIZE 256M;

Basically, PIECESIZE is used to enable NPIs to be created on very large partitioned table spaces. It breaks apart the NPI into separate pieces that can be somewhat managed individually. Without PIECESIZE, NPIs would be quite difficult to manage and administer. Keep in mind, though, that PIECESIZE does not magically partition an NPI based on the partitioning scheme of the table space. This is a misperception held by some. So, if you have a partitioned table space with 4 partitions and then create an NPI with 4 pieces, the data in the NPI pieces will not match up with the data in the 4 partitions.

When using PIECESIZE, more data sets will be created and therefore you can obtain greater control over data set placement. Placing the pieces on separate disk devices can help to reduce I/O contention for SQL operations that access NPIs during read or update processing. The elapsed time improvement may be even greater when multiple tasks are accessing the NPI.

Separating the NPI into pieces allows for better performance of INSERT, UPDATE and DELETE processes by eliminating bottlenecks that can be caused by using only one data set for the index. The use of pieces also improves concurrency and performance of heavy INSERT, UPDATE, and DELETE processing against any size partitioned table space with NPIs.

Keep in mind that PIECESIZE is only a specification of the maximum amount of data that a piece (that is, a data set) can hold and not the actual allocation of storage, so PIECESIZE has no effect on primary and secondary space allocation. Each data set will max out at the PIECESIZE value, so specifying PRIQTY greater than PIECESIZE will waste space. But also make sure that you avoid setting the PIECESIZE too small. A new data set will be allocated each time the PIECESIZE threshold is reached. DB2 will increment the A001 component of the data set name each time. Ideally, the value of your primary quantity and secondary quantities should be evenly divisible into PIECESIZE to avoid wasting space.

To choose a PIECESIZE value, divide the overall size of the entire NPI by the number of data sets that you wish to have. For example, for an NPI that is 8 megabytes, you can arrive at 4 data sets for the NPI by specifying PIECESIZE 2M. Of course, if your NPI grows over 8 megabytes in total you will get additional data sets. Keep in mind that 32 pieces is the limit if the underlying table space is not defined with DSSIZE 4G or greater. The limit is 254 pieces if the table space is defined as DSSIZE 4G or greater.
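
To put the arithmetic from that 8 megabyte example into DDL, here is a quick sketch (a hypothetical additional NPI on the same sample ACT table):

-- Estimated total size of the NPI: 8 MB; desired number of data sets: 4
-- 8 MB / 4 = PIECESIZE 2M
CREATE INDEX DSN8910.XACT3
  ON DSN8910.ACT (ACTDESC ASC)
  PIECESIZE 2M;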

Monday, April 04, 2011

What About Surrogate Keys?

As is so often the case with my blog, today's topic came about as the result of an e-mail question I received from a DBA I know. His question was this:

"A great debate rages here about the use of ‘synthetic’ keys. We read all sorts of articles on the wild wild web but none seem to address the database performance impacts of designs using synthetic keys. I wondered if you could point me to any information on this…"

If you've ever Googled the term "surrogate key" you know the hornet's nest of opinions that swirls around "out there" about the topic. For those who haven't heard the term, here is my attempt at a quick summary: a surrogate key is a generated unique value that is used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.

And here is the response I sent to my e-mail inquisitor:

I doubt that there is any “final word” on this topic. It has been raging for years and years; some folks are pro, others con. This Wikipedia article offers up a nice start: http://en.wikipedia.org/wiki/Surrogate_key

However, when I get to the performance area of this article I don’t think I agree. The article puts a lot of emphasis on there being fewer columns to join and therefore better performance. If you’ve got an index on those multiple columns, how much “worse” will the performance be, really? Sure, the SQL is more difficult to write, but will a join over 4 or 5 indexed columns perform that much worse than a join on one indexed column? I suppose as the number of columns required for the natural key increases, the impact could be greater (e.g., 10 columns???).

I guess I can see the argument if you are swapping a variable length key with a surrogate having a fixed length key – that should improve things!

Furthermore, consider this: the natural key columns are still going to be there; after all, they are naturally part of the data, right? So the surrogate (synthetic) key gets added to each row. This will likely reduce the number of rows per page (maybe not, but probably). And that, in turn, will negatively impact the performance of sequential access because more I/O will be required to read the “same” number of rows.

And what about the impact of adding data? If there are a significant number of new rows being added at the same time by different processes there will be locking issues as they all try to put the new data on the same page (unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp [that must be tested to avoid duplicates]).
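
For what it's worth, if you do go the surrogate route in Db2, an identity column (or a SEQUENCE object) is the usual way to generate the value. Here is a minimal sketch, with hypothetical names, that also keeps the natural key columns around and enforced:

CREATE TABLE CUSTOMER
  (CUST_ID    INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
   TAX_ID     CHAR(9)      NOT NULL,
   COUNTRY    CHAR(2)      NOT NULL,
   CUST_NAME  VARCHAR(40)  NOT NULL,
   PRIMARY KEY (CUST_ID),
   CONSTRAINT CUST_NAT_KEY UNIQUE (TAX_ID, COUNTRY));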

The one thing that usually causes me to tend to favor natural keys is just that – they are natural. If the data is naturally occurring it becomes easier for end users to remember it and use it. If it is a randomly generated surrogate, nobody will actually know the data. Yes, this can be mitigated to a great degree by the manner in which you build your applications to access the data, but ad hoc access becomes quite difficult.

I guess the bottom line is that “it depends” on a lot of different things! No surprise there, I suppose.


What do you think about natural keys versus surrogate keys? Surely some readers here have an opinion on this topic! If so, post them as comments...

Thursday, April 22, 2010

The Ever-Changing Role of the DBA

Defining the job of DBA is getting to be increasingly difficult. Oh, most people know the rudimentary aspects of the job, namely keeping your organization's databases and applications running up to par. The DBA has to be the resident DBMS expert (whether that is DB2, Oracle or SQL Server, or most likely a combination of those). He or she has to be able to solve thorny performance problems, ensure backups are taken, recover and restore data when problems occur, make operational changes to database structures and, really, be able to tackle any issue that arises that is data-related.

The technical duties of the DBA are numerous. These duties span the realm of IT disciplines from database design to physical implementation and consistent, on-going monitoring of the database environment.

DBAs must possess the abilities to create, interpret, and communicate a logical data model and to create an efficient physical database design from a logical data model and application specifications. There are many subtle nuances involved that make these tasks more difficult than they sound. And this is only the very beginning. DBAs also need to be able to collect, store, manage, and query data about the data (metadata) in the database and disseminate it to developers that need the information to create effective application systems. This may involve repository management and administration duties, too.

After a physical database has been created from the data model, the DBA must be able to manage that database once it has been implemented. One major aspect of this management involves performance management. A proactive database monitoring approach is essential to ensure efficient database access. The DBA must be able to utilize the monitoring environment, interpret its statistics, and make changes to data structures, SQL, application logic, and the DBMS subsystem to optimize performance. And systems are not static, they can change quite dramatically over time. So the DBA must be able to predict growth based on application and data usage patterns and implement the necessary database changes to accommodate the growth.

And performance management is not just managing the DBMS and the system. The DBA must understand SQL used to access relational databases. Furthermore, the DBA must be able to review SQL and host language programs and to recommend changes for optimization. As databases are implemented with triggers, stored procedures, and user-defined functions, the DBA must be able to design, debug, implement, and maintain these code-based database objects as well.

Additionally, data in the database must be protected from hardware, software, system, and human failures. The ability to implement an appropriate database backup and recovery strategy based on data volatility and application availability requirements is required of DBAs. Backup and recovery is only a portion of the data protection story, though. DBAs must be able to design a database so that only accurate and appropriate data is entered and maintained - this involves creating and managing database constraints in the form of check constraints, rules, triggers, unique constraints, and referential integrity.

DBAs also are required to implement rigorous security schemes for production and test databases to ensure that only authorized users have access to data. As industry and governmental regulations multiply, the need to audit who did what to which data when is also a requirement for sensitive data in production systems – and the DBA must be involved in ensuring data auditability without impacting availability or performance.

And there is more! The DBA must possess knowledge of the rules of relational database management and the implementation of many different DBMS products. Also important is the ability to accurately communicate them to others. This is not a trivial task since each DBMS is different from the others and many organizations have multiple DBMS products (e.g., DB2, Oracle, SQL Server).

Remember, too, that the database does not exist in a vacuum. It must interact with other components of the IT infrastructure. As such, the DBA must be able to integrate database administration requirements and tasks with general systems management requirements and tasks such as network management, production control and scheduling, and problem resolution, to name just a few systems management disciplines. The capabilities of the DBA must extend to the applications that use databases, too. This is particularly important for complex ERP systems that interface differently with the DBMS. The DBA must be able to understand the requirements of the application users and to administer their databases to avoid interruption of business. This includes understanding how any ERP packages impact the business and how the databases used by those packages differ from traditional relational databases.

But Things Are Changing

So at a high level, DBAs are tasked with managing and assuring the integrity and efficiency of database systems. But keep in mind, too, that there are actually many different DBAs. Some focus on logical design; others focus on physical design; some DBAs specialize in building systems and others specialize in maintaining and tuning systems; and there are specialty DBAs and general-purpose DBAs. Truly, the job of DBA encompasses many roles.

Some organizations choose to split DBA responsibilities into separate jobs. Of course, this occurs most frequently in larger organizations, because smaller organizations often cannot afford the luxury of having multiple, specialty DBAs.

Still other companies simply hire DBAs to perform all of the tasks required to design, create, document, tune, and maintain the organization’s data, databases, and database management systems.

But no matter what "type" of DBA you happen to be, chances are that your role is changing and adapting to new types of computing and data requirements. Indeed, one of the biggest challenges for DBAs these days is the ongoing redefinition of the job roles and responsibilities.

The primary role of database "custodian," of course, continues to be the main emphasis of the job. But that is no longer sufficient for most organizations. The DBA is expected to take on numerous additional -- mostly technical -- roles. These can include writing application code, managing the application server, enterprise application integration, managing Web services, network administration and more.

If you compare the job description of DBAs across several organizations, it is likely that no two of them would match exactly. This is both good and bad. It is good because it continually challenges the technically-minded employees who tend to become DBAs. But it can be bad, too; because the job differs so much from company to company, it becomes more difficult to replace a DBA who leaves or retires. And no one can deny that database administration is a full-time, stressful job all on its own. But the stress level just keeps increasing as additional duties get tacked onto the DBA's "to do" list.

Summary

There are many jobs that DBAs perform and it can be confusing when you try to match job title up against the responsibilities of the job. Don't let your job title keep you from expanding into other, related disciplines. The more you know and the more you can do, the more employable you become... and that is important in this day and age!

Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one example: certain types of database maintenance can cause such problems, and DBAs need to be able to recognize them and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers; many DBMSs use pointers to store very large objects containing text and image data, and these, too, can become corrupted. In today's modern database systems, however, structural integrity problems are rare -- much rarer than they used to be.

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way to making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8) how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE then the DBMS would take care of it -- and more of the data would be likely to be correct.
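
A quick sketch of the difference (hypothetical table and values):

CREATE TABLE SALE
  (SALE_NO    INTEGER  NOT NULL,
   SALE_DATE  DATE     NOT NULL);

INSERT INTO SALE VALUES (1, '2009-06-15');   -- accepted
INSERT INTO SALE VALUES (2, '2009-99-99');   -- rejected by the DBMS as an invalid date

-- Had SALE_DATE been defined as CHAR(8), the value '20099999' would have been
-- stored without complaint, and every program would have to validate it.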

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitutes a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.
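
For example, here is a short sketch (hypothetical tables) of a referential constraint with a delete rule, plus a check constraint that dictates the valid values for a status column:

CREATE TABLE CUST
  (CUST_NO   INTEGER      NOT NULL PRIMARY KEY,
   CUST_NAME VARCHAR(40)  NOT NULL);

CREATE TABLE ORD
  (ORDER_NO  INTEGER      NOT NULL PRIMARY KEY,
   CUST_NO   INTEGER      NOT NULL,
   ORD_STAT  CHAR(1)      NOT NULL,
   CONSTRAINT ORD_CUST_FK  FOREIGN KEY (CUST_NO)
              REFERENCES CUST ON DELETE RESTRICT,
   CONSTRAINT ORD_STAT_CK  CHECK (ORD_STAT IN ('N', 'S', 'C')));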

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Monday, November 03, 2008

On Date Formats, Part 2

Here is a follow-up question and answer based on my previous blog post:


Q: My format does not fit into any of the formats listed in the DB2 manuals. What if I have a DATE stored like YYYYMMDD (with no dashes or slashes) and I want to compare it to a DB2 date?


A: Okay, let's look at one potential solution to your problem (and then I want to briefly talk about the use of proper data types). First of all you indicate that your date column contains dates in the following format: yyyymmdd with no dashes or slashes. You do not indicate whether this field is a numeric or character field - I will assume that it is character. If it is not, you can use the CHAR function to convert it to a character string.


Then, you can use the SUBSTR function to break the character column apart into the separate components, for example SUBSTR(column,1,4) returns the year component, SUBSTR(column,5,2) returns the month, and SUBSTR(column,7,2) returns the day.


Then you can concatenate all of these together into a format that DB2 recognizes, for example, the USA format, which is MM/DD/YYYY. This can be done as follows:


SUBSTR(column,5,2) || '/' || SUBSTR(column,7,2) || 
'/' || SUBSTR(column,1,4)


Then you can use the DATE function to convert this character string into a DATE that DB2 will recognize. This is done as follows:


DATE(SUBSTR(column,5,2) || '/' || SUBSTR(column,7,2) || 
'/' || SUBSTR(column,1,4))


The result of this can be used in date arithmetic with other dates or date durations. Of course, it may not perform extremely well, but it should return the results you desire.
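
For example (hypothetical table and column names), the converted value can participate in a date comparison or duration arithmetic just like a real DATE column:

SELECT ORDER_NO
FROM   ORD_HIST
WHERE  DATE(SUBSTR(ORD_DT_CHR,5,2) || '/' || SUBSTR(ORD_DT_CHR,7,2) ||
            '/' || SUBSTR(ORD_DT_CHR,1,4)) > CURRENT DATE - 30 DAYS;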


Now, a quick word about using proper data types. I say this all of the time, but there are many applications and implementations "out there" that do not heed the advice: it is wise to use the DATE data type when you store dates in DB2 tables. It simplifies life later on when you want to do things like formatting dates and performing date arithmetic.


Using the appropriate data type also ensures that DB2 will perform the proper integrity checks on the columns when data is entered, instead of requiring application logic to ensure that valid dates are entered.

Thursday, September 04, 2008

Database Performance and Row Size

Recently I was reading through some posts in a database-related newsgroup or mailing list (actually, right now I can't remember which one it was). The conversation I was reading was in response to a question like "Does the number of columns or size of the row matter in terms of performance?"

Actually, the question asked what kind of a performance impact might be expected if a query was issued against two similar tables. The first table had (say) 20 columns, and the second table had the same 20 columns, as well as 35 additional columns.

Well, most of the basic responses were similar. The consensus was that as long as the query was going against the same columns then performance should be about the same. I disagree. Here is why.

You also need to factor in the I/O requests that are required to return the data. The DBMS performs I/O at the block (or page) level, whether it returns one row or millions of rows. For multi-row results, accessing data from the table with the wider row (more columns) will usually be less efficient, because fewer rows will fit on each page (the 20-column row is smaller than the 55-column row, so more of the narrower rows can reside in a single, pre-sized block/page). The bigger the result set, the more pronounced the performance degradation can be, because more physical I/Os are required to retrieve the data.

Think about it this way. Is it faster to pull smaller peaches out of a basket than bigger peaches? That is about the same type of question and anybody can envision the process. Say you want 100 peaches. Small peaches fit 25 per basket; big peaches fit ten per basket. To get 100 small peaches you'd need to pull 4 baskets from the shelf. To get 100 big peaches you'd need to pull 10 baskets from the shelf. The second task will clearly take more time.
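
To put some rough, purely assumed numbers on the two tables in the question (4K pages, guessed row sizes):

  20-column row at ~200 bytes  -->  roughly 20 rows per page  -->  10,000 rows read ~500 pages
  55-column row at ~800 bytes  -->  roughly  5 rows per page  -->  10,000 rows read ~2,000 pages

Four times as many pages means roughly four times the I/O to return the very same rows.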

Of course, the exact performance difference is difficult to calculate - especially over an online forum and without knowledge of the specific DBMS being used. But there will, more than likely, be a performance effect on queries when you add columns to a table.