In today’s article, we will learn about Normalization and Denormalization, two of the most commonly asked topics in database design. They come up frequently in interviews for positions such as Data Engineer, and understanding them helps you improve SQL performance.

What is Normalization?

Normalization is the process of eliminating redundant data in a database so that the data is more organized and consistent. The goals are to reduce data duplication, ensure data integrity, and make tasks like updates and maintenance easier.

Example

Let’s say we have a dataset containing the courses taken by students at a university and the names of the instructors for these courses, and we want to normalize it. Normalization is performed in several stages, which are called normal forms. An example of this dataset is provided in Table 1.


Table 1. Data of courses taken by students at the university.

1NF (1st Normal Form — Using Atomic Values)

  • In the first normal form, each column must be atomic, meaning it should contain a single value.
  • There should be no repeating columns, for example, students’ courses should not be in columns like course1, course2, course3.
  • Each row must be unique.

In our example, Table 1 satisfies 1NF.
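To make the atomicity rule concrete, here is a minimal Python sketch of fixing a 1NF violation: a cell that holds several courses at once is split into one row per (student, course) pair. The student names, IDs, and course values below are illustrative placeholders, not data from Table 1.

```python
# A 1NF fix sketch: split a non-atomic "courses" cell into atomic rows.
raw_rows = [
    ("S1", "Alice", "Math, Physics"),  # violates 1NF: two values in one cell
    ("S2", "Bob", "Chemistry"),
]

atomic_rows = [
    (student_id, name, course.strip())
    for student_id, name, courses in raw_rows
    for course in courses.split(",")
]

# Each row now holds exactly one course value.
print(atomic_rows)
```

After the split, every column of every row contains a single value, and the repeating-columns pattern (course1, course2, …) is avoided entirely.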

2NF (2nd Normal Form — Removing Partial Dependencies)

  • 1NF must be satisfied.
  • A primary key should be defined for the table.
  • Every non-key column must depend on the entire primary key, not just part of it — partial dependencies are removed.

In our example, we can choose the combination of “Student ID” and “Course ID” as the primary key. Below, you can find the 2NF tables created by removing partial dependencies.

Table 2. Students table.

Table 3. Courses table.

Table 4. Student-Course relationship table.
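The three-table split can be sketched in SQLite, with the composite primary key from the text on the relationship table. The column names and sample rows below are illustrative assumptions, since Tables 2–4 are shown as images in the article.

```python
import sqlite3

# Sketch of the 2NF split: students, courses, and an enrollments table
# whose composite primary key is (student_id, course_id).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE students (student_id TEXT PRIMARY KEY, student_name TEXT);
CREATE TABLE courses  (course_id  TEXT PRIMARY KEY, course_name TEXT, instructor TEXT);
CREATE TABLE enrollments (
    student_id TEXT REFERENCES students(student_id),
    course_id  TEXT REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
cur.execute("INSERT INTO students VALUES ('S1', 'Alice')")
cur.execute("INSERT INTO courses VALUES ('C1', 'Math', 'Dr. Smith')")
cur.execute("INSERT INTO enrollments VALUES ('S1', 'C1')")

# Reconstructing the wide view of Table 1 now requires JOINs:
cur.execute("""
SELECT s.student_name, c.course_name, c.instructor
FROM enrollments e
JOIN students s ON s.student_id = e.student_id
JOIN courses  c ON c.course_id  = e.course_id
""")
print(cur.fetchall())  # [('Alice', 'Math', 'Dr. Smith')]
```

Note the trade-off already visible here: each fact is stored once, but reading the combined view costs two JOINs.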

What is Denormalization?

Denormalization deliberately reintroduces redundancy into a database. When speed and read performance matter more than storage cost and update simplicity, denormalization is more suitable than normalization: it allows us to perform fewer queries and achieve faster read operations.

Table 5. Denormalized table.

In Table 5, you can see the denormalized version of Table 1. Here, all the data related to the courses taken by a student is present in each row. To increase the level of denormalization, additional columns like “Course 2 Name”, “Course 2 Instructor”, etc. can be added, keeping the extra course information in the same row for each student. This improves query performance, but the storage cost increases due to data redundancy.
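The denormalized layout can be sketched the same way: course name and instructor are copied into every enrollment row, so a read needs no JOIN at all. The table name, columns, and values here are illustrative assumptions.

```python
import sqlite3

# Denormalized sketch: course data is duplicated per row (the storage
# cost of denormalization), so reads are a single table scan.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE student_courses (
    student_id   TEXT,
    student_name TEXT,
    course_name  TEXT,
    instructor   TEXT
)
""")
cur.executemany(
    "INSERT INTO student_courses VALUES (?, ?, ?, ?)",
    [("S1", "Alice", "Math", "Dr. Smith"),
     ("S1", "Alice", "Physics", "Dr. Jones")],
)

# One table, no JOINs:
cur.execute("SELECT course_name, instructor FROM student_courses WHERE student_id = 'S1'")
print(cur.fetchall())
```

Notice that if Dr. Smith’s name changes, every row mentioning that course must be updated — exactly the redundancy the text warns about.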

So, Which One Should Be Used?

If data integrity is very important and updates are frequent, normalization should be preferred. In a banking system, customer, account, and transaction information are stored using normalization.

For example, a customer can have multiple bank accounts, and transactions made through each account are recorded separately. If customer information is repeated in every transaction record, every time a change occurs, all transaction records would need to be updated, which creates a major issue. Therefore, normalization is applied and customer information is stored in a separate table.
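The banking argument can be shown with a small sketch: because the customer’s details live in exactly one row, a change is a single UPDATE, regardless of how many transaction records reference that customer. The schema and values are illustrative assumptions.

```python
import sqlite3

# Normalized sketch for the banking example: customer data is stored
# once and referenced by many transactions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, address TEXT);
CREATE TABLE transactions (
    tx_id       INTEGER PRIMARY KEY,
    customer_id TEXT REFERENCES customers(customer_id),
    amount      REAL
);
""")
cur.execute("INSERT INTO customers VALUES ('C1', 'Old Street 1')")
cur.executemany("INSERT INTO transactions VALUES (?, 'C1', ?)",
                [(1, 100.0), (2, 250.0), (3, 75.5)])

# The address changes once, in one place; all transactions stay consistent.
cur.execute("UPDATE customers SET address = 'New Street 9' WHERE customer_id = 'C1'")
cur.execute("""
SELECT DISTINCT c.address
FROM transactions t JOIN customers c ON c.customer_id = t.customer_id
""")
print(cur.fetchall())  # [('New Street 9',)]
```

Had the address been copied into all three transaction rows, the same change would have required three updates and risked leaving the records inconsistent.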

If queries need to run quickly and JOIN operations should be avoided, denormalization should be preferred — for instance, on an e-commerce site where order details are frequently queried.

For example, combining customer, order, and product tables to view a customer’s order history can take time. Instead, order details can be stored in a single table along with the customer’s name, address, and product information. This speeds up read operations but leads to data redundancy. Since read operations are much more frequent than updates in e-commerce systems, the trade-off provides a performance advantage.

Conclusion

In this article, we have learned about the concepts of Normalization and Denormalization.

If you enjoyed this content, feel free to follow me and share this article to help more people learn. Thanks for your support! 🙌

How did you find this article?