
Ola Sk

Database normalization may be harmful to efficiency in large-scale analytics projects.

If the SELECT queries against your relational database return large result sets, and especially if they use one or more JOINs, let me guess: fetching the resulting table wasn't quick. A JOIN adds a computation step for each record fetched, and those steps accumulate over every row for which matching records from a linked table have to be looked up. The more joins there are in the query, the worse it gets. This is a computational expense you can often avoid with a denormalized table design.
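To make that cost concrete, here is a minimal sketch assuming a hypothetical customers/orders schema and SQLite as the engine: the normalized aggregate has to look up the customer for every order row, while a denormalized copy of the country column answers the same question from a single table.

```python
import sqlite3

# Hypothetical schema, for illustration only: an orders table, a customers
# table, and a denormalized copy of the country column on orders.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL,
        customer_country TEXT  -- denormalized copy of customers.country
    );
""")

# Normalized read: every returned order row is matched against customers.
normalized = """
    SELECT c.country, SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.country
"""

# Denormalized read: the same aggregate comes from a single table scan.
denormalized = """
    SELECT customer_country, SUM(amount)
    FROM orders
    GROUP BY customer_country
"""

for label, query in (("normalized", normalized), ("denormalized", denormalized)):
    print(label)
    for row in con.execute("EXPLAIN QUERY PLAN " + query):
        print("   ", row)  # the normalized plan shows the extra per-row lookup
```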

To decide which design to use to balance the number of computations against the storage efficiency of your database, you need to consider how frequent UPDATE and SELECT operations are, and how many of the SELECTs will return a large number of records.

If you never analyze a certain feature of a dataset (stored in its own column) separately, in other words, if that data has only secondary relevance, you may want to consider merging that column with another column of similar characteristics.
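For example (a hypothetical shipments table, sketched with SQLite): if street and postcode are only ever read back together and never analyzed separately, they can be merged into one address column.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Separate columns: reasonable if street or postcode is ever analyzed on its own.
con.execute("""CREATE TABLE shipments_split (
    id INTEGER PRIMARY KEY, street TEXT, postcode TEXT, weight REAL)""")

# Merged column: enough when the address has only secondary relevance and is
# always read back as a whole.
con.execute("""CREATE TABLE shipments_merged (
    id INTEGER PRIMARY KEY, address TEXT, weight REAL)""")

con.execute("INSERT INTO shipments_merged (address, weight) VALUES (?, ?)",
            ("221B Baker Street, NW1 6XE", 1.2))
print(con.execute("SELECT address, weight FROM shipments_merged").fetchall())
```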

To back up the decision to merge columns or add a summary column, consider:

  • whether the data you want to merge changes frequently. UPDATE queries, just like SELECT queries, have a price in terms of the number of computations performed. You "pay" for one or the other: either you update the summary column every time a value it depends on changes, or the users of your database wait longer while their queries JOIN the pieces of data back together.
  • the other thing to consider is how many records your analytical queries will typically return. Think about the questions that may come up, the queries they would require in different analysis scenarios, and the size of the result sets those queries would produce. If most of the queries return a lot of data, you can save time by keeping the records in fewer cells that the RDBMS does not need to JOIN before returning the results (see the sketch after this list).
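Here is the trade-off from the list above as a minimal sketch (hypothetical customers/orders tables with a total_spent summary column, maintained from application code): each write pays for an extra UPDATE, and in exchange the read needs neither a JOIN nor an aggregation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, name TEXT, total_spent REAL DEFAULT 0);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

def add_order(customer_id, amount):
    """Write path: every insert also pays for the summary-column UPDATE."""
    with con:  # one transaction, so the summary never drifts from the orders
        con.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                    (customer_id, amount))
        con.execute("UPDATE customers SET total_spent = total_spent + ? WHERE id = ?",
                    (amount, customer_id))

con.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
add_order(1, 19.99)
add_order(1, 5.50)

# Read path: the summary is precomputed, so no JOIN and no SUM() at query time.
print(con.execute("SELECT total_spent FROM customers WHERE id = 1").fetchone())
```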

To sum this up, choosing the design, that is, deciding whether to merge columns or add a summary column (denormalization practices), aims at minimizing query response time, which means minimizing the number of computations needed. It comes down to a trade-off between a normalized and a denormalized design, which should be chosen carefully and be determined mainly by:

  • the frequency of changes (UPDATEs) in the data, each of which costs extra computation if the table contains a summary column that must be updated whenever a value it depends on changes,
  • and the size and frequency of SELECT queries, especially the big ones.

Then again, analytical databases specifically usually do not see many updates, so I conclude that this type of database can benefit hugely from careful denormalization.

Top comments (2)

alexander-supertramp

check for indexes
work with the smallest data set required and parallelize the workload

Ola Sk • Edited

Agreed, proper indexing is the next point to add and to make sure is done right.