Using a story-telling example, I would like to differentiate between schema on read and schema on write.
Ria is running a retail shop. She had to store a lot of information on products, sales, customers and many more. Lets focus on the use case of product table here.
When she started storing the data, she created a product table with the following fields - 'product id' of type number which is the primary key, 'product name' of type text with unique constraint, 'cost' of type float, 'category' of type text, 'base discount percentage' of type float.
Six months had passed and she had about a few thousand records in it. To expand her customer base, she introduced promotional discount for her products. She had to alter the table definition to add a new column 'promotional discount percentage' of type float.
After a few months, she realized that it would be beneficial to store the supplier information in the product table. Again, she had to alter the table to add a new column 'supplier'.
The next day, she was talking with her data analyst, Tim. Tim helps in calculating the recommended price of products based on various factors. After talking, she realized that Tim did not need the supplier column, yet the table he used for analysis had the supplier column. She felt it was unnecessary to have the columns one does not need. But she could not remove that column since she needed it for logistics purposes. She was perplexed.
Later, one day, she was taken aback to know that product name is not unique. It looked like product names can repeat with different categories. Because of the unique constraint, she was not able to put some product data into the table. She had to remove the unique constraint from the table immediately to avoid data loss.
She had to face similar problems frequently until she met her friend Amy. Amy patiently listened to her problem. She explained to Ria that she has been following 'schema on write' approach for her data. She added that 'schema on read' would have been a better fit for her use case.
'Schema on write' approach is where we have a predefined schema for the data. Traditionally, most applications use relational databases to store data. In relational databases, we have multiple tables which has well defined column names, data types, constraints and even indexes. This is suitable for transaction data.
'Schema on read' approach is where we do not enforce any schema during data collection. When reading the data, we use a schema based on our requirements. In our example, Ria could have two different schemas: one with 'supplier' column for her logistic needs and one without 'supplier' column for the data analyst.
If there is an additional column coming as part of the data, that would be collected even before modifying the schema. Schema can be modified to include the additional column if it is needed. One big risk with Schema on Write is that schema is enforced on data up front and there are chances that we might lose additional information.
Also if any constraint in schema used for reading is not satisfied, it does not affect data collection. Data collection is independent of the schema.
Ria understood what Amy suggested. She agreed that 'schema on read' would have been a better fit for her use case and started working towards it.
Thanks for reading! Please post your feedback and comments.