re: BI Series: Datamarts, Data Vault, Data Lake... Data Swamp? VIEW POST


Thanks for sharing your experience. Sounds like it was a pretty frustrating project.

There’s no rules to say that everything in a relational database has to be completely normalised with primary/foreign keys on each table. There’s nothing to stop your team storing what is effectively a collection of spreadsheets. It’s an expensive way to go , but not flat out wrong.

Database development is a lot like software development. Sometimes a decision made a few years back causes headaches down the line - setting too strict data types, building metrics on dim tables, retrospectively adding keys, dropping tables without checking dependcies ... so a big bang overhaul to the architecture probably isn’t the best option.

What was the outcome for your project?


That's a good point about technical debt. I'm always cognizant that good database design is hard, but I suppose I should remember that it's not necessarily that someone sat down one day and said "I think this is a good design". These sort of situations can occur over time. And unlike software you cannot refactor a database, so any serious change will need an upgrade/transform step. And I've been involved in enough upgrades to know how scary that can seem.

We actually got it to a POC where its heuristics of inferring schema from unstructured/semi-structured sources worked fairly well, generating UNIQUE queries to find primary keys and JOIN queries for foreign keys. Then show our 'best guess' to a user who could fiddle with the diagram correcting mistakes before committing to the new schema and starting the transfer. But at a certain point if you had a very large, convoluted unstructured database it became a 'garbage in, garbage out' situation.

The actual frustrating part of the project is that priorities shifted and it got put on the shelf :(

code of conduct - report abuse