re: Datamarts, Data Vault, Data Lake... Data Swamp?


From a data analyst's perspective, how fuzzy do you think the boundaries between these Data * types are? How often do you think they are broken, and how much do you lose when you break them?

To provide some context for my question: earlier this year I was working on a data migration tool that could take data from various sources (structured, semi-structured and unstructured) and put it in a single database. After getting it to a point where it could handle test data sources, we wanted to try it on a real database. We pointed it at an internal quality-data warehouse, which combined data from Jira, builds, test results, etc., and that we knew was fairly large but not big-data large. Much to our surprise, we found that this SQL Server instance, a relational database combining data from multiple other relational databases, had ZERO relationships defined. As devs, everyone on my team was flabbergasted. Why would anyone set up a system like this? And yet clearly the consumers of this data don't consider it much of an issue.
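For anyone wondering how you discover something like this, the check is just a query against the database's own metadata. Here's a minimal sketch using Python's sqlite3 (as a self-contained stand-in; against SQL Server you'd query `sys.foreign_keys` instead, and the table names below are made up):

```python
import sqlite3

# Two tables that clearly relate to each other, but with no
# foreign key declared -- mirroring what we found in the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (id INTEGER PRIMARY KEY, summary TEXT);
    CREATE TABLE builds (id INTEGER PRIMARY KEY, issue_id INTEGER);
""")

def declared_foreign_keys(conn):
    """Return every (table, from_column, to_table) relationship the schema declares."""
    fks = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        for row in conn.execute(f"PRAGMA foreign_key_list({table})"):
            # row layout: (id, seq, to_table, from_col, to_col, ...)
            fks.append((table, row[3], row[2]))
    return fks

print(declared_foreign_keys(conn))  # -> [] : zero relationships defined
```

An empty result from the metadata query, against a database you know holds related tables, is exactly the "flabbergasted" moment described above.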

So I'm curious, is this sort of thing common? A personal preference for certain types of analysts? A bizarre aberration? A sign the architect is mentally ill and requires treatment?


Thanks for sharing your experience. Sounds like it was a pretty frustrating project.

There's no rule that says everything in a relational database has to be fully normalised with primary/foreign keys on every table. There's nothing to stop your team storing what is effectively a collection of spreadsheets. It's an expensive way to go, but not flat-out wrong.

Database development is a lot like software development. Sometimes a decision made a few years back causes headaches down the line: setting overly strict data types, building metrics on dim tables, retrospectively adding keys, dropping tables without checking dependencies... so a big-bang overhaul of the architecture probably isn't the best option.

What was the outcome for your project?


That's a good point about technical debt. I'm always cognizant that good database design is hard, but I suppose I should remember that it's not necessarily that someone sat down one day and said "I think this is a good design". These sorts of situations can accumulate over time. And unlike software, you can't simply refactor a database: any serious change needs an upgrade/transform step. And I've been involved in enough upgrades to know how scary that can seem.

We actually got it to a POC where its heuristics for inferring a schema from unstructured/semi-structured sources worked fairly well, generating uniqueness queries to find candidate primary keys and JOIN queries to find foreign keys. It would then show our 'best guess' to a user, who could fiddle with the diagram to correct mistakes before committing to the new schema and starting the transfer. But past a certain point, with a very large, convoluted unstructured database, it became a 'garbage in, garbage out' situation.
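The core of those two heuristics can be sketched in a few lines. This is a simplified illustration only (sqlite3 in memory, made-up table/column names, no handling of composite keys or type coercion, none of which the real tool could skip): a column is a primary-key candidate if it's unique and non-null, and a column looks like a foreign key if every one of its values matches a key in the candidate parent table.

```python
import sqlite3

# Toy data: a parent table of issues and a child table of test results
# that references it, with no keys declared anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues  (issue_id INTEGER, summary TEXT);
    CREATE TABLE results (run_id INTEGER, issue_id INTEGER, passed INTEGER);
    INSERT INTO issues  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO results VALUES (10, 1, 1), (11, 1, 0), (12, 3, 1);
""")

def candidate_keys(conn, table, columns):
    """Primary-key candidates: columns whose values are unique and never NULL."""
    cands = []
    for col in columns:
        total, distinct, nulls = conn.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT {col}), "
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) FROM {table}"
        ).fetchone()
        if total == distinct and not nulls:
            cands.append(col)
    return cands

def looks_like_foreign_key(conn, child, col, parent, key):
    """FK candidate: every value in child.col matches some value in parent.key."""
    unmatched = conn.execute(
        f"SELECT COUNT(*) FROM {child} LEFT JOIN {parent} "
        f"ON {child}.{col} = {parent}.{key} WHERE {parent}.{key} IS NULL"
    ).fetchone()[0]
    return unmatched == 0

print(candidate_keys(conn, "issues", ["issue_id", "summary"]))
print(looks_like_foreign_key(conn, "results", "issue_id", "issues", "issue_id"))
```

Note the false positive this sketch happily produces: `summary` also comes back as a key candidate because it happens to be unique in the sample, which is exactly why the 'show a best guess and let a human correct the diagram' step mattered.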

The actual frustrating part of the project is that priorities shifted and it got put on the shelf :(
