Big data is everywhere.
But before there were Data Lakes and big data tools, there were Datamarts and the Data Vault.
This is a beginner's overview of some of the terminology used by Business Intelligence teams in their data architecture.
The Datamart is the data that’s aggregated, modelled and structured so that you can build a report on top of it. Your friendly BI Consultant has gathered the requirements from the business, agreed on the business rules and logic, then modelled the data to be user-friendly.
A Datamart’s tables are organised in one of two forms, a ‘Star’ schema or a ‘Snowflake’ schema, and are made up of two types of tables.
The Dimension table contains attributes, or things that describe. In the case of a User this could be their first name, last name, age, etc. In the case of a Product it could be its weight, colour or date of manufacture. These attributes should not be repeated in other tables around the database.
The Fact table is used for counting how many times something has happened: how many orders, pay runs, etc. It captures an action that can then be joined back to a Dimension table.
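To make the Fact/Dimension split concrete, here is a minimal sketch of a star schema using Python’s built-in `sqlite3` module. The table and column names (`dim_user`, `fact_order`, `user_key`) are illustrative, not a standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension: descriptive attributes, stored once per user.
cur.execute("CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
cur.executemany("INSERT INTO dim_user VALUES (?, ?, ?)",
                [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")])

# Fact: one row per order event, referencing the dimension by its key.
cur.execute("CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, user_key INTEGER, amount REAL)")
cur.executemany("INSERT INTO fact_order VALUES (?, ?, ?)",
                [(100, 1, 25.0), (101, 1, 40.0), (102, 2, 15.0)])

# A typical report query: join the fact back to the dimension
# and count/total orders per user.
rows = cur.execute("""
    SELECT d.first_name, COUNT(*) AS orders, SUM(f.amount) AS total
    FROM fact_order f
    JOIN dim_user d ON d.user_key = f.user_key
    GROUP BY d.first_name
    ORDER BY d.first_name
""").fetchall()
print(rows)  # [('Ada', 2, 65.0), ('Alan', 1, 15.0)]
```

Notice the Fact table holds only keys and measures; everything descriptive lives in the Dimension, so it is never repeated per order.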
Example of a Snowflake schema by SqlPac at English Wikipedia, CC BY-SA 3.0
Example of a Star schema by SqlPac at English Wikipedia, CC BY-SA 3.0
The Data Vault is used to store source data at a more granular level. Generally, the data is not changed in any way, other than to add load date keys to track changes. These are the tables that build the Datamart tables.
Tables come in three types:
Satellites are the complete source tables that contain descriptive information and time attributes so we can track changes and do point-in-time analysis.
Hubs contain the business keys and any metadata. Nothing descriptive is written to a Hub.
Links connect one or more Hubs together.
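The three table types can be sketched with the same `sqlite3` approach. This is a toy illustration, with made-up names, of how a Hub, a Satellite and a Link divide the data, and how the Satellite’s load dates allow point-in-time analysis:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hubs: just the business key plus load metadata, nothing descriptive.
cur.execute("CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_date TEXT)")
cur.execute("CREATE TABLE hub_product (product_key TEXT PRIMARY KEY, load_date TEXT)")

# Satellite: descriptive attributes; a change adds a new row rather
# than overwriting, so history is preserved.
cur.execute("""CREATE TABLE sat_customer (
    customer_key TEXT, first_name TEXT, city TEXT, load_date TEXT)""")

# Link: connects Hubs together (e.g. a purchase relationship).
cur.execute("""CREATE TABLE link_purchase (
    customer_key TEXT, product_key TEXT, load_date TEXT)""")

cur.execute("INSERT INTO hub_customer VALUES ('C1', '2024-01-01')")
cur.executemany("INSERT INTO sat_customer VALUES (?, ?, ?, ?)", [
    ("C1", "Ada", "London", "2024-01-01"),
    ("C1", "Ada", "Leeds", "2024-03-01"),  # the customer moved city
])

# Point-in-time query: where did customer C1 live as of February 2024?
row = cur.execute("""
    SELECT city FROM sat_customer
    WHERE customer_key = 'C1' AND load_date <= '2024-02-01'
    ORDER BY load_date DESC LIMIT 1
""").fetchone()
print(row[0])  # London
```

Because nothing is overwritten, the Datamart builds can replay the Satellite rows to produce either the current view or the view at any past date.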
Example of a simple data vault model by Carbidfischer at English Wikipedia, CC BY-SA, CC BY-SA 4.0
The Data Lake is designed to contain raw data in its most granular form at low cost. Data can be:
- Structured - tables from databases.
- Semi-structured - CSVs and log files.
- Or not structured at all - images, emails and documents.
Data stored here is retrieved and read when required and organised based on need, making it popular with data scientists who would rather keep the quirks and ‘noise’ in, rather than having it cleaned and aggregated.
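This “organised based on need” idea is often called schema-on-read: the raw lines sit untouched in the lake, and structure is only applied at query time. A minimal sketch, with a made-up log format:

```python
# Raw, semi-structured log lines as they might land in a Data Lake.
raw_lines = [
    "2024-05-01T10:00:00 INFO user=42 action=login",
    "2024-05-01T10:05:12 ERROR user=42 action=checkout",
    "2024-05-01T10:06:30 INFO user=7 action=login",
]

def parse(line):
    # Structure is imposed only at read time; the raw line is untouched.
    timestamp, level, *pairs = line.split()
    record = {"timestamp": timestamp, "level": level}
    record.update(pair.split("=", 1) for pair in pairs)
    return record

# Only parse what the current question needs: the error events.
errors = [parse(l) for l in raw_lines if " ERROR " in l]
print(errors[0]["action"])  # checkout
```

A different question tomorrow can parse the same raw lines a different way, which is exactly the flexibility (and the ‘noise’) data scientists value.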
Beware … the Data Swamp.
Even though the Data Lake concept gives more freedom to load data of all shapes and sizes, without a catalogue of entities, things get messy.
This isn’t just a matter of not knowing what's in your Data Lake. If there are no controls over what kind of personal data is landed there, there could also be privacy issues.