Demystifying Metadata Management — Part 1
Metadata management provides the base for an organization’s data platform architecture. Let’s look at each component and the role it plays in metadata management.
Data:
Data is a collection of raw and unorganized facts that can be used in calculating, reasoning or planning. Without proper processing and organizing, it is useless. That’s where metadata comes into play.
Good read on Data: Blog by Dataedo
Image Courtesy: Dataedo by @piotr kononow
Metadata:
Metadata is simply data about data. In other words, it describes the data and gives it context. It helps to organize, find and understand data through information such as format, origin, creation date, modification date, etc.
Data stores information, but if you don’t know how to interpret it, you don’t have access to this information. Metadata enables you to understand data and extract the information.
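To make "data about data" concrete, here is a minimal Python sketch that builds a small metadata record for a file. The function name `describe_file`, the field names and the sample `orders.csv` file are all made up for illustration. Creation date is left out on purpose, because it is not available in a portable way on every operating system.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Return a small metadata record for a file (hypothetical helper)."""
    p = Path(path)
    info = p.stat()  # file system metadata: size, timestamps, etc.
    return {
        "name": p.name,
        "format": p.suffix.lstrip(".") or "unknown",
        "size_bytes": info.st_size,
        "modified_at": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Create a throwaway CSV so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "orders.csv"
    sample.write_text("order_id,amount\n1,19.99\n2,5.00\n")
    record = describe_file(sample)

print(record)
```

The file content is the data; the dictionary is the metadata that tells the next person, or the next machine, what the file is without opening it.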
Metadata, you see, is really a love note — it might be to yourself, but in fact it’s a love note to the person after you, or the machine after you, where you’ve saved someone that amount of time to find something by telling them what this thing is.
Cit. Jason Scott’s Weblog
Good read on metadata: Blog by Dataedo
Data Democratization:
Data democratization means empowering the employees and stakeholders of an organization with the right set of tools to make informed decisions.
Data democratization is the ongoing process of enabling everybody in an organization, irrespective of their technical know-how, to work with data comfortably, to feel confident talking about it, and as a result, make data-informed decisions and build customer experiences powered by data.
Data democratization addresses complaints like:
“Experts in my company are too busy to help me.”
“I do not have access to data.”
“I cannot trust the data.”
Data democratization is an ongoing process and needs a cultural shift, because it depends on another ongoing process called data literacy.
Image Courtesy: Arpit Choudhury from his Medium blog
Good read on Data Democratization: Blog by Towards Data Science
Data Literacy:
The ability to read, analyze, work and communicate with data — known as data literacy — is now so critical to companies that it has been hailed as the second language of business by Gartner. The global pandemic highlighted its importance, with many companies starting to rely on data to detect new patterns, respond to changing customer behavior and make first-of-a-kind decisions in a new environment of many unknown factors.
Poor data literacy is ranked as the second-biggest internal roadblock to the success of the CDO’s office, according to the Gartner Annual Chief Data Officer Survey.
In upcoming years, data literacy will become essential in driving business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs.
One common misconception about data democratization and data literacy is that everyone in the company will now know everything about the data and can get you details about it in no time, so there will be no need for a subject matter expert or a data architect. This is not true.
Data literacy and democratization give people a way to work independently, complete their tasks and steer the company in the right direction based on data, leaving no place for presumption.
Good read on Data Literacy: Blog by thedataliteracyproject
Image Courtesy: Dataedo by @piotr kononow
Data Architect & Data Engineer:
The data architect and data engineer titles are closely related and, as such, frequently confused. The difference between the two roles lies in their primary responsibilities.
Data architects design the vision and blueprint of the organization’s data framework, while data engineers are responsible for building that vision.
Data architects provide technical expertise and guide data teams on bringing business requirements to life; data engineers ensure data is readily available, secure, and accessible to stakeholders (data scientists, data analysts) when they need it.
Data architects have substantial experience in data modeling, data integration, and data design and are often experienced in other data roles; data engineers have a strong foundation in programming with software engineering experience.
The data architect and the data engineer work together to build the organization’s data system.
Good read on Data Architect vs Data Engineer: Blog by rsTask
Image Courtesy: Arun Elangovan
Data Steward & Data Analyst & Data Scientist:
- Data analysts gather data from various databases and warehouses, then filter and clean it. Data scientists perform ad-hoc data mining and gather large sets of structured and unstructured data from several sources.
- Data analysts write complex SQL queries and scripts to collect, store, manipulate, and retrieve data from RDBMSs such as MS SQL Server, Oracle DB, and MySQL. Data scientists use various statistical methods and data visualization techniques to design and evaluate advanced statistical models from vast volumes of data.
- Data analysts create different reports with the help of charts and graphs using Excel and BI tools. Data scientists build AI models using various algorithms and built-in libraries.
- Data analysts spot trends and patterns in complex datasets. Data scientists automate tedious tasks and generate insights using machine learning models.
At a high level, the data steward handles day-to-day operations based on policies created by the data architect.
Data Engineers are the Bridge by Jennifer Shalamanov
The data steward is the “go-to” person for everyone working with data within the company. Typical data steward roles and responsibilities can be grouped as:
- Operational Oversight: a data steward oversees the lifecycle of a data set. They are responsible for defining and implementing rules and regulations for the day-to-day operational and administrative management of data and systems.
- Data Quality: data steward responsibilities include establishing data quality metrics and requirements, like setting acceptable values, ranges, and parameters for every data element.
- Privacy, Security, and Risk Management: data protection is a key aspect of data steward responsibilities. A steward must establish regulations and conventions that govern data proliferation to ensure that data privacy controls are exercised in all processes.
- Policies and Procedures: data stewards also establish policies and procedures for data access, including authorization criteria based on the individual and/or their role.
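The data quality responsibility above (acceptable values and ranges per data element) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `QualityRule` class, the `validate` function, the column names and the allowed country codes are all hypothetical, not part of any specific governance tool.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class QualityRule:
    """One steward-defined rule for a single data element (hypothetical)."""
    column: str
    description: str
    check: Callable[[Any], bool]


# Example rules: a numeric range and a set of acceptable values.
RULES = [
    QualityRule("age", "age must be between 0 and 120",
                lambda v: isinstance(v, int) and 0 <= v <= 120),
    QualityRule("country", "country must be an allowed code",
                lambda v: v in {"US", "IN", "DE"}),
]


def validate(rows, rules):
    """Return (row index, column, rule description) for every violation."""
    violations = []
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.column)):
                violations.append((index, rule.column, rule.description))
    return violations


rows = [
    {"age": 34, "country": "US"},
    {"age": 150, "country": "IN"},  # age out of range
    {"age": 28, "country": "XX"},   # country not allowed
]

violations = validate(rows, RULES)
for violation in violations:
    print(violation)
```

Keeping each rule as data (a column, a description and a check) means the steward can review and extend the rule list without touching the validation loop.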
Good read on Data Steward vs Data Analyst: Blog by Simplilearn
Data Warehouse & Data Lake & Data Mart:
A data warehouse (DW) is a system that aggregates data from connected databases, then transforms and stores it in an analytics-ready state. The main benefits of a data warehouse are effective data consolidation, fast pre-processing, and easy self-service access for business users. The key constraint of a data warehouse solution is the need to pre-transform all data into standard schemas, which increases usage costs and reduces scalability potential.
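As a rough sketch of "transform first, then store", the Python example below maps records from two differently shaped sources onto one standard schema before loading them. It uses an in-memory SQLite database as a stand-in for a real warehouse, and the source systems, field names and `fact_orders` table are assumptions made up for the example.

```python
import sqlite3
from datetime import datetime

# Two hypothetical source systems with differently shaped records.
crm_orders = [{"id": "A-1", "total": "19.99", "date": "2024-01-05"}]
shop_orders = [{"order": 7, "amount_cents": 500, "day": "05/01/2024"}]


def from_crm(record):
    """Map a CRM record onto the standard (id, cents, ISO date) schema."""
    return (record["id"], round(float(record["total"]) * 100), record["date"])


def from_shop(record):
    """Map a shop record onto the same schema, normalizing the date format."""
    iso_day = datetime.strptime(record["day"], "%d/%m/%Y").date().isoformat()
    return (f"S-{record['order']}", record["amount_cents"], iso_day)


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute(
    "CREATE TABLE fact_orders ("
    "order_id TEXT PRIMARY KEY, "
    "amount_cents INTEGER NOT NULL, "
    "order_date TEXT NOT NULL)"
)

# Every row is transformed BEFORE it is stored.
rows = [from_crm(r) for r in crm_orders] + [from_shop(r) for r in shop_orders]
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

daily_totals = conn.execute(
    "SELECT order_date, SUM(amount_cents) FROM fact_orders GROUP BY order_date"
).fetchall()
print(daily_totals)
```

Because both sources are forced into one schema up front, the final query stays simple. The price is that every new source needs its own mapping function before any of its data can be loaded, which is the constraint described above.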
Data warehouse solutions:
- Azure Synapse Analytics
- Amazon Redshift
- Google BigQuery
- Snowflake
Image Courtesy: Dataedo by @piotr kononow
A data lake is a centralized cloud-based repository for storing raw (unprocessed, non-cataloged, or uncleansed) data from various systems. Unlike data warehouses, data lake technology allows storing both structured and unstructured data of any size (as object blobs or files). Cloud data lakes are also more scalable and support more querying methods for data retrieval and analysis, which data scientists appreciate.
Data lake solutions:
- Azure Data Lake
- Amazon S3
- Apache Hadoop
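The data lake idea of storing raw data as-is and applying structure only when it is read can be sketched like this. A temporary local directory stands in for an object store, and the file names and contents are invented for the example.

```python
import csv
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as lake_root:
    lake = Path(lake_root)  # stand-in for an object store bucket

    # Write: raw objects land unchanged, whatever their shape.
    (lake / "clicks.json").write_text(
        '{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/cart"}\n'
    )
    (lake / "orders.csv").write_text("order_id,amount\n1,19.99\n")
    (lake / "support_call.txt").write_text("Customer asked about a late delivery.")

    stored = sorted(p.name for p in lake.iterdir())

    # Read: each consumer applies the structure it needs, only at read time.
    clicks = [
        json.loads(line)
        for line in (lake / "clicks.json").read_text().splitlines()
    ]
    with (lake / "orders.csv").open(newline="") as f:
        orders = list(csv.DictReader(f))

print(stored)
print(clicks)
print(orders)
```

Nothing was transformed on the way in, so structured (CSV), semi-structured (JSON lines) and unstructured (free text) data sit side by side. The trade-off is that every reader has to know how to interpret what it finds, which is where metadata helps.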
A data mart is a more focused subset of the data present in a data warehouse. It is generally concerned with a single team or department, like finance, marketing, or sales. It is smaller than the warehouse and may contain summaries of data that best serve its community of users. A data mart might be a portion of a data warehouse, too.
A data mart has a few benefits over giving every department access to the full warehouse:
- Cost-efficiency
- Simplified data access
- Quicker access to insights
- Simpler data maintenance
- Easier and faster implementation
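One simple way to picture a data mart is as a department-specific, summarized view over the warehouse. The sketch below uses an in-memory SQLite database in place of a real warehouse; the table `warehouse_sales`, the view `finance_mart` and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute(
    "CREATE TABLE warehouse_sales "
    "(department TEXT, region TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("finance", "EU", 12000),
        ("finance", "US", 8000),
        ("marketing", "EU", 3000),
        ("sales", "US", 15000),
    ],
)

# The mart: finance rows only, already summarized per region.
conn.execute(
    """
    CREATE VIEW finance_mart AS
    SELECT region, SUM(amount_cents) AS total_cents
    FROM warehouse_sales
    WHERE department = 'finance'
    GROUP BY region
    """
)

finance_rows = conn.execute(
    "SELECT region, total_cents FROM finance_mart ORDER BY region"
).fetchall()
print(finance_rows)
```

The finance team queries `finance_mart` and never sees marketing or sales rows. In practice a mart is often a separate physical store rather than a view, but the idea of a smaller, focused slice is the same.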
Good read on Data Warehouse vs Data Lake: Blog by AWS
Conclusion:
This is the first part of a series on metadata management. It is meant to help you build the conceptual blocks of metadata management.
Please stay tuned for more parts of the series, where we will discuss metadata management in detail and walk through creating metadata management for an example organization.
Please comment if you want me to focus on metadata management for a specific industry like E-Commerce, Healthcare, or Offline retail.
Keep Learning: Please refer here for Part 2 of Demystifying Metadata Management.