I have been busy as hell for the last few months, and I am finally able to find the time to write.
This blog is the first part of my DP-900 preparation series.
I am covering things in brief for revision purposes, so please read the documentation on MS Learn first. I recommend using these blogs for revision.
LET'S BEGIN, TADAA
As the name is data fundamentals, we should first know what data is.
Data is a collection of facts, numbers, and descriptions.
This course defines 3 types of data.
- Structured data
Data that adheres to a fixed schema or tabular format is structured data. Every record has the same fields or properties.
Rows represent data entities, and columns represent attributes.
Example: a table storing customers and their details
- Semi-structured data
It has some structure, but allows variation between entities.
Example: one customer may have more than one email ID, while another has only one
One common format for such data is JSON
- Unstructured data
Images, audio files, video data, and documents do not have a specific structure and fall into this category.
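To make the difference between structured and semi-structured data concrete, here is a minimal sketch in Python. The customer names, emails, and the `phone` field are made up by me for illustration; they are not from the course.

```python
import json

# Structured: every row has exactly the same columns (hypothetical customer table).
columns = ("id", "name", "email")
customers_table = [
    (1, "Asha", "asha@example.com"),
    (2, "Ben", "ben@example.com"),
]

# Semi-structured: each JSON document may carry different fields.
customers_json = """
[
  {"id": 1, "name": "Asha", "emails": ["asha@example.com", "asha.work@example.com"]},
  {"id": 2, "name": "Ben", "emails": ["ben@example.com"], "phone": "555-0100"}
]
"""
customers_docs = json.loads(customers_json)

# Every table row has the same shape; the documents do not.
row_lengths = {len(row) for row in customers_table}
doc_keys = [sorted(doc.keys()) for doc in customers_docs]

print(row_lengths)
print(doc_keys)
```

The table rows all have the same length, while the second JSON document has a field the first one lacks, and the first one has two emails.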
Now that we are done with the types of data, let's see DATA STORE...
Organizations store data in all 3 of the types defined above.
- File Store
Here data is stored in files. The file format depends on the type of data and on the applications that need to read and modify it.
- Delimited text files
CSV (comma-separated) and TSV (tab-separated) files hold structured data.
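As a quick sketch of how delimited files work, the snippet below parses the same rows once as CSV and once as TSV with Python's standard `csv` module. The sample data and the helper name `read_delimited` are my own assumptions.

```python
import csv
import io

# The same made-up rows, once comma-separated and once tab-separated.
csv_text = "id,name,email\n1,Asha,asha@example.com\n2,Ben,ben@example.com\n"
tsv_text = csv_text.replace(",", "\t")


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

print(csv_rows[0]["name"])
```

Only the delimiter changes between the two formats; the parsed rows come out the same.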
- JSON
It is good for both structured and semi-structured data.
A hierarchical document schema is used to define objects that have multiple attributes. Hope you are familiar with the JSON format.
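If you have not seen the hierarchical side of JSON before, this small sketch may help. The `customer` object and its field names are hypothetical; the point is only that an attribute can itself be an object or a list.

```python
import json

# A made-up customer object with nested attributes.
customer = {
    "firstName": "Asha",
    "address": {"city": "Pune", "country": "India"},
    "contact": [
        {"type": "email", "value": "asha@example.com"},
        {"type": "phone", "value": "555-0100"},
    ],
}

# Serialize to JSON text and read it back.
text = json.dumps(customer, indent=2)
restored = json.loads(text)

# Walk down the hierarchy.
city = restored["address"]["city"]
contact_types = [c["type"] for c in restored["contact"]]

print(text)
```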
- XML
It is a human-readable format. Tags enclosed in angle brackets are used to define elements and attributes.
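Here is a minimal sketch of elements and attributes in XML, read with Python's standard `xml.etree.ElementTree`. The `Customer` document is something I made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <Customer> and <Contact> are elements,
# name/type/value are attributes.
xml_text = """
<Customer name="Asha">
  <ContactDetails>
    <Contact type="email" value="asha@example.com"/>
    <Contact type="phone" value="555-0100"/>
  </ContactDetails>
</Customer>
""".strip()

root = ET.fromstring(xml_text)

customer_name = root.get("name")
contacts = {c.get("type"): c.get("value") for c in root.iter("Contact")}

print(customer_name, contacts)
```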
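- BLOB (Binary Large Object)
Images, video, and audio are stored this way.
So are application-specific documents.
It is for unstructured data.
It stores files as raw binary, and the application that reads them has to interpret the bytes.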
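To show what "raw binary" means in practice, this sketch writes some bytes to a file and reads them back unchanged. The payload is only a stand-in that starts with the PNG signature, not a real image, and the file name is my own choice.

```python
import os
import tempfile

# PNG files start with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
payload = PNG_SIGNATURE + bytes(range(16))  # stand-in bytes, not a real image

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "picture.bin")
    # "wb" / "rb" open the file in binary mode: no text decoding happens.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        blob = f.read()

# The store hands back bytes; the application decides what they mean.
looks_like_png = blob.startswith(PNG_SIGNATURE)

print(len(blob), looks_like_png)
```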
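- Optimized file formats
These enable compression, indexing, and efficient storage and processing.
As the name suggests, these formats are optimized, so they usually take less space than plain text formats. Three common ones are Avro, ORC, and Parquet.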
- Avro
It is a row-based format created by Apache. Each record has a header that describes the structure of the data. The header is stored as JSON, and the data is stored as binary information. It is good for compressing data and minimizing storage and network bandwidth requirements.
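The "JSON header plus binary rows" idea can be sketched with only the standard library. To be clear, this is NOT the real Avro file layout and does not use an Avro library; the schema, the `ROW_FORMAT` layout, and the `encode`/`decode` helpers are all my own assumptions for illustration.

```python
import json
import struct

# Hypothetical schema and rows.
schema = {
    "name": "Reading",
    "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "value", "type": "double"},
    ],
}
rows = [(1, 20.5), (2, 21.0), (3, 19.75)]

ROW_FORMAT = "<id"  # little-endian: 4-byte int followed by 8-byte double
ROW_SIZE = struct.calcsize(ROW_FORMAT)


def encode(schema, rows):
    """Write a length-prefixed JSON header, then each row as packed binary."""
    header = json.dumps(schema).encode("utf-8")
    body = b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)
    return struct.pack("<I", len(header)) + header + body


def decode(blob):
    """Read the JSON header back, then unpack the rows one by one."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    body = blob[4 + header_len:]
    decoded = [
        struct.unpack_from(ROW_FORMAT, body, offset)
        for offset in range(0, len(body), ROW_SIZE)
    ]
    return header, decoded


blob = encode(schema, rows)
decoded_schema, decoded_rows = decode(blob)

print(decoded_schema["name"], decoded_rows)
```

The data is laid out row after row, which is what "row-based" means here. For real Avro files you would use an Avro library instead.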
- ORC (Optimized Row Columnar format)
It organizes data into columns rather than rows.
An ORC file contains stripes of data, and each stripe holds the data for a column or set of columns.
Each stripe has an index, the data, and a footer.
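Here is a rough sketch of the "columns rather than rows" idea, with a stripe-like dictionary holding an index, the data, and a footer. The per-column statistics in the footer follow the MS Learn description; the dictionary layout, the sample rows, and the `build_stripe` helper are my own assumptions and are not the real ORC binary layout.

```python
# Made-up rows, as they would look in a row-based layout.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 80},
    {"id": 3, "city": "Pune", "amount": 200},
]


def build_stripe(rows, first_row_number=0):
    """Pivot rows into columns and attach simple per-column statistics."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    footer = {
        name: {"count": len(values), "min": min(values), "max": max(values)}
        for name, values in columns.items()
    }
    index = {"first_row": first_row_number, "row_count": len(rows)}
    return {"index": index, "data": columns, "footer": footer}


stripe = build_stripe(rows)

# A query on one column only touches that column's values.
amounts = stripe["data"]["amount"]
total = sum(amounts)

print(amounts, total, stripe["footer"]["amount"])
```

Because all values of one column sit together, a query that only needs `amount` does not have to read `id` or `city`.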
- Parquet
It is another columnar data format.
A Parquet file contains row groups.
Parquet files also contain metadata that describes the rows in each chunk.
It supports efficient compression and encoding schemes.
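To get a feel for why columnar formats compress and encode well, here is a simplified run-length encoding sketch. Run-length encoding is just one simple encoding idea that I picked for illustration; this is not how a Parquet library actually writes files, and the column values are made up.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Values of a single column stored together tend to repeat.
country_column = ["IN", "IN", "IN", "IN", "US", "US", "IN", "IN"]

encoded = rle_encode(country_column)
decoded = rle_decode(encoded)

print(encoded)
```

Eight values shrink to three pairs here, and decoding gives back the original column.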
Thanks for reading :)
SEE YOU IN THE NEXT BLOG