In the previous article, we touched upon the basic properties of data, its classification, and the various schema types available for organizing data. Please check it out if you missed it!
In this post, let's discuss data formats.
What is a Data format?
A data format is a way of representing data for storage and for transmission across software systems. Consider passing a request to an API service, transferring multimedia content across the internet, or sending and receiving email or short messages on web and mobile. In all these cases, an agreed message format is required so that the sender and receiver can understand each other and communicate accordingly.
Doesn't a database also store data and help transmit it?
Yes, it does, but a database behaves as a complete storage system: it stores, manages, and serves data. In addition, it handles indexing, querying, and defining relations between data. A data format, on the other hand, is only the way the data is represented and does not go beyond that (for example, we create a config file in XML/CSV/TXT format as a way of representing the configuration of a system).
Similarly, a MySQL database can hold data in structured rows and columns, or keep flatter content such as CSV text, JSON documents, or BLOBs inside a table. The database also lets you query and index that data. So a data format is only the representation of data, whereas a database is a management system that encompasses everything related to data, including the data formats.
Is it really necessary to know about the data formats?
Yes. It matters when you design a system that depends on various internal and external components for its normal operation. A careful analysis of the available formats helps optimize data retrieval and transfer rates.
Data formats also depend on the industry and the application involved. Many industries define their own custom formats for security and proprietary reasons (for example, industries using different CAD/CAM systems have customized CAD data formats for exchanging data with external suppliers, customers, and other industrial partners).
Check out the Wikipedia article on data exchange in CAD formats (CAD_DATA_EXCHANGE).
What are the different Format Categories?
In a broad sense, all data is represented at the binary level (0s and 1s) for the computer to process. Software systems define additional data formats on top of that for human readability and interoperability between systems. Some commonly used categories are shown in the diagram below:
- Binary data formats store data as raw bytes laid out for machines rather than as readable text. They are used for compact storage and faster retrieval of large data such as video files, audio files, and images.
- Text data formats are human readable (numbers, letters, symbols), so they can be read and edited by people as well as processed by programs. CSV, TXT, JSON, XML, and similar plain-text documents are all of this type.
- Vector formats are normally used for storing drawings and graphical images as shapes and paths that software can scale and edit, rather than as a static grid of pixels. SVG (Scalable Vector Graphics), EPS (Encapsulated PostScript), and EMF (Enhanced Metafile) are examples.
- Compressed formats take the original data and reduce it to a smaller size using standard compression algorithms without losing the original content. ZIP and RAR files are compressed formats that we use quite regularly. TAR by itself only bundles files into an archive and is usually paired with a compressor such as gzip (.tar.gz).
- Custom formats are specific to certain products, for proprietary and security reasons. They can also be a hybrid of the other formats described above. For example, in health records and medical imaging, custom formats are created for storing patient information that must be secured and kept under the purview of the health company handling it.
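To make the text versus binary distinction concrete, here is a minimal Python sketch that writes the same record both ways. The record, its field names, and the byte layout are hypothetical and chosen only for illustration; real binary formats define their own layouts.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"id": 1024, "temperature": 21.5, "active": True}

# Text format: JSON, readable in any text editor.
text_bytes = json.dumps(record).encode("utf-8")

# Binary format (assumed layout): little-endian unsigned int (4 bytes),
# double (8 bytes), and bool (1 byte), with no padding.
LAYOUT = "<Id?"
binary_bytes = struct.pack(
    LAYOUT, record["id"], record["temperature"], record["active"]
)

# Reading the binary form back requires knowing the layout in advance.
rec_id, temperature, active = struct.unpack(LAYOUT, binary_bytes)

print(text_bytes)
print(len(text_bytes), "bytes as JSON,", len(binary_bytes), "bytes as binary")
```

The binary form is smaller, but it is unreadable without the layout, which is the usual price of compactness.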
Cloud, Containers & Devices: With the current digital explosion, there is a need for lightweight data formats across handheld, embedded, and IoT devices to support quick responses and short message transfers. Likewise, handling big data in cloud and container ecosystems calls for various newer data formats, as shown below.
What are the Factors to consider while selecting Data formats?
The selection of a data format for I/O operations or data transmission hinges on the basic factors given in the diagram below.
In addition to analyzing the factors above, we need to find and understand the trade-offs before settling on a specific data format. (A trade-off here means deciding, for example, whether to choose a text format for a simple message transmission application, or a binary format so that transmission can handle large volumes of data and be faster. It is advisable to weigh the advantages and disadvantages of each format, keeping the end-user experience central.)
Let's consider a set of constraints
Most of the time, requirements and constraints drive us toward a specific data format. Consider the following list of constraints based on the above factors.
- Low network bandwidth calls for compact encoding and good compression for storage and transmission. Protocol Buffers and Avro are binary formats and typically produce smaller payloads than text-based formats.
- A low network latency requirement calls for smaller data packets, less time spent synchronizing between sender and receiver, and fewer network hops. Binary encoding formats help achieve smaller packet sizes, and cloud-based systems help by serving data from a central location on the network to reduce hops. Avro, gzip, and ORC can help here.
- Low memory calls for efficient compression and minimal storage requirements. Binary formats and compression methods help with this, for example gzip, Avro, and Protocol Buffers.
- Low processing power calls for less complex data structures, efficient compression, and simpler constructs. JSON, CSV, and plain text have simple constructs and, combined with gzip compression, can address the issue. Protocol Buffers, with its binary encoding and a schema that can evolve with requirements, could also help in this regard.
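One way to check a bandwidth or memory constraint is simply to measure. The following Python sketch compares a JSON payload before and after gzip compression, using only the standard library. The payload is made up for illustration, and actual savings depend heavily on how repetitive the real data is.

```python
import gzip
import json

# Hypothetical payload: 200 similar records, as a sensor or log feed might produce.
records = [{"id": i, "status": "OK", "reading": i * 0.5} for i in range(200)]

# Plain JSON text, encoded as UTF-8 bytes.
raw = json.dumps(records).encode("utf-8")

# The same bytes after gzip compression.
compressed = gzip.compress(raw)

# Decompressing and parsing must give back the original records.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(f"raw JSON:  {len(raw)} bytes")
print(f"gzip JSON: {len(compressed)} bytes")
```

Compression costs CPU time on both ends, so on a device with low processing power the smaller payload has to be weighed against the extra work.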
Because constraints differ with application requirements, available infrastructure, and various other internal and external factors, we need to arrive at the format that is best for our application.
How to transmit data faster and more securely?
Clients and servers are distributed across public and private networks, so encoding the data for efficient transfer and encrypting it for secure transmission are both necessary.
Encoding converts data from its original form into a form that can be transmitted or stored reliably or more efficiently. Some encodings, such as compression or compact binary encodings, make the data smaller so that it can be transmitted more quickly or stored more efficiently. Others, such as base64, make the data safe to carry over text-only channels at the cost of some extra size.
Decoding converts the encoded data back into its original form so that the system can use it. Where the encoding involved compression, decoding includes decompressing the data.
Encryption secures the data as it travels through the network so that eavesdroppers cannot read the message. Both encoding and encryption are quite important in today's networked environment.
Similarly, decryption deciphers the encrypted message at the receiving end without any loss or inconsistency.
(Some common encoding/decoding methods are base64, UTF-8, ASCII, and URL encoding. Some common encryption/decryption methods are AES, RSA, and the older DES.) In later posts, I will cover the encoding and encryption concepts in more detail.
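As a small illustration of encoding and decoding, the Python sketch below round-trips a string through UTF-8 and URL (percent) encoding. The sample string is made up. Encryption is left out on purpose, since Python's standard library does not include AES or RSA and a third-party library would be needed.

```python
from urllib.parse import quote, unquote

# Made-up sample text with a non-ASCII letter and URL-reserved characters.
text = "name=José & role=R&D"

# UTF-8 turns characters into bytes; "é" takes two bytes.
utf8_bytes = text.encode("utf-8")

# URL encoding replaces unsafe characters with %XX escapes.
url_safe = quote(text, safe="")

# Both encodings are reversible without any key.
assert utf8_bytes.decode("utf-8") == text
assert unquote(url_safe) == text

print(utf8_bytes)
print(url_safe)
```

Neither step hides anything: anyone who receives the encoded text can decode it, which is why encoding does not replace encryption.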
The process diagram below shows these concepts.
(Encoding and encryption can happen in different orders and at different stages, not necessarily as depicted above. It all depends on the requirements and the specific architecture the team is working with.)
Here is the sample "Employee" object represented in different data formats.
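As a rough sketch of what such a comparison looks like in code, the Python snippet below serializes one record as JSON, CSV, and XML using only the standard library. The field names and values are hypothetical and are not taken from the original example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical Employee record; the fields are assumptions for illustration.
employee = {"id": 101, "name": "Jane Doe", "department": "Engineering"}

# JSON: key/value pairs in a single text document.
as_json = json.dumps(employee)

# CSV: a header row followed by one data row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(employee))
writer.writeheader()
writer.writerow(employee)
as_csv = buffer.getvalue()

# XML: one child element per field under an <Employee> root.
root = ET.Element("Employee")
for field, value in employee.items():
    ET.SubElement(root, field).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_csv)
print(as_xml)
```

The data is identical in all three cases; only the representation changes, along with how much structure and type information each format can carry.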
For the above "Employee" data, let me perform base64 encoding (over the UTF-8 bytes). Below is the output:
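A minimal Python sketch of this step, assuming a hypothetical Employee record (the fields below are illustrative, not the original sample), could look like this:

```python
import base64
import json

# Hypothetical Employee record; the fields are assumptions for illustration.
employee = {"id": 101, "name": "Jane Doe", "department": "Engineering"}

# Serialize to JSON text, then to UTF-8 bytes.
json_bytes = json.dumps(employee).encode("utf-8")

# base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(json_bytes).decode("ascii")

# Decoding reverses the steps and needs no key.
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))

print(encoded)
print(len(json_bytes), "bytes before,", len(encoded), "characters after")
```

The encoded text is roughly a third larger than the input, so base64 is a transport convenience rather than a size optimization.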
As you can see, the data has been converted from JSON text into a restricted set of ASCII characters, which is useful for transmission over HTTP. Note that base64 on its own provides no secrecy, since anyone can decode it. With an additional encryption method such as AES or RSA, the same data becomes secure and difficult for attackers or intermediaries to decipher. Try out different sample data and encode/decode it using the URL below.
For a more in-depth understanding of the data formats discussed, please check out the links below for the specifications and standards that are referred to and used globally.
XML Specification ---> Xml_Spec
JSON Specification ---> Json_Spec
Protocol Buffers Specification ---> Proto_Buffer
CSV Specification ---> Csv_Spec
YAML Specification ---> Yaml_Spec
Avro Specification ---> Avro_Spec
Please share your thoughts, and list any data formats you have come across that would be helpful for other readers.