Sumit Tyagi

Posted on Oct 5, 2022

How to choose the right cloud data storage service for your project?

#database #datascience #azure #aws

Since data comes in all shapes and sizes, no single cloud solution can fit all data.

The way you store and manipulate banking data is different from how you treat X-ray images.
Even within a bank, historical transaction data will never be updated, so you treat it differently from customer data, which is bound to change.

Hence, we need to select the best cloud data storage service for our use cases for better performance, cost savings, and improved manageability.

Seeing a ton of storage services on your cloud console can be baffling. But if we let the data speak for itself about its requirements, it becomes easy to filter out the best available option. We can follow the steps below to map the characteristics of our data and data operations to the best data service.

Step 1: Understanding Data & Operations Characteristics

Different kinds of data behave differently, and each requires its own approach to operate on it effectively. The key characteristics for making a good decision are:

Data Type & Structure: What type of data do we have?
Data Size: How much data do we need to store?
Data Operations: [OLTP/OLAP] What type of transactions do we need to perform on data?
Data Hotness: How often do we access or transact on the data?
Data Availability: How much down-time is acceptable?
Data Privacy: Are there any privacy concerns with the data?
Data Replication: Do we need multiple instances of data available for consumption?

Data Type & Structure:

At a high level, there are three types of data:

Structured Data: Relational data in which every record has the same fields or properties, and therefore the same organisation and shape, or schema. An example is historical sales data.
Semi-Structured Data: Data that isn't stored in a relational format because the fields don't fit neatly into tables, rows, and columns. Semi-structured data contains tags that make the organisation and hierarchy of the data apparent. One example is key/value pairs. Semi-structured data is also referred to as non-relational or not only SQL (NoSQL) data.
Unstructured Data: Data with no predefined schema, such as blobs, media files (photos, videos, and audio), Word documents, text files, and log files.
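To make the three categories concrete, here is a minimal Python sketch. The sample records (order fields, user events, and the byte string) are made up for illustration and are not from any real dataset.

```python
import csv
import io
import json

# Structured: every record has the same fields (a fixed schema).
sales_csv = "order_id,region,amount\n1,UK,20.5\n2,AU,13.0\n"
sales_rows = list(csv.DictReader(io.StringIO(sales_csv)))
structured_schemas = {tuple(row.keys()) for row in sales_rows}

# Semi-structured: records describe themselves, and fields may differ per record.
events_json = '[{"user": "a1", "cart": ["toy-car"]}, {"user": "b2", "coupon": "NEW10"}]'
events = json.loads(events_json)
semi_structured_schemas = {tuple(sorted(event.keys())) for event in events}

# Unstructured: an opaque sequence of bytes with no fields at all.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # first bytes of a PNG file signature

print(len(structured_schemas))       # number of distinct shapes among the sales rows
print(len(semi_structured_schemas))  # number of distinct shapes among the events
print(type(image_bytes).__name__)
```

The structured rows all share one shape, while the semi-structured events carry different keys per record, which is why they don't fit neatly into a single table.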

Data Size:

The amount of data we need to store, transact or analyse. For example:

10 years of sales data: 50 million records × 500 dimensions.
X-ray images: 5,000 images, each 50 × 50.
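A rough size estimate for these two examples can be sketched in a few lines of Python. The bytes-per-value figures are assumptions (8-byte numbers for each sales dimension, 1 byte per grayscale pixel, no compression); real storage formats will differ.

```python
def estimate_bytes(items: int, values_per_item: int, bytes_per_value: int) -> int:
    """Raw, uncompressed size: items x values per item x bytes per value."""
    return items * values_per_item * bytes_per_value

# Assumption: each sales dimension is stored as an 8-byte number.
sales_bytes = estimate_bytes(50_000_000, 500, 8)
# Assumption: each X-ray image is 50 x 50 grayscale pixels at 1 byte per pixel.
xray_bytes = estimate_bytes(5_000, 50 * 50, 1)

print(f"Sales data: {sales_bytes / 1e9:.1f} GB")
print(f"X-ray images: {xray_bytes / 1e6:.1f} MB")
```

Under these assumptions the sales data comes to about 200 GB and the images to about 12.5 MB, a difference of four orders of magnitude that already narrows the choice of service.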

Data Operations:

What type of operations do we need to perform on the data? For example:

Read-only: Suits historical data that never changes.
OLTP: Online Transaction Processing suits banking, shopping, order entry, or sending text messages, where a change in one piece of data results in a change in another piece of data.
OLAP: Online Analytical Processing suits drill-down analysis, where we perform multidimensional analysis at high speed on large volumes of data.
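The difference between the two workloads can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the `accounts` and `transfers` tables and the `transfer` helper are hypothetical, and SQLite stands in for a real cloud database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE transfers (src INTEGER, dst INTEGER, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(src: int, dst: int, amount: float) -> None:
    """OLTP-style operation: several small writes that succeed or fail together."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("INSERT INTO transfers VALUES (?, ?, ?)", (src, dst, amount))

transfer(1, 2, 30.0)
transfer(1, 2, 20.0)

# OLAP-style operation: a read-only aggregate over many rows.
totals = conn.execute(
    "SELECT src, COUNT(*), SUM(amount) FROM transfers GROUP BY src"
).fetchall()
balances = dict(conn.execute("SELECT id, balance FROM accounts").fetchall())
print(totals, balances)
```

The transfer touches a handful of rows and must be atomic, while the aggregate scans the whole table without changing it. Services tuned for one pattern are usually a poor fit for the other.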

Data Hotness:

Data hotness describes how frequently the data will be accessed and how long it will be retained.

Hot Data: Data that needs to be accessed or modified frequently, such as the viewer list on a WhatsApp status.
Cool Data: Data that needs to be accessed or modified infrequently, such as legal documents.
Archival Data: Data that is rarely accessed and has flexible latency requirements, on the order of hours, such as historical weather information.
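One way to turn these definitions into a rule is a small classifier like the sketch below. The function name and both thresholds (one access per month, thirty accesses per month) are hypothetical, chosen only to illustrate the idea; tune them to your own access patterns.

```python
def classify_hotness(accesses_per_month: float, tolerable_latency_hours: float) -> str:
    """Classify data as hot, cool, or archival.

    The thresholds are hypothetical and only illustrate the decision.
    """
    if accesses_per_month < 1 and tolerable_latency_hours >= 1:
        return "archival"  # rarely read, and waiting hours for retrieval is acceptable
    if accesses_per_month < 30:
        return "cool"      # read less than roughly once a day
    return "hot"           # read or modified frequently

print(classify_hotness(10_000, 0.001))  # e.g. a status viewer list
print(classify_hotness(2, 0.1))         # e.g. legal documents
print(classify_hotness(0.1, 12))        # e.g. historical weather information
```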

Data Availability:

How much uptime does the SLA require?

Uptime level	Uptime hours per year	Downtime hours per year
99.9%	8,751.24	(8,760 – 8,751.24) = 8.76
99.99%	8,759.12	(8,760 – 8,759.12) = 0.88
99.999%	8,759.91	(8,760 – 8,759.91) = 0.09
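The downtime figures follow directly from the uptime percentage. A short Python sketch of the arithmetic, assuming a 365-day year as the table does (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a service may be unavailable at a given uptime level."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

for sla in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(sla)
    print(f"{sla}% uptime allows {hours:.2f} hours ({hours * 60:.1f} minutes) of downtime per year")
```

Expressed in minutes, each extra "nine" cuts the allowed downtime by a factor of ten, from roughly 8.8 hours a year at 99.9% to about 5 minutes a year at 99.999%.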

Data Privacy:

Are we even allowed to move the data off on-premises machines?
Is specific encryption required before processing?
Are there specific operations we can't perform?
Is there a specific region in which the cloud service must reside?

Data Replication:

Do we need to replicate the data across multiple instances?

We may need replication because:

we need instances available in multiple regions to reduce latency.
we need copies for backup purposes.

Step 2: Understand the Nature of Cloud Storage Services

Every cloud storage service is designed for a specific purpose, which we need to understand (and preferably remember) so that we can recall it when a storage requirement comes up based on the characteristics of our data and data operations.

We take a few Microsoft Azure services as examples; you can look for the equivalent GCP or AWS service.

Service Name	Data & Data Operation Characteristics
Azure Data Lake Storage	To store huge amounts of data of any structure for big data analytics.
Azure Cosmos DB	To store NoSQL (semi-structured) data with 99.999% availability and data replication.
Azure SQL DB	To store relational data with OLTP capability; supports both scale-up and scale-out.
Azure Synapse Analytics	To be used for data warehousing and big data analytics. It can process massive amounts of data and answer complex business questions with limitless scale.
Azure Stream Analytics	To be used when working with real-time streaming pipelines. If you need to respond to data events in real time or analyze large batches of data in a continuous time-bound stream, Stream Analytics is a good solution. Suited to very hot data.
Azure HDInsight	To be used to ingest, process, and analyze big data. It supports batch processing, data warehousing, IoT, and data science. A low-cost solution for big data.
Azure Blob Storage	To store blob files, directories, and any type of unstructured data, which can be hot, cool, or archival.
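The mapping from characteristics to services can be sketched as a simple rule-based lookup. The `DataProfile` class, the `suggest_service` function, and the order of the rules are all hypothetical simplifications of the table; a real decision would also weigh size, hotness, availability, privacy, and cost.

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    structure: str   # "structured", "semi-structured", or "unstructured"
    operations: str  # "oltp", "olap", "read-only", or "streaming"

def suggest_service(profile: DataProfile) -> str:
    """Return a first candidate service (a hypothetical simplification of the table)."""
    if profile.operations == "streaming":
        return "Azure Stream Analytics"
    if profile.structure == "unstructured":
        return "Azure Blob Storage"
    if profile.structure == "semi-structured":
        return "Azure Cosmos DB"
    if profile.operations == "oltp":
        return "Azure SQL DB"
    if profile.operations == "olap":
        return "Azure Synapse Analytics"
    return "Azure Data Lake Storage"

print(suggest_service(DataProfile("unstructured", "read-only")))  # e.g. X-ray images
print(suggest_service(DataProfile("structured", "oltp")))         # e.g. customer accounts
```

Treat the result as a starting point for evaluation rather than a final answer.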

Step 3: Choose the Desired Storage Service based on your Data & Operation Characteristics

You can read the following use cases provided by Microsoft Azure to learn how different storage services solve different business problems.

Contoso Life Sciences is a cancer research center that analyzes petabytes of genetic data, patient data, and records of related sample data. Data Lake Storage Gen2 reduces computation times, making the research faster and less expensive.

Consider this example where Azure Cosmos DB helps resolve a business problem. Contoso is an e-commerce retailer based in Manchester, UK. The company sells children's toys. After reviewing Power BI reports, Contoso's managers notice a significant decrease in sales in Australia. Managers review customer service cases in Dynamics 365 and see many Australian customer complaints that their site's shopping cart is timing out.
Contoso's network operations manager confirms the problem: the company's only data center is located in London, and the physical distance to Australia is causing delays. Contoso applies a solution that uses the Microsoft Australia East datacenter to provide a local version of the data to users in Australia. Contoso migrates their on-premises SQL Database to Azure Cosmos DB by using the SQL API. This solution improves performance for Australian users. The data can be stored in the UK and replicated to Australia to improve response times.

At Absolutdata Analytics, we used Azure Blob Storage to store image data. The blob storage can be mounted as a volume on a VM using BlobFuse and used right away.

Thanks for reading :)
I hope you are now in a good position to examine the nature of the data and operations in your business use case and select the right cloud storage service.

DEV Community