Azure Storage is cheap if all you do is store data. For example, 1 TB of data storage starts from ~£15 per month.
But in reality we want to read/write the data and run some analytics on top of it (i.e. perform operations on the data). When you embark on an analytical project, it is important to have a decent estimate of both storage and compute cost.
For this example I am only going to focus on read operations. The following API calls are considered read operations: ReadFile, ListFilesystemFile.
As per the Azure pricing doc, reading 4MB of data is counted as 10,000 operations.
- If I read 8MB of data once: 8MB / 4MB = 2 chunks of 4MB, 2 x 10,000 = 20,000 operations
- If I read 16MB of data once: 16MB / 4MB = 4 chunks of 4MB, 4 x 10,000 = 40,000 operations
- If I read 500MB of data once: 500MB / 4MB = 125 chunks of 4MB, 125 x 10,000 = 1,250,000 operations
As per the Azure pricing doc, £0.0042 is the unit cost per 10,000 read operations.
£0.0042 x 125 units (1,250,000 / 10,000) = £0.525 for reading 500MB of data once.
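Here is a minimal Python sketch of the calculation above, so you can plug in your own read sizes. It assumes the interpretation used in this post: every 4MB chunk read counts as 10,000 operations, and £0.0042 is the unit price per 10,000 operations (check the current Azure pricing page for your region, as these figures change).

```python
import math

UNIT_PRICE_GBP = 0.0042      # per 10,000 operations (figure used in this post)
OPS_PER_4MB_CHUNK = 10_000   # this post's reading of the pricing doc

def read_cost_gbp(size_mb: float) -> float:
    """Estimated cost of reading `size_mb` of data once."""
    chunks = math.ceil(size_mb / 4)           # number of 4MB chunks
    operations = chunks * OPS_PER_4MB_CHUNK   # total operations
    units = operations / 10_000               # billable units of 10,000 operations
    return units * UNIT_PRICE_GBP

print(read_cost_gbp(8))    # 2 chunks   -> £0.0084
print(read_cost_gbp(16))   # 4 chunks   -> £0.0168
print(read_cost_gbp(500))  # 125 chunks -> £0.525
```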
Say I have a Spark or Azure Databricks script that is scheduled every 15 minutes and reads this 500MB of data each time. The price per day (1,440 minutes) for reading 500MB of data will be:
Cost per 15 minutes = £0.525
Cost per day (there are 96 fifteen-minute intervals per day) = 96 x £0.525 = £50.40
That's £1,512 per month (30 days) to read 500MB of data every 15 minutes.
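Extending the same sketch to a scheduled job makes the daily and monthly figures easy to reproduce (assuming the `read_cost_gbp` helper above, 96 runs per day and a 30-day month):

```python
runs_per_day = (24 * 60) // 15          # 96 fifteen-minute runs per day
cost_per_run = read_cost_gbp(500)       # £0.525 per 500MB read, from the sketch above

cost_per_day = runs_per_day * cost_per_run   # 96 x £0.525 = £50.40
cost_per_month = cost_per_day * 30           # ~£1,512 for a 30-day month

print(f"per day: £{cost_per_day:.2f}, per month: £{cost_per_month:.2f}")
```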
The above calculations were helpful for me to estimate the true cost of an analytical workload. Based on this, where possible we can tweak the workload/scripts and apply optimisations such as partition pruning, projection or selection so that we read only the data we need.
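As an illustration of those optimisations, here is a minimal PySpark sketch. It assumes the 500MB dataset is stored as date-partitioned Parquet; the storage account, container, path and column names (`event_date`, `user_id`, `event_type`) are hypothetical and only there to show the pattern.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical date-partitioned Parquet layout, e.g.
#   abfss://data@mystorageaccount.dfs.core.windows.net/events/event_date=2024-01-01/...
path = "abfss://data@mystorageaccount.dfs.core.windows.net/events"

df = (
    spark.read.parquet(path)
    # Partition pruning: filtering on the partition column means only the
    # matching folders are read instead of the full dataset.
    .where(F.col("event_date") == "2024-01-01")
    # Projection: select only the columns the job actually needs, so Parquet
    # column chunks for the remaining columns are never fetched.
    .select("user_id", "event_type")
)

df.show()
```

Both steps directly reduce the number of 4MB chunks read from storage, which is exactly what drives the read-operation cost estimated above.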