Ayaka Hara

Posted on Aug 10, 2021 • Edited on Aug 16, 2021

Cost comparison between Azure services to determine architecture

#azure

Cost is an important factor to consider when developing a cloud-based solution. The main purpose of cost estimation is to determine the architecture and to predict the future costs.

First, let's look at the architecture we finalized based on the cost estimation results and the project's specific requirements, for the case used as an example in this article.

In this article, I will explain how the cost estimation was done to determine this architecture, with actual data and requirements from a real project.

Note : The content of this article is as of August 2021. The cost was calculated using "Pay as you go" rate.

Requirements for cost estimation

1. Projected amount of data in the future

We perform cost estimation using the scale projected for the future, for example several years from now. If only the most recent data is used, scalability may not be taken into account, and when the amount of data grows, the cost may greatly exceed the budget and the architecture may need to be reexamined.

In this article, the projected data is going to be used as an example for cost estimation.

The assumption in this example is that telemetry messages from 10 devices are consolidated into one array and it is sent to one connector, which then goes to Azure.

Actual telemetry data

Here is an example of what the telemetry messages sent to Azure look like:

[
    { // device 1
        "deviceId": "ffee4208-eaca-4f7b-8882-fee956b3776a",
        "connectivity": "Online",
        "eventType": "Telemetry",
        "deviceTime": "2021-07-16T13:34:00.000Z",
        "connectorTime": "2021-07-16T13:34:00.000Z",
        "telemetryData": {
            "6E8E2CE5-3A7D-4997-9056-297BAD62C617": false, // data point - the max data size could be 0.05KB
            "1023EF00-093C-4702-886F-6C9C8B4D3169": 123,
            "46B219AF-E355-479D-B02C-274E09A38BDC": "1.234"
            ...
            ...
        }
    },
    { // device 2
        "deviceId": "44B5EF8A-0F54-4DBC-A343-58828892E2D2",
        "connectivity": "Online",
        "eventType": "Telemetry",
        "deviceTime": "2021-07-16T13:34:00.000Z",
        "connectorTime": "2021-07-16T13:34:00.000Z",
        "telemetryData": { 
            "2F91F10F-63BC-4E84-A72D-95BABB37C155": false, // data point - the max data size could be 0.05KB
            "CB7C127D-D3EB-475F-9FCB-0F721B582C58": 123,
            "4ADE8439-961D-493D-B306-D2567F87429A": "1.234"
            ...
            ...
        }
    },
    ...
]

As noted inline above, each data point can be up to 0.05KB (= 50 bytes) and each device sends up to 50 data points, so a per-device message can be up to 2.5KB. Because messages from 10 devices are consolidated into one array, a per-connector message can be up to 25KB.

This maximum size should be applied when cost is estimated.
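The maximum sizes above can be checked with a small calculation. This is a minimal sketch, assuming 1KB = 1,000 bytes (as in "0.05KB (= 50Byte)") and ignoring the JSON keys and metadata fields such as deviceId; the constant and function names are illustrative only.

```python
# Values taken from the article; names are hypothetical.
MAX_DATA_POINT_BYTES = 50          # 0.05KB per data point
MAX_DATA_POINTS_PER_DEVICE = 50    # data points sent by one device
DEVICES_PER_CONNECTOR = 10         # devices bundled into one array


def max_device_message_bytes() -> int:
    """Upper bound for one per-device telemetry message."""
    return MAX_DATA_POINT_BYTES * MAX_DATA_POINTS_PER_DEVICE


def max_connector_message_bytes() -> int:
    """Upper bound for one per-connector message (array of 10 devices)."""
    return max_device_message_bytes() * DEVICES_PER_CONNECTOR


print(max_device_message_bytes() / 1000, "KB per device")
print(max_connector_message_bytes() / 1000, "KB per connector")
```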

Number of devices

10 devices are connected to each connector. The total number of devices is expected to be 2000, which means there will be 200 connectors.

Frequency

Telemetry messages for 10 devices are bundled together in connector units and sent every second.

2. Data storage

Keep data for a certain period of time

Telemetry messages (raw data) must be kept in hot/warm storage for a certain period of time without being deleted, because we plan to create summaries from them and query them frequently.

In this article, the cost will be calculated assuming that the data is retained for 48 hours as an example.

3. Budget

Compared with competitors

It is important to consider differences in both cost and functionality when comparing services. We modeled different usage scenarios representing differences in data volumes, frequency and data retention. Competing services' end-user pricing was used as a guide to determine the budget for our solution.

Architecture options

Now it's time to go through the cost comparison and decide on the architecture.

For comparison, the architecture of the Azure part will be divided into three major categories.

Where to send
Compare IoT Hub and Event Hubs to see where to receive per-connector telemetry messages.
Where to process
Compare Stream Analytics and Functions to see where telemetry messages sent in bulk by connector are decomposed to per device.
Where to persist
Compare Cosmos DB and Table Storage to see where to store per-device telemetry messages.

The architecture decision was based not solely on cost but also on the project's specific requirements. Those requirements are explained in the selection criteria for each category below.

Cost comparison

This estimation will be based on the projected data volume and number of devices.

1. Where to send : IoT Hub vs Event Hubs

Selection criteria

A message size : 25KB
Frequency : telemetry message from 200 connectors sent every second

How to calculate

IoT Hub

Step 1 : Basic or Standard tier

Firstly you need to decide which tier you want to use.
In this project, Cloud-to-Device messaging is planned to be used in the future. Since the Basic tier does not support that feature, the Standard tier is selected this time.

Please refer to the document - Choose the right IoT Hub tier for your solution for more detailed information.

Step 2 : Number of messages per day

Next, the amount of messages sent to the IoT Hub per day should be calculated.
As mentioned earlier, a message is sent every second per connector, which means 200 messages are sent every second. The formula is as shown below.
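The daily message count follows directly from the figures above. A minimal sketch, with illustrative names, assuming exactly one message per connector per second around the clock:

```python
CONNECTORS = 200
MESSAGES_PER_CONNECTOR_PER_SECOND = 1
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400


def messages_per_day(connectors: int = CONNECTORS) -> int:
    """Messages arriving at the hub per day."""
    return connectors * MESSAGES_PER_CONNECTOR_PER_SECOND * SECONDS_PER_DAY


print(messages_per_day())  # 200 x 86,400
```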

Step 3 : Number of billing messages per day

Then the amount of billing messages sent to the IoT Hub per day should be calculated.
The message meter size for both the Standard and Basic tiers is 4KB. (Ref - IoT Hub pricing)
As mentioned above, messages from 10 devices are consolidated into one array, so the message size per connector is 25KB.
Based on the above information, the formula is as follows.
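The formula itself is not reproduced in this text, so the following is only a sketch under stated assumptions. Whether a 25KB message is counted as 25 / 4 = 6.25 billed messages or rounded up to 7 should be checked against the IoT Hub pricing page; the function below exposes both variants, and its name and flag are hypothetical.

```python
import math

METER_SIZE_KB = 4                   # from the article
MESSAGE_SIZE_KB = 25                # per-connector message
MESSAGES_PER_DAY = 200 * 60 * 60 * 24


def billed_messages_per_day(round_up_per_message: bool) -> float:
    """Billed messages per day for the 4KB meter size.

    round_up_per_message=True treats each message as whole 4KB chunks
    (an assumption to verify); False uses the plain ratio 25 / 4.
    """
    chunks = MESSAGE_SIZE_KB / METER_SIZE_KB
    if round_up_per_message:
        chunks = math.ceil(chunks)
    return MESSAGES_PER_DAY * chunks


print(billed_messages_per_day(round_up_per_message=False))
print(billed_messages_per_day(round_up_per_message=True))
```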

Step 4 : Edition Type, Number of units

There are three edition types under Standard tier : S1, S2, and S3. Each edition type has the limitation of total number of messages per day per IoT Hub unit. (Ref - IoT Hub pricing)

Based on the above information, the formula to calculate the number of units required respectively is as follows.
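As a rough sketch of that calculation: the per-unit daily quotas used below are assumptions that should be verified against the current IoT Hub pricing page, and the function name is illustrative.

```python
import math

# Assumed daily message quota per unit for each Standard edition.
DAILY_QUOTA_PER_UNIT = {
    "S1": 400_000,
    "S2": 6_000_000,
    "S3": 300_000_000,
}


def units_required(billed_messages_per_day: float) -> dict:
    """Units needed per edition, rounded up to whole units."""
    return {
        edition: math.ceil(billed_messages_per_day / quota)
        for edition, quota in DAILY_QUOTA_PER_UNIT.items()
    }


# Example: 17,280,000 messages x 25KB / 4KB = 108,000,000 billed messages.
print(units_required(108_000_000))
```

Because units are whole numbers, the result also shows why the cost rises in steps rather than linearly.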

Event Hubs

Step 1 : Basic, Standard or Dedicated tier

Since the max retention period for the Basic tier is 1 day, Standard tier needs to be selected to retain data for 48 hours.

If Event Hubs events are retained for up to 90 days the Dedicated tier should be selected.

Step 2 : Ingress events

Since 200 connectors send telemetry messages every second, the number of ingress events per month can be calculated by multiplying the number of connectors by the number of seconds per month.
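A minimal sketch of this multiplication, assuming a 30-day month (pricing calculators may use a different month length, so treat the figure as approximate):

```python
CONNECTORS = 200
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_MONTH = 30  # assumption for illustration


def ingress_events_per_month(connectors: int = CONNECTORS,
                             days: int = DAYS_PER_MONTH) -> int:
    """One event per connector per second over a whole month."""
    return connectors * SECONDS_PER_DAY * days


print(ingress_events_per_month())
```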

Step 3 : Throughput units

Then, the message size is multiplied by the number of connectors to calculate the ingress data size per second (25KB x 200 = 5,000KB). In practice, in addition to the 25KB message size, the system message size and other factors need to be taken into account. Therefore, the total is calculated to be 6TU, since 1TU is required for every 1,000KB of ingress per second.
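The throughput unit estimate can be sketched as follows. The 1,000KB per TU figure comes from the text above; the per-TU event limit and the one extra TU of headroom for system overhead are assumptions chosen for illustration, so check them against the Event Hubs quotas and your own measurements.

```python
import math

INGRESS_KB_PER_SECOND_PER_TU = 1000      # from the article
INGRESS_EVENTS_PER_SECOND_PER_TU = 1000  # assumption, verify in the quotas


def base_throughput_units(message_kb: float = 25,
                          connectors: int = 200,
                          messages_per_second: int = 1) -> int:
    """TUs needed for the raw payload, before any overhead."""
    events_per_second = connectors * messages_per_second
    kb_per_second = message_kb * events_per_second
    by_size = math.ceil(kb_per_second / INGRESS_KB_PER_SECOND_PER_TU)
    by_count = math.ceil(events_per_second / INGRESS_EVENTS_PER_SECOND_PER_TU)
    return max(by_size, by_count)


def throughput_units(headroom_tu: int = 1) -> int:
    """Base TUs plus a hypothetical allowance for system message overhead."""
    return base_throughput_units() + headroom_tu


print(base_throughput_units(), throughput_units())
```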

Step 4 (Optional) : Capture feature

The Azure Event Hubs Capture feature automatically processes and stores event data in your Azure storage account. The price is based on the number of Throughput Units selected for the Event Hubs. This time capture feature is not applied.
For more information, please see the pricing details page.

Tips

Tips 1 - IoT Hub : Increase in cost

The cost does not increase in proportion to the amount of data, but rather in a staircase pattern.

Tips 2 - IoT Hub: Number of connectors allowed in S3 and calculation method

The number of Connectors allowed in 1 unit of S3 is 555. The formula is as follows.

This means, of course, that the cost per connector will be lower when 555 connectors are used than when 200 connectors are used.
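The formula is not reproduced in this text, but the figure of 555 can be recovered with the sketch below, assuming an S3 quota of 300,000,000 messages per day per unit (to be verified on the pricing page) and the plain ratio 25KB / 4KB = 6.25 billed messages per message. If each message were instead rounded up to 7 billed messages, the result would be lower; the flag is there only to show that sensitivity.

```python
import math

S3_DAILY_QUOTA_PER_UNIT = 300_000_000  # assumption, verify on pricing page
SECONDS_PER_DAY = 60 * 60 * 24
MESSAGE_SIZE_KB = 25
METER_SIZE_KB = 4


def connectors_per_s3_unit(round_up_per_message: bool = False) -> int:
    """Connectors (one 25KB message per second each) that fit in one S3 unit."""
    chunks = MESSAGE_SIZE_KB / METER_SIZE_KB
    if round_up_per_message:
        chunks = math.ceil(chunks)
    billed_per_connector_per_day = SECONDS_PER_DAY * chunks
    return int(S3_DAILY_QUOTA_PER_UNIT // billed_per_connector_per_day)


print(connectors_per_s3_unit())
```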

Tips 3 - Event Hubs: Consider the Auto-Inflate setting

Event Hubs traffic is controlled by TUs (standard tier). For the limits such as ingress and egress rates per TU, see Event Hubs quotas and limits.
Auto-inflate enables you to start small with the minimum required TUs you choose. The feature then scales automatically to the maximum limit of TUs you need, depending on the increase in your traffic. Auto-inflate provides the following benefits:

An efficient scaling mechanism to start small and scale up as you grow.
Automatically scale to the specified upper limit without throttling issues.
More control over scaling, because you control when and how much to scale.

Note : Auto-Inflate is a scale-up only feature. It will not automatically scale down.

More information is available here.

Tips 4 - Event Hubs: Increase in cost

As with IoT Hub, the cost does not increase in proportion to the amount of data, but rather in a staircase pattern.

Tips 5 - Compare IoT Hub and Event Hubs

While IoT Hub can manage devices and provide two-way communication (C2D, D2C), Event Hubs can only provide one-way communication. Thus, when choosing Event Hubs, another option is to use it together with an S1 IoT Hub. Security should also be considered.

Please refer to Connecting IoT Devices to Azure: IoT Hub and Event Hubs.

2. Where to process : Stream Analytics vs Functions

Selection criteria

A message size : 25KB
Frequency : telemetry message from 200 connectors sent every second
Processing details :
- Decompose telemetry messages from per connector to per device (i.e. 200 messages from connectors to 2000 messages from devices)
- Save telemetry messages to specified storage dynamically
- The processing time between device and storage should be within 10 seconds
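The decomposition step can be illustrated with a few lines of code. This is a minimal sketch rather than the project's actual function: it assumes the payload is valid JSON (the sample earlier in the article contains comments and ellipses for readability) and that every element carries a deviceId field; the function name is hypothetical.

```python
import json
from typing import List, Tuple


def split_connector_message(payload: str) -> List[Tuple[str, str]]:
    """Split one per-connector JSON array into (deviceId, per-device JSON) pairs."""
    devices = json.loads(payload)
    if not isinstance(devices, list):
        raise ValueError("expected a JSON array of per-device messages")
    return [(device["deviceId"], json.dumps(device)) for device in devices]


sample = json.dumps([
    {"deviceId": "device-1", "telemetryData": {"point-a": False}},
    {"deviceId": "device-2", "telemetryData": {"point-b": 123}},
])
for device_id, body in split_connector_message(sample):
    print(device_id, body)
```

Each returned pair can then be written to storage as its own entity, which is how 200 connector messages become 2000 device messages per second.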

How to calculate

Stream Analytics

Step 1 : Standard or Dedicated plan

There are two plans: Standard and Dedicated.
For Dedicated plan, at least 36 streaming units (SUs) are required. Therefore, Standard plan is selected.

Please refer to Standard streaming unit section in the pricing page.

Step 2 : Number of streaming units (SUs)

Choosing how many SUs are required for a particular job depends on the partition configuration for the inputs and on the query defined for the job. You can select up to your quota in SUs for a job. By default, each Azure subscription has a quota of up to 500 SUs for all the analytics jobs in a specific region.
Valid values for SUs per job are 1, 3, 6, and up in increments of 6.

The keys to determining the appropriate SU from load testing are

SU % utilization should stay below 80%.
Backlogged input events should not be occurring. If the value is slowly increasing or consistently non-zero, the workload may require more computing resources and the number of SUs needs to be increased.

Please refer to Understand and adjust Streaming Units.

Load testing should be conducted to determine how many SUs are needed.

Let's see how we checked the metrics to determine the number of SUs required when we conducted our load tests. We performed load tests on 1, 3, 6, 12, and 24 SUs respectively.

Here are the metrics for the 3 SUs as an example.

As shown in the figure above, SU % utilization is 52%, which means that it meets the criteria of less than 80%.
Watermark delay is the time an event left the job minus the time it entered the job. 3 SUs had a maximum delay of 3.18 min.
Backlogged input events should be as close to 0 as possible. A non-zero value means that the number of SUs is not enough for the job. For 3 SUs, backlogged input events reached 5.95k, indicating that the processing was not able to keep up.
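The two checks can be expressed as a simple predicate to apply to each load test run. This is a simplified sketch: it treats any non-zero backlog as a failure, whereas in practice a small, non-growing backlog may be tolerable, and the function name is illustrative.

```python
def su_count_is_sufficient(su_utilization_percent: float,
                           backlogged_input_events: int,
                           max_utilization_percent: float = 80.0) -> bool:
    """True when utilization is under the threshold and nothing is backlogged."""
    return (su_utilization_percent < max_utilization_percent
            and backlogged_input_events == 0)


# The 3 SU run described above: 52% utilization but 5.95k backlogged events.
print(su_count_is_sufficient(52, 5950))
```

The 3 SU run passes the utilization check but fails on the backlog, which is why utilization alone is not enough to size the job.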

The table below summarizes the results for the other patterns.

As a result of the load testing in our case, we found out that 6 SUs are required for the Standard plan.

Additional required step : Consider combining services to save to a specified table name dynamically

Stream Analytics has limited flexibility in export destination and cannot dynamically save to a specified table. Therefore, other services need to be combined to achieve that. In our project, we selected Functions and conducted load testing in combination with Stream Analytics.

The cost of the Functions required to dynamically save to the specified table is shown in the figure below (The detailed costing method is described in the next section).

Just as explained above, if you choose Stream Analytics, you will need to combine it with Functions, which will cost you the sum of the costs of both services.

Functions

Step 1 : Consumption, Premium or App Service Plan

In our case, the system will potentially be scaled down or switched off periodically to reduce cost. Therefore, the Consumption plan is not an option because it cannot be set to Always on and will result in cold starts.
Cold start describes the phenomenon where applications that haven’t been used recently take longer to start up. In other words, a cold start is an increase in latency for Functions which haven’t been called recently. The Always on setting is available only on an App Service plan, where cold start therefore isn’t really an issue.

As for the Premium plan, it can avoid cold starts with perpetually warm instances. (Ref: Azure Functions Premium plan)

In our case, the cold start issue needs to be avoided since the system may be scaled down or switched off periodically to reduce cost.

Based on the above, we will compare the premium plan and app service plan in the next step.

Step 2 : Instance, Number of instances

Load testing should be conducted to determine which instance to use and how many instances are needed.

The following are some examples of points to check the metrics in Functions during load test execution.

Check if all the messages you sent are processed
Check if the CPU usage rate is not over 80%
Check if the memory is settled
Check that the total execution count matches the number of inputs to the functions as expected
Check if the average duration for processing in the functions is not too long

As a result of the load test, we found out that the Premium plan requires 6 EP2 instances, and the App Service plan requires 4 P1v3 instances.
The results of the cost estimation are as follows.

Aside from the cost, the Premium Plan was not able to process as stably as the App Service Plan even when it scaled out sufficiently.

Therefore, when selecting Functions, setting up 4 P1v3 instances of the App Service Plan was the optimal option for us.

Tips

Tips 1 - Stream Analytics : Limited flexibility for output destinations

For example, if you want to store your telemetry messages in appropriate tables created in 10-minute increments based on the timestamps contained in the telemetry messages, Stream Analytics alone will not be able to accomplish this.
If you have a requirement to dynamically specify the destination table like our case mentioned above, you will need to use a different service together, which will increase the cost.
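To make the requirement concrete, here is a sketch of how a destination table name could be derived from a message timestamp in 10-minute increments. The naming scheme (a "telemetry" prefix followed by the bucket start time) is hypothetical and not the project's actual convention; it only uses letters and digits and starts with a letter so that it stays a valid table name. The bucket size is assumed to divide 60 evenly.

```python
from datetime import datetime, timezone


def table_name_for(device_time: str,
                   prefix: str = "telemetry",
                   bucket_minutes: int = 10) -> str:
    """Map a timestamp like 2021-07-16T13:34:00.000Z to its 10-minute table."""
    ts = datetime.strptime(device_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    # Round down to the start of the bucket that contains the timestamp.
    bucket_start = ts.replace(minute=ts.minute - ts.minute % bucket_minutes,
                              second=0, microsecond=0)
    return f"{prefix}{bucket_start:%Y%m%d%H%M}"


print(table_name_for("2021-07-16T13:34:00.000Z"))
```

Logic like this is straightforward in Functions because the output table can be chosen per message at run time.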

3. Where to persist : CosmosDB vs Table Storage

Selection criteria

Keep data for 48 hours
Retrieve the latest n data
2 regions (Japan East / Japan West) for redundancy in case of failure

How to calculate

CosmosDB

I recommend using capacity planner to calculate the cost of Cosmos DB.

Step 1 : API

There are multiple choices: SQL API, Cassandra API, Gremlin API, Table API and Azure Cosmos DB API for MongoDB etc.

In this example, SQL API is selected.

If you are using the API for MongoDB, see the article on how to use the capacity calculator with MongoDB.

Step 2 : Number of regions

Azure Cosmos DB is available in all Azure regions. The number of regions required should be selected for your workload.

In this example, the requirement is for two regions, Japan East and Japan West, so enter 2 as the number of regions. This keeps the conditions aligned with Table Storage, for which we selected GRS as described later.

Step 3 : Total data stored in transactional store

Total projected data stored(GB) in the transactional store in a single region.

The data will be stored for 48 hours and then deleted, which means that 48 hours of data will always be stored in the storage. Thus, the calculation formula is as follows.
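The formula itself is not reproduced in this text. A sketch of a steady-state estimate under stated assumptions is shown below: raw payload only (no index or metadata overhead), 1GB = 1,000,000KB, and a parameter for how many connectors are in scope, since the scope used in the original figure is not shown here.

```python
def stored_data_gb(retention_hours: int = 48,
                   message_kb: float = 25,
                   connectors: int = 200,
                   messages_per_second: int = 1) -> float:
    """Raw data volume held at any time when data older than the retention is deleted."""
    retention_seconds = retention_hours * 60 * 60
    total_kb = message_kb * connectors * messages_per_second * retention_seconds
    return total_kb / 1_000_000


print(stored_data_gb())              # all 200 connectors
print(stored_data_gb(connectors=1))  # a single connector
```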

Step 4 : Expected size of items/documents

The expected size of the data item (for example, document), ranging from 1 KB to 2 MB.

As mentioned above, the data size of a per-device telemetry message is 2.5KB.

However, you can only input data in units of 1KB into the capacity planner. Therefore, in this example, we will use 2KB.

Tiny advice : When adjusting the expected size of items/documents in the capacity planner, the arrow keys can be used to change the value in small steps.

Step 5 : Number of Point reads/Creates/Updates/Deletes operations expected per second per region to calculate RU (Request Unit)

The calculation of RU (Request Unit) for Azure Cosmos DB is not as simple as 2KB docs x 1000 = 2000RU.

I highly recommend using the capacity planner to calculate RU by inputting the number of Point reads/Creates/Updates/Deletes operations expected per second per region.

The following figure shows the result of the cost estimation after inputting the above information into the capacity planner.

Table Storage

Step 1 : Redundancy

First of all, you need to select the best redundancy option.
There are 6 options :

Locally redundant storage (LRS)

Within a single physical location in the primary region, the data is copied three times synchronously.

Zone-redundant storage (ZRS)

Copy data synchronously between the three Azure Availability Zones in the primary region.

geo-redundant storage (GRS)

Replicate synchronously three times (at one physical location) in the primary region using locally redundant storage (LRS), and then asynchronously to the secondary region.

read-access geo-redundant storage (RA-GRS)

In addition to geo-redundant storage (GRS), you have read access to data located in a secondary region. If the primary becomes unavailable, you can read the data from the secondary.
RA-GRS is more expensive to use than GRS, but avoids data read downtime while the primary region is unavailable and a failover to the secondary region is performed.

geo-zone-redundant storage (GZRS)

Replicate synchronously between the three Azure Availability Zones in the primary region using Zone Redundant Storage (ZRS), and then asynchronously to the secondary region.

read-access geo-zone-redundant storage (RA-GZRS)

In addition to geo-zone-redundant storage (GZRS), you have read access to data located in a secondary region. If the primary becomes unavailable, you can read the data from the secondary.
Although the cost of using RA-GZRS is higher than GZRS, it is recommended to use RA-GZRS when even a small amount of downtime due to failover is not acceptable.

You can find out more about each redundancy option here.

In our case, we chose GRS, an option that replicates data to another region hundreds of kilometers away and offers higher availability and durability than LRS or ZRS.

Step 2 : Storage capacity in GB per month

The data will be stored for 48 hours and then deleted, which means that 48 hours of data will always be stored in the storage. Thus, the calculation formula is as follows.

Step 3 : Storage transactions

Tables are charged $0.00036 per 10,000 transactions. (Ref - Table Storage pricing)
Any type of operation against the storage is counted as a transaction, including reads, writes, and deletes.

Since we plan to delete the entire table at once, the number of delete operations is very small. Therefore, only write operations are counted here.
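A rough sketch of the write-only transaction estimate follows. It assumes one transaction per per-device entity (no batching), one write per device per second, and a 30-day month; these are simplifications for illustration, and the original figure may have been derived differently.

```python
PRICE_PER_10K_TRANSACTIONS_USD = 0.00036  # from the article


def monthly_write_transactions(devices: int = 2000,
                               writes_per_device_per_second: int = 1,
                               days: int = 30) -> int:
    """Number of entity writes per month, one transaction each."""
    return devices * writes_per_device_per_second * 60 * 60 * 24 * days


def monthly_transaction_cost_usd(transactions: int) -> float:
    """Transaction charge at the per-10,000 price."""
    return transactions / 10_000 * PRICE_PER_10K_TRANSACTIONS_USD


writes = monthly_write_transactions()
print(writes, round(monthly_transaction_cost_usd(writes), 2))
```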

Tips

Tips 1 - Table Storage : Easy to retrieve the n entities most recently added

If you consider only the cost, Blob Storage is cheaper than Table Storage. The cost of Blob Storage (Standard/48 hours in hot/GRS) is as follows.

However, as mentioned in the selection criteria, there is a requirement to retrieve the latest n data, and Blob Storage, which is not searchable, is not suitable for this requirement.

Please see more details about log tail pattern and the solution.
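As an illustration of the log tail pattern, the sketch below builds a RowKey from inverted ticks so that the newest entity sorts first within a partition. It mirrors the .NET idiom of subtracting the current ticks from DateTime.MaxValue.Ticks; the function name is hypothetical, the timestamp is assumed to be timezone-aware, and this is not necessarily the key design used in the project.

```python
from datetime import datetime, timezone

DOTNET_MAX_TICKS = 3_155_378_975_999_999_999  # DateTime.MaxValue.Ticks
_TICK_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)


def inverted_tick_row_key(ts: datetime) -> str:
    """RowKey that sorts newest first (ticks are 100-nanosecond units)."""
    delta = ts - _TICK_EPOCH
    ticks = ((delta.days * 86_400 + delta.seconds) * 10_000_000
             + delta.microseconds * 10)
    # Zero-pad so that string order matches numeric order.
    return f"{DOTNET_MAX_TICKS - ticks:019d}"


older = datetime(2021, 7, 16, 13, 34, 0, tzinfo=timezone.utc)
newer = datetime(2021, 7, 16, 13, 34, 1, tzinfo=timezone.utc)
print(inverted_tick_row_key(newer) < inverted_tick_row_key(older))
```

Because entities are returned in ascending key order, reading the first n entities of a partition then yields the n most recent ones without a scan.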

Tips 2 - Table Storage : Easy to delete a table instead of entities

As mentioned in the selection criteria, telemetry data will be stored for 48 hours and then deleted.

Another advantage of using Table Storage is that it allows you to delete a whole table instead of deleting entity by entity. In other words, the cost of the operation is lower than the cost of deleting entities one by one, because the data can be removed with a single table deletion.

Please refer to the document about deleting table.

Other option : Data Explorer

Data Explorer was also considered as an option that combines both where to process and where to persist.

I recommend using Azure Data Explorer (Kusto) Cost Estimator to calculate the cost of Data Explorer.
The data collected per day is 2,160,000 KB (i.e. 0.00216TB). However, since the estimator does not accept values below 0.01TB, 0.01TB was entered.

The result of the cost estimation shown below is the minimum cost of Data Explorer (when data is retained in hot for 48 hours) without any load testing.

At first glance, Data Explorer may seem low cost, but again, it is the minimum cost to prepare two E2A_v4, the smallest machine. (Ref - Azure Data Explorer pricing)

Moreover, in our case, the cost estimate was made considering future scalability. However, it should also be noted that this cost will be incurred even when the amount of data is not much larger than this estimate.

With Data Explorer, the more data you have, the greater the cost benefit. Also, since the powerful analysis function is one of its most attractive features, Data Explorer should be considered depending on the requirements, such as the need to analyze the stored data.

Result summary

So far we have detailed the cost estimation for each service to determine the architecture.
The data used in the cost estimation was the projected data.

Let's go over the requirements and potential services for each part of the overall architecture again.

Where to send

Event Hubs resulted in significantly lower costs than IoT Hub.
However, as mentioned earlier, while IoT Hub can manage devices and provide two-way communication (C2D, D2C), Event Hubs can only provide one-way communication. Thus, when choosing Event Hubs, another option is to use it together with an S1 IoT Hub. Security should also be considered.

Where to process

Since Stream Analytics has limited flexibility in output destination, it needs to be combined with other services such as Functions in order to dynamically store the data in the specified table storage, which is more expensive than Functions alone.

In addition, there is a requirement that the processing time between device and storage should be within 10 seconds, so performance must be checked in the load test to determine the appropriate number of streaming units and instances.

Where to persist

Table Storage results in significantly lower cost than Cosmos DB.
The reason for this is that with Cosmos DB, as the number of operations (Point reads, Creates, Updates, Deletes) increases, the RU increases and the cost becomes higher.

Also, having 2 regions was one of the selection criteria this time, which made the cost higher. If only one region is used, the cost of Cosmos DB is simply halved, but Table Storage is still cheaper.

Furthermore, the fact that Table Storage can be used to retrieve the latest n data was also a big advantage in this scenario.

Estimated cost of the determined architecture

Based on the above cost comparison as well as project's specific requirements, the final architecture we decided on is shown in the figure below.

Here is the total cost of Azure for this architecture.

The architecture fits neatly into the budget of $8 per connector.
The processing time between device and table storage also met the requirement of 10 seconds or less: the average processing time was 3.934 seconds, measured by querying 20 minutes of data. The method used to calculate the processing time will be described in another article.

Importance of load testing

While some calculations can be done theoretically based on the amount of data etc, load testing was necessary to estimate the cost of Stream Analytics and Functions and measure the processing time between device and storage.

In our case, we conducted load tests using the IoT telemetry simulator. Just as the maximum size of telemetry messages sent by 10 devices is applied in the cost estimation, the message to be sent using the simulator was also made to be the maximum size to put an assumed load on it (i.e. 25KB).

Conclusion

Cost is an unavoidable issue in developing a cloud-based solution.

Cost estimation has advantages beyond understanding the cost, such as finding the best method with a limited budget, having elements to beat the competition, and considering the architecture with future scalability.

Although this might be a slightly complicated task, it is recommended to do a cost estimation when considering the architecture.
I hope this article helps you understand how to determine an architecture from cost estimation.
