If you're not yet aware of the Azure Spring Clean initiative, please head to https://www.azurespringclean.com where you'll find more great content presented by awesome people. You can also share your excitement on Twitter by using the hashtag
Okay. Now, let's get to our topic for today!
You may know it by now: data is what fuels applications and services. In order to feed our applications and services with the data they need to perform their operations, we need to store that data somewhere, be it in a database, a directory or somewhere else. On Azure, a storage account is often considered an option for storing data since it provides us with a reliable, cost-effective, easy-to-use-yet-powerful storage mechanism.
However, regardless of the storage mechanism we rely on to store our data, it is worth mentioning that such data isn't all used in the same way, nor at the same frequency. For that matter, the Azure Storage Account service provides us with four services, namely Blob, Files, Queues and Tables, which allow us to store different kinds of data and work with them in different fashions. For more information on these four services, you can refer to the official Microsoft documentation for the Azure Storage Account service.
When it comes to storage access tiers, the Azure Storage Account service provides us with three options:
- Hot: Optimized for storing data that is accessed frequently;
- Cool: Optimized for storing data that is infrequently accessed and stored for at least 30 days;
- Archive: Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
Generally speaking, data in the Hot access tier is more expensive to store than data in the Cool access tier which, in turn, is more expensive to store than data in the Archive access tier, and the Azure Storage Account service is no exception to this general rule.
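To make the storage-cost ordering concrete, here's a small sketch that compares hypothetical monthly capacity costs across the three tiers. The per-GB prices below are illustrative round numbers I made up for this example; they are NOT current Azure pricing, and they ignore transaction and retrieval charges entirely.

```python
# Hypothetical per-GB monthly prices, for illustration only -- NOT real
# Azure pricing. Retrieval and transaction costs are deliberately excluded.
PRICE_PER_GB = {
    "Hot": 0.018,
    "Cool": 0.010,
    "Archive": 0.002,
}

def monthly_storage_cost(size_gb: float, tier: str) -> float:
    """Capacity cost only: size in GB times the tier's per-GB price."""
    return size_gb * PRICE_PER_GB[tier]

for tier in ("Hot", "Cool", "Archive"):
    print(f"{tier:>7}: ${monthly_storage_cost(1024, tier):.2f} / month for 1 TiB")
```

Whatever the exact numbers are at any point in time, the ordering holds: Hot costs the most to store, Archive the least.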
So, you may ask: shouldn't we always use the Archive access tier when it comes to storing our data?
Well, it's not that simple...
Although storing data in the Archive access tier is the cheapest option, it is the most expensive when it comes to retrieving that data. And when I say "most expensive", I'm not only referring to the monetary cost of the transaction to retrieve that data, but also to the time required to retrieve it. For example, it might take up to several hours to retrieve data stored in the Archive access tier. Thus, you definitely don't want to retrieve data from the Archive access tier when the request was initiated by a user through your web application, knowing that this user is waiting in front of their screen to be presented with this information!
Right now, you might have totally changed your mind and be thinking that you should always rely on the Hot access tier. Am I right?
Well, once again, it's not that simple...
It is still worth setting the right access tier depending on the context we're in. Note that you don't necessarily have to choose only one access tier for ALL the data you store in an Azure Storage Account instance. You may (and should) fine-tune the access tier for each piece of data.
Another factor to take into account is that, as time passes, a given piece of data might need to be moved to a different access tier.
For example, let's say that my application collects the sales data for a given store and processes it to, let's say, evaluate inventory provisioning for the most in-demand products. I would probably need to use the Hot access tier for that data since I'm expecting to use it quite often in the upcoming days. However, the sales data of the last quarter or even last year might not need to be accessed so often; thus, the Cool access tier would probably be more appropriate. Now, when it comes to the sales data of the last 5 or 10 years, chances are that the Archive access tier will be more appropriate since you'd probably not need to access that data often but still want to keep it for some reason, be it for legal purposes, for later ML processing or simply... "just in case".
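The sales-data scenario above boils down to a simple age-based decision. Here's a hypothetical helper (not an Azure API) that sketches that decision, using the same 14-day and 30-day thresholds we'll use in the lifecycle rule later on:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical helper for illustration: pick an access tier from a blob's
# last-modified timestamp. Thresholds (14 and 30 days) mirror the
# lifecycle rule we set up later in this post.
def suggest_tier(last_modified: datetime, now: Optional[datetime] = None) -> str:
    now = now or datetime.now(timezone.utc)
    age = now - last_modified
    if age < timedelta(days=14):
        return "Hot"        # recent data, accessed frequently
    if age < timedelta(days=30):
        return "Cool"       # older data, accessed occasionally
    return "Archive"        # historical data, kept "just in case"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(suggest_tier(now - timedelta(days=1), now))    # recent -> Hot
print(suggest_tier(now - timedelta(days=20), now))   # last month -> Cool
print(suggest_tier(now - timedelta(days=400), now))  # last year -> Archive
```

Of course, hand-rolling this logic also means hand-rolling the plumbing that runs it, which is exactly what the feature below saves us from.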
We could definitely write some code or script for that matter (we can think of an event-driven application such as an Azure Function, a Logic App or an Automation Runbook). However, it is interesting to note that the Azure Storage Account service provides us with a functionality that does just this. This functionality is called "Lifecycle Management".
First, you need to know that the lifecycle management feature applies to Azure Storage Account instances of type General Purpose V2. If yours is of type General Purpose V1, you can go to Configuration and upgrade it to General Purpose V2.
Okay, now let's set up a lifecycle management policy!
Here's our use-case scenario:
* we may want to change the access tier of a blob from Hot to Cool if the blob hasn't been modified for the last 14 days
* we may want to change the access tier of a blob from Cool to Archive if the blob hasn't been modified for the last 30 days
* we can also define a rule to delete the blob after a given period of time if that's what we need, but in this case, we won't.
The lifecycle management feature can be found under the Data management section of our storage account. Once we get there, we simply click on Add a rule.
In the Details window, we can set a name for our rule and decide the scope and type of the blobs to which the rule should apply.
Note that we can access the Filter set window only if we choose the Limit blobs with filters option. The Filter set window is where we indicate which blobs or containers these rules apply to.
We can specify multiple entries in the "Blob prefix" list in order to apply the defined rules to multiple blobs and/or containers within that storage account.
There's a catch, however: if you don't specify at least one value in the "Blob prefix" list, then the rules will apply to every blob in the current storage account instance! Thus, you'd want to pay special attention to that.
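To make that catch explicit, here's a small simulation (again, not an Azure API, just an illustration of the behavior) of how prefix filtering scopes a rule, including the empty-filter case:

```python
# Simulation of "Blob prefix" filtering behavior -- not an Azure API.
# A rule with prefixes applies only to blobs whose names start with one
# of them; a rule with NO prefixes applies to every blob in the account.
def blobs_matched_by_rule(blob_names, prefixes):
    if not prefixes:
        # No filter set: the rule applies account-wide!
        return list(blob_names)
    return [b for b in blob_names if any(b.startswith(p) for p in prefixes)]

blobs = ["sales/2024/jan.csv", "sales/2024/feb.csv", "logs/app.log"]
print(blobs_matched_by_rule(blobs, ["sales/"]))  # only the sales blobs
print(blobs_matched_by_rule(blobs, []))          # every blob in the account
```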
We finally click the Add button, and our new rule is created and enabled!
It is worth mentioning that, although we demonstrated the creation of the lifecycle management rules right from the Azure Portal, we could also create them using the SDKs or the command line (e.g. Azure CLI or PowerShell).
Another thing to mention is that, at the end of the day, these rules translate into JSON. Thus, we can grab that JSON, store it somewhere (e.g. in our source control repository of choice) and reuse it in another project.
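For illustration, a policy implementing our scenario might look roughly like the following. Treat this as a sketch: the rule name and the `sales-data/` prefix are placeholders I chose for this example, so check the policy against the official lifecycle management documentation before reusing it.

```json
{
  "rules": [
    {
      "enabled": true,
      "name": "tier-down-sales-data",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 14 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 30 }
          }
        },
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "sales-data/" ]
        }
      }
    }
  ]
}
```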
At the time of this writing, we can only define rules based on the last modification date of the blobs.
The rules are applied once every 24 hours. Thus, if we create a new rule or modify an existing one, it may take up to 24 hours for it to be executed.
At the time of this writing, there is no way to define a schedule for when a rule should be executed, nor to execute it on demand.
Today, we saw that Azure Storage Lifecycle Management provides us with an easy and automated way to set the storage access tier of our data to the right value at the right time, without involving extra development or extra services.
You can reach me on Twitter.
See you soon!