What is Retention Period?
The retention period on an RDS instance determines how long automated backups are kept.
When a DB instance is created, the backup retention period is set to 7 days by default, which means each automated backup of the instance is kept for 7 days.
In our case, the people who created an instance often forgot to change this setting when backups weren't needed.
The retention period can be set to 0, in which case no automated backups are taken for the DB instance. This suited our Dev environments, where we had no need to store database backups.
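As a minimal sketch of what "setting the retention period to 0" looks like in code, assuming Python with boto3: `modify_db_instance` and its `BackupRetentionPeriod` parameter are real RDS API names, while the two helper functions and the idea of passing the client in as an argument are only illustrative.

```python
def instances_needing_reset(db_instances):
    """Return the identifiers of instances whose retention period is above 0.

    `db_instances` is a list shaped like the "DBInstances" entries returned
    by the RDS DescribeDBInstances API.
    """
    return [
        db["DBInstanceIdentifier"]
        for db in db_instances
        if db.get("BackupRetentionPeriod", 0) > 0
    ]


def disable_automated_backups(rds_client, instance_id):
    """Set the backup retention period of one instance to 0.

    `rds_client` is assumed to be a boto3 RDS client, for example
    boto3.client("rds"). The instance must be in the available state
    for the modification to be accepted.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        BackupRetentionPeriod=0,  # 0 disables automated backups
        ApplyImmediately=True,    # do not wait for the maintenance window
    )
```

With a real client this would be called as `disable_automated_backups(boto3.client("rds"), "my-dev-db")`, where `my-dev-db` is a placeholder identifier.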
When we did this clean-up manually every week, setting the retention period to 0 turned out to be tedious: if an instance was stopped, we had to start it and wait for it to become available, modify it, wait for the modification to complete, and then stop it again.
Repeat this for a lot of instances and you could easily spend half a day on it.
To automate this, I decided to go with a serverless solution comprising Step Functions, Lambda and EventBridge Scheduler.
Prerequisites:
- IAM role for Step Functions to invoke the Lambda functions and to perform start, stop and modify actions on the RDS instances
- IAM role for the Lambda functions to allow sending the callback response to Step Functions (sendTaskSuccess & sendTaskFailure)
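The two roles above could carry policies roughly like the following. This is a sketch, assuming Python is used to build the policy documents. The action names are real IAM actions, but the `"*"` resources are placeholders that should be narrowed to specific ARNs, and a Lambda function that calls RDS itself would need the matching RDS actions as well.

```python
import json

# Policy for the Step Functions role: invoke Lambda and act on RDS instances.
STEP_FUNCTIONS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction"],
            "Resource": "*",  # placeholder, restrict to the function ARNs
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:StartDBInstance",
                "rds:StopDBInstance",
                "rds:ModifyDBInstance",
            ],
            "Resource": "*",  # placeholder, restrict to the DB instance ARNs
        },
    ],
}

# Policy for the Lambda role: report the callback result to Step Functions.
LAMBDA_CALLBACK_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["states:SendTaskSuccess", "states:SendTaskFailure"],
            "Resource": "*",  # placeholder, restrict to the state machine ARN
        }
    ],
}


def to_policy_json(policy):
    """Serialize a policy dict to the JSON text that IAM expects."""
    return json.dumps(policy, indent=2)
```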
The Step Functions state machine has the following states:
- Lambda Invocation
- Map state
- Choice state
- Workflow 1 when the DB is in stopped state
- Workflow 2 when the DB is in starting/rebooting/backing-up state.
- Workflow 3 when the DB is in available state.
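The routing done by the Choice state can be pictured as a simple lookup on the instance status. The sketch below is plain Python rather than the actual state machine definition. The status strings are real RDS instance statuses, but the workflow labels and the `"skip"` fallback for any other status are assumptions made for illustration.

```python
# Map each RDS instance status to the workflow that handles it.
WORKFLOW_BY_STATUS = {
    "stopped": "workflow_1",     # start, modify, then stop again
    "starting": "workflow_2",    # wait until available, then modify
    "rebooting": "workflow_2",
    "backing-up": "workflow_2",
    "available": "workflow_3",   # modify right away
}


def choose_workflow(db_status):
    """Return the workflow label for a status, or "skip" if none matches."""
    return WORKFLOW_BY_STATUS.get(db_status.lower(), "skip")
```

Inside a Map state, this decision would be made once per DB instance in the input list.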
Let's go over each state using a short demo:
1.Lambda Invocation : [AWS SDK Integrations]
2.Map & Choice states :
3.DB is in Stopped State:
In the Lambda function we discussed in the demo, the SDK waiter API is implemented for the instance as shown in the below code snippet.
In the same function, I had to introduce a lag of a few seconds between the modify call and the waiter call, because after the modify_db_instance SDK call is fired, the operation only starts after a few seconds.
Note: If the lag is not introduced, the DB is still in the available state when the waiter is called, so the waiter returns immediately and the task token is sent back, moving the execution to the next state before the modification has been applied.
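The modify, lag, wait and callback sequence could look roughly like this. It is a sketch under assumptions, not the code from the repo: it assumes Python with boto3 clients passed in as arguments, the function name and the 10-second default lag are made up for illustration, and `db_instance_available`, `send_task_success` and `send_task_failure` are the boto3 names for the RDS waiter and the Step Functions callback calls.

```python
import json
import time


def modify_and_report(rds_client, sfn_client, instance_id, task_token,
                      lag_seconds=10, sleep=time.sleep):
    """Set retention to 0, wait for the instance, then answer the callback.

    `lag_seconds` is an assumed value; it only needs to be long enough for
    the instance status to leave "available" after the modify call.
    """
    try:
        rds_client.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            BackupRetentionPeriod=0,
            ApplyImmediately=True,
        )
        # Without this lag the waiter would see "available" and return at once.
        sleep(lag_seconds)
        waiter = rds_client.get_waiter("db_instance_available")
        waiter.wait(DBInstanceIdentifier=instance_id)
        # Hand control back to the paused Step Functions task.
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"instance": instance_id, "retention": 0}),
        )
    except Exception as exc:
        # Report the failure so the execution does not hang until timeout.
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ModifyFailed",
            cause=str(exc)[:1000],
        )
```

Passing `sleep` in as a parameter is only there to make the function easy to exercise without real waiting; in a Lambda handler the default `time.sleep` would be used.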
Let's see how the execution flow works for a single DB instance in Stopped state.
Demo: Step Function execution for Instance in Stopped state
The entire execution for the single instance takes around 14 minutes.
4.DB is in starting/rebooting states:
Let's see how the execution flow works for a single DB instance in Rebooting state.
Demo for Instance in Rebooting State
The entire execution for the single instance takes almost 5 minutes.
5.DB is in Available State:
Let's see how the execution flow works for a single DB instance in Available state.
Demo: Step Function execution for Instance in Available state
The entire execution for the single instance takes almost 3 seconds.
We can introduce an EventBridge schedule to execute this step function twice a week or once a month to automate the process further.
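A twice-a-week schedule could be described like this. The sketch assumes Python with boto3 and EventBridge Scheduler. The schedule name, the Monday and Thursday 06:00 UTC cron expression and the empty input are illustrative choices, and both ARNs are placeholders to be supplied by the caller.

```python
import json


def build_schedule_request(state_machine_arn, role_arn,
                           name="rds-retention-reset"):
    """Build the keyword arguments for EventBridge Scheduler's create_schedule.

    The role behind `role_arn` is assumed to allow states:StartExecution
    on the state machine.
    """
    return {
        "Name": name,
        # Fields: minute hour day-of-month month day-of-week year
        "ScheduleExpression": "cron(0 6 ? * MON,THU *)",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": state_machine_arn,
            "RoleArn": role_arn,
            "Input": json.dumps({}),  # input passed to the execution
        },
    }
```

With a real client the request would be sent as `boto3.client("scheduler").create_schedule(**build_schedule_request(arn, role))`.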
I have shown a single use case of setting the RDS retention period to 0 using Step Functions, but this can be adapted for other use cases where we need to perform a bulk-modify action on RDS instances.
What did I learn from this exercise?
The callback pattern is a powerful feature of Step Functions. I had spent some time adding wait states after the start/modify SDK API calls, but quickly realised that the wait time could not be predicted because it varied with the DB engine. That's when the callback pattern came to my rescue.
Please find the code in my GitHub repo:
https://github.com/neetu-mallan/retentionperiodreset
What Resources did I use?
To learn & understand Step Functions, I went through the AWS documentation:
https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
https://www.youtube.com/watch?v=jXxKRd_9nC0 -- this Step Functions Crash course by Manoj Fernando really helped me in understanding how step functions help in practical use cases.
The two links below helped me understand the callback pattern implementation:
https://docs.aws.amazon.com/step-functions/latest/dg/callback-task-sample-sqs.html
https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html
Top comments (3)
Thanks for this demo. I am learning AWS and this use case sounds very necessary.
One note. I think it is not a good idea to have a sleep function to wait for the instance to start. It can have some flaws. I think if you add a loop with a check of the instance state and a sleep inside it the waiting operation would be less error prone.
Sure Felipe!! Thanks for the comments will change the wait into a loop & re test!!
Detailed blog. Thanks for writing. Different way to managed retention period. 👏🏻👏🏻👏🏻