David Krohn for AWS Community Builders

Posted on May 29, 2023 • Originally published at globaldatanet.com

Enterprise-scaled Self-Healing StackSets

#aws #security #governance #devops

With more than 5 million articles from over 7,000 brands, OTTO is one of the leading German online shopping platforms. In the future, it will open up to even more brands and partners as part of its transformation. OTTO is part of the internationally active Otto Group, with headquarters in Hamburg, and employs 6,100 people throughout Germany. In the 2020/21 financial year, OTTO generated revenues of 4.5 billion euros.

At OTTO, we faced several challenges in operating AWS CloudFormation StackSets at scale. We must govern several hundred AWS accounts for our product teams while balancing the need for agility and control.

At this scale, operations can take a lot of time because of the many recurring tasks: AWS accounts leave the AWS Organization or teams nuke their AWS accounts; StackSet instances drift, because not all resources required for compliance can be protected (SCP limitations); existing AWS accounts join the AWS Organization and all mandatory StackSets need to be deployed to them; and manual steps should be kept to a minimum. Furthermore, the service itself offers no feature that gives an overview of drifted instances or of the overall health and compliance of your StackSets.

The cloud competence center at OTTO IT, also known as the Governance at Scale (GAS) team, developed a self-healing solution for StackSets that is integrated into the OTTO tooling ecosystem with Confluence and Microsoft Teams.

OTTO worked with globaldatanet to set up its Landing Zone. globaldatanet is an award-winning AWS Advanced Consulting Partner and longtime Cloud Solution Provider for OTTO, supporting the team in cloud security and GAS. With its focus on building cloud-native serverless solutions, globaldatanet has helped over 100 companies within 5 years to develop and innovate products and services in the cloud.

In this post, we demonstrate how to implement fully automated, enterprise-scaled self-healing for StackSets using AWS Step Functions, and how to create a dashboard that gives an overview of your StackSet health and compliance and reduces operational time.

The solution workflow includes the following steps:

The tagging concept for StackSets
Automatically create StackSets configuration in SSM Parameter Store
Implementing StepFunction for StackSet Self-Healing

Let’s see how this works.

Prerequisites

The following prerequisites are necessary for following along with the contents of this post:

Two existing AWS accounts
A few AWS CloudFormation StackSets

Solution overview

The following architecture diagram shows the complete Self-Healing StackSets solution.

Architecture of fully-automated Self Healing Solution with integration to Confluence.

Tagging concept for StackSets

The solution requires a JSON file in the AWS Systems Manager Parameter Store. The easiest way to create it is automatically, based on the StackSet configurations and the tags assigned to them. We go into more detail in the section "Automatically create StackSets configuration Parameter Store". Below, we describe which tags we introduced for our StackSets and what we need them for.

⚠️ AWS tag values do not allow commas, so we use ":" as the divider for lists.

Key	Value	Result	Example
antidependson	StackSet Name	Marks StackSets that collide with each other.	MYSTACKSET
dependson	[List of StackSet Names]	List of StackSets that need to be rolled out before this StackSet is deployed (e.g. enable Config before activating Config rules). NOTE: Please use only one dependson StackSet for now and form "chains" for multiple dependencies.	MY-STACKSET1:MYSTACKSET2
mandatory	true or false	The StackSet instances must be present in all AWS accounts.	true
selfhealing	true or false	The StackSet can be healed via delete and redeploy (exceptions are e.g. IDP roles). Parameter overrides are cached.	true
region	[Regions]	List of Regions in which the StackSet instances are to be deployed.	eu-west-1:eu-central-1:us-east-1
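
To illustrate the tagging concept, here is a minimal sketch of how the tag values from the table could be parsed into one configuration entry. The function name, the output keys, and the defaults for missing tags (empty list, false) are assumptions made for illustration, not the format the actual solution stores in the Parameter Store.

```python
from typing import Dict, List


def parse_stackset_tags(tags: Dict[str, str]) -> Dict[str, object]:
    """Turn raw StackSet tag values into a typed configuration entry (hypothetical format)."""

    def as_list(value: str) -> List[str]:
        # ":" is the list divider because tag values cannot contain commas.
        return [item.strip() for item in value.split(":") if item.strip()]

    def as_bool(value: str) -> bool:
        return value.strip().lower() == "true"

    return {
        "dependson": as_list(tags.get("dependson", "")),
        "antidependson": as_list(tags.get("antidependson", "")),
        # Assumption: a missing boolean tag is treated as false.
        "mandatory": as_bool(tags.get("mandatory", "false")),
        "selfhealing": as_bool(tags.get("selfhealing", "false")),
        "regions": as_list(tags.get("region", "")),
    }


# Example values taken from the table above.
example = parse_stackset_tags({
    "dependson": "MY-STACKSET1:MYSTACKSET2",
    "mandatory": "true",
    "selfhealing": "true",
    "region": "eu-west-1:eu-central-1:us-east-1",
})
```

Keeping the parsing in one small function means the ":" convention only has to be known in a single place.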

Automatically create StackSets configuration Parameter Store

Generating the StackSet configuration automatically as JSON in the Parameter Store serves several purposes:

Removing the chore of maintaining a JSON document manually
Ensuring the account vending machine knows what to deploy and in which order
Telling the self-healing Step Function what setup is expected in the member accounts

The Lambda function responsible for this task is invoked via an Events rule:

Every time a StackSet operation finishes with the status "succeeded".

This works because the tags on a StackSet are part of the StackSet itself, not additional items describing it. A change to the tags therefore always results in a StackSet update operation.

From a computer science perspective, the Lambda function is quite interesting: the primary problem was to build an unweighted dependency tree from the "dependson" and "antidependson" tags and then compile it into an ordered one-dimensional list, which is essentially a topological sort.
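
The ordering step can be sketched with Python's standard library. This is not the code of the Lambda function described above, only a minimal illustration of ordering StackSets by their "dependson" tags. The StackSet names are hypothetical, and the handling of "antidependson" is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def deployment_order(dependson: Dict[str, List[str]]) -> List[str]:
    """Return StackSet names so that every dependency precedes its dependents."""
    sorter = TopologicalSorter()
    for stackset in sorted(dependson):
        # Register the StackSet together with the StackSets it depends on.
        sorter.add(stackset, *sorted(dependson[stackset]))
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(sorter.static_order())


# Hypothetical StackSet names, following the "enable Config before
# activating Config rules" example from the tag table.
config = {
    "config-rules": ["enable-config"],
    "enable-config": [],
    "idp-roles": [],
}
order = deployment_order(config)
```

In the resulting list, "enable-config" always comes before "config-rules". The position of independent StackSets such as "idp-roles" is not significant.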

Implementing StepFunction for StackSet Self-Healing

AWS Step Functions is a cloud service that enables you to coordinate the components of distributed applications and microservices using visual workflows. It allows you to build and automate the execution of complex processes and tasks across multiple AWS services, using a visual interface to define and execute your workflows. Since the self-healing solution needs a complex workflow, we decided to use Step Functions for this use case. In the following, we explain the self-healing workflow.

StepFunction Workflow

Functionality

ƛ Serverless Functions

StackSetInitCleanupLambda: Searches for StackSet instances that belong to AWS accounts which are either no longer part of the AWS Organization or suspended, and deletes these instances from all associated StackSets.
MandatoryStackSetDeploymentLambda: Searches for missing instances of StackSets that are tagged with mandatory = true and deploys them.
StackSetDriftDetectionLambda: Triggers drift detection on all StackSets.
TriggerDriftStatusLambda: Checks whether drift detection has completed on all StackSets.
SearchStackSetInstanceToHealLambda: Searches for drifted StackSet instances in StackSets that are tagged with selfhealing = true.
StackSetCleanupLambda: Removes unhealthy StackSet instances and redeploys them. Parameter overrides are cached so that the newly deployed instance has the same settings as before.
StatusPrepareHTMLLambda: Prepares the HTML dashboard for Confluence and a JSON log file of the current StackSet health state.
TeamsNotificationLambda: Sends a Teams notification with a summary to the GAS team after each execution.
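
The core decisions of the first two functions in the list above can be expressed as set operations. The following is a minimal sketch under the assumption that the lists of existing stack instances and active accounts have already been fetched. The function names, the data shapes, and the account IDs are hypothetical and are not taken from the actual Lambda code.

```python
from typing import Iterable, Set, Tuple

Instance = Tuple[str, str]  # (account_id, region)


def orphaned_instances(instances: Set[Instance], active_accounts: Set[str]) -> Set[Instance]:
    """Instances whose account has left the Organization or is suspended."""
    return {(account, region) for account, region in instances
            if account not in active_accounts}


def missing_instances(instances: Set[Instance], active_accounts: Set[str],
                      regions: Iterable[str]) -> Set[Instance]:
    """Instances a mandatory StackSet should have but does not have yet."""
    expected = {(account, region) for account in active_accounts for region in regions}
    return expected - instances


# Hypothetical data for one mandatory StackSet.
existing = {
    ("111111111111", "eu-west-1"),
    ("222222222222", "eu-west-1"),
    ("999999999999", "eu-west-1"),  # account no longer in the Organization
}
active = {"111111111111", "222222222222", "333333333333"}

to_delete = orphaned_instances(existing, active)
to_deploy = missing_instances(existing, active, ["eu-west-1"])
```

Here `to_delete` contains the instance of account 999999999999, and `to_deploy` contains the eu-west-1 instance for the newly joined account 333333333333.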

？！Decisions

InitCleanup Complete: Checks whether all unnecessary instances have been removed. If not, the Step Function triggers the StackSetInitCleanupLambda function again.
MandatoryStackSetDeployment Complete: Checks whether all mandatory instances have been deployed. If not, the Step Function triggers the MandatoryStackSetDeploymentLambda function again.
StackSetDriftDetection Complete: Waits until drift detection has finished on all StackSets.
Healing Complete: Checks whether all unhealthy instances have been healed. If not, the Step Function invokes the StackSetCleanupLambda function again.
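
Such a decision maps naturally to a Choice state in the Amazon States Language. The following sketch builds a small state machine definition for the drift detection loop as a Python dictionary. The state names follow the lists above, but the `$.driftDetectionComplete` field, the 60 second wait, the placeholder ARNs, and the Pass state standing in for the rest of the workflow are assumptions for illustration only.

```python
import json


def lambda_task(function_name: str, next_state: str) -> dict:
    """Task state that invokes a Lambda function and moves on to next_state."""
    return {
        "Type": "Task",
        # Placeholder ARN: REGION and ACCOUNT_ID must be replaced.
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        "Next": next_state,
    }


definition = {
    "StartAt": "StackSetDriftDetectionLambda",
    "States": {
        "StackSetDriftDetectionLambda": lambda_task(
            "StackSetDriftDetectionLambda", "TriggerDriftStatusLambda"),
        "TriggerDriftStatusLambda": lambda_task(
            "TriggerDriftStatusLambda", "StackSetDriftDetection Complete"),
        "StackSetDriftDetection Complete": {
            "Type": "Choice",
            "Choices": [{
                # Assumed output field of TriggerDriftStatusLambda.
                "Variable": "$.driftDetectionComplete",
                "BooleanEquals": True,
                "Next": "SearchStackSetInstanceToHealLambda",
            }],
            # Not finished yet: wait, then check the status again.
            "Default": "WaitForDriftDetection",
        },
        "WaitForDriftDetection": {
            "Type": "Wait",
            "Seconds": 60,
            "Next": "TriggerDriftStatusLambda",
        },
        # Stand-in for the remaining workflow.
        "SearchStackSetInstanceToHealLambda": {"Type": "Pass", "End": True},
    },
}

asl_json = json.dumps(definition, indent=2)
```

The other decisions follow the same shape: a Lambda task reports whether its work is complete, and the Choice state either loops back or continues.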

Limitations

While developing the solution we faced several limitations. Here are our findings and solutions for that.

🚨 StackSets instance operations: The maximum number of stack instances, across all stack sets, that you can run operations on at the same time in each Region, per administrator account, is limited to 10,000.

✅ We implemented a counter for the StackSet operations that are currently in progress. In addition, we catch the exception from CloudFormation and wait a few seconds before trying the operation again.
🚨 Parameter overrides caching: Whenever you remove a drifted StackSet instance that has parameter overrides, you lose the individual parameters of that instance.

✅ Before deleting the drifted StackSet instance, we cache the parameter overrides. After the deletion has succeeded, we deploy the StackSet instance again with the cached parameter overrides.
🚨 AWS Step Functions payload size: AWS Step Functions supports payload sizes up to 256 KB. Our solution needs to pass larger payloads between states, especially the log and the parameter overrides per StackSet.

✅ We store the state in an S3 bucket and pass it between the states from there. At the end of the execution we delete the state from S3, so that it cannot influence the next Step Function execution with stale data.
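
The "catch the exception and try again" approach from the first limitation can be sketched as a small retry helper. This is an illustration, not the code of the actual solution: the exception class is a stand-in for the error raised by the real API client, and the exponential backoff with the chosen delays is an assumption (the text above only says that the operation is retried after a few seconds).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class OperationLimitExceeded(Exception):
    """Stand-in for the limit error raised by the real API client."""


def call_with_retry(operation: Callable[[], T], max_attempts: int = 5,
                    base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying with exponential backoff on limit errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OperationLimitExceeded:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Wait 2 s, 4 s, 8 s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so that the waiting can be replaced in tests. Inside a Step Function, the same effect can also be achieved with the Retry field of a Task state instead of waiting inside the Lambda function.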

Documentation

After each execution of the StackSet Health StepFunction, we aim to notify our GAS team about the actions taken during the previous run. Therefore, we have implemented a Teams notification that includes a status update, a link to the generated dashboard, and a link to the log file.

The following screenshot illustrates an example of a Teams notification. It provides a summary report and directs you to the dashboard and log file for further details.

Dashboard

Our StackSet Health Dashboard is a simple HTML file that is generated by a Lambda function, saved in S3, and distributed through CloudFront. You can integrate this dashboard into your Confluence or any other internal wiki. The dashboard is secured via a CloudFront Function. Additionally, you can add a firewall to restrict access to a specific CIDR range or geographic region and prevent access from third parties. The screenshot below provides an example of the overall StackSet health status information for an entire AWS Organization.

Conclusion

In this post, we demonstrated a solution to automatically heal AWS CloudFormation StackSets at scale. By implementing this solution, we reduced the manual effort for StackSet cleanup operations by 4 hours per week, improved the overall reliability of our StackSets, increased compliance across the organization, and gained a daily updated overview of all StackSet instances through the dashboard. In summary, the self-healing CloudFormation StackSets solution combines automation, monitoring, and self-recovery capabilities to deliver a robust and resilient system for StackSets.

Top comments (3)

Andreas Bergström • May 29 '23

It would have been interesting to read if you evaluated AWS Config and what limitations it (including building on top of its events) had in your case.

David Krohn • May 30 '23 • Edited

We are using Config for compliance within accounts and build compliance dashboards from it, but not for StackSets.

Maybe we will write another blog about our Config Compliance Dashboards soon.

Andreas Bergström • May 30 '23

Ah! I thought it had support for something related to config drift and StackSets but can't find anything on it now, so I was probably talking nonsense!
