At some stage in the development of a high-availability application you will want to test what happens when an Availability Zone goes down in AWS.
Disabling an AZ
Blocking all network traffic to the AZ seems the best way to simulate this. The method I used was to swap the network ACL on every subnet in the AZ for a new one. A network ACL created with the AWS CLI denies all inbound and outbound traffic until rules are added, so a freshly created ACL blocks everything.
#!/bin/bash
# prereq
# - jq
# - aws-cli
AZ=eu-west-1c
# use the subnetId to get the NetworkAclAssociationId to create the new acl association
for SUBNETID in $(aws ec2 describe-subnets --region ${AZ%?}| jq ".Subnets[] | select(.AvailabilityZone==\"$AZ\")" | jq -r '.SubnetId')
do
aws ec2 describe-network-acls --region ${AZ%?}| jq -r ".[] | .[].Associations[] | select(.SubnetId==\"$SUBNETID\")" | jq -r '.NetworkAclAssociationId' >> NetworkAclAssociationId.tmp
# Need to take a backup of the original NetworkAclId's to be able to reverse the change
aws ec2 describe-network-acls --region ${AZ%?}| jq -r ".[] | .[].Associations[] | select(.SubnetId==\"$SUBNETID\")" | jq -r '.NetworkAclId' >> NetworkAclId-restore.tmp
done
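The `${AZ%?}` expansion used in every command strips the last character of the AZ name to get the region. A minimal sketch of that step on its own, wrapped in a hypothetical helper called `region_from_az` (the name is mine, not part of the script). It assumes a standard AZ name made of the region followed by a single zone letter:

```shell
# Hypothetical helper: derive the region from a standard AZ name
# by dropping the trailing zone letter.
region_from_az() {
    local az=$1
    # ${az%?} removes the shortest suffix matching "?", i.e. one character
    printf '%s\n' "${az%?}"
}

region_from_az eu-west-1c   # prints eu-west-1
```

If the AZ name does not follow that pattern, the region would have to be set explicitly instead.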
As I have multiple VPCs, I needed to create a separate ACL in each VPC.
# create the dummy ACLs and write their NetworkAclIds to a file; the loop runs once per subnet in the AZ, so each subnet gets its own dummy ACL in its VPC
for VPCID in $(aws ec2 describe-subnets --region ${AZ%?} | jq -r ".Subnets[] | select(.AvailabilityZone==\"$AZ\")" | jq -r '.VpcId')
do
aws ec2 create-network-acl --vpc-id $VPCID --region ${AZ%?} | jq -r '.NetworkAcl.NetworkAclId' >> NetworkAclId.tmp
done
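At this point NetworkAclAssociationId.tmp, NetworkAclId-restore.tmp and NetworkAclId.tmp should each hold one line per subnet in the AZ, because the association step pairs the files by line number. A quick check could look like the sketch below; `same_line_count` is a hypothetical helper, not something the original script contains:

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    # "wc -l < file" prints only the count, without the file name
    [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}

# Possible use before changing any association:
# same_line_count NetworkAclAssociationId.tmp NetworkAclId.tmp || { echo "line counts differ" >&2; exit 1; }
```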
I then created a function that takes the lists of NetworkAclAssociationId and NetworkAclId values and changes each ACL association.
# Function ChangeAcl takes two arguments for disable or enable
# $1 should be NetworkAclAssociationId filename
# $2 should be NetworkAclId filename
function ChangeAcl() {
# read the first file line by line; a counter picks the matching line of the second file
count=1
while read -r NetworkAclAssociationId
do
NetworkAclId=$(sed -n "${count}p" < "$2")
echo "$NetworkAclId"
echo "$NetworkAclAssociationId"
aws ec2 replace-network-acl-association --region "${AZ%?}" --association-id "$NetworkAclAssociationId" --network-acl-id "$NetworkAclId"
((count=count+1))
done < "$1"
}
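The counter with `sed -n` re-reads the second file on every iteration. An alternative sketch, not the method used above, reads both files in lockstep on separate file descriptors. `pair_lines` is a hypothetical name, and it only prints each pair instead of calling the AWS CLI:

```shell
# Hypothetical helper: print line N of file $1 next to line N of file $2.
# The loop stops at the end of the shorter file.
pair_lines() {
    while IFS= read -r assoc_id <&3 && IFS= read -r acl_id <&4
    do
        printf '%s %s\n' "$assoc_id" "$acl_id"
    done 3< "$1" 4< "$2"
}
```

In `ChangeAcl`, the `printf` line would be replaced by the `aws ec2 replace-network-acl-association` call with the two variables as arguments.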
# Call the function to create new disable ACL association
ChangeAcl NetworkAclAssociationId.tmp NetworkAclId.tmp
At this point I have disabled all traffic to a particular AZ, and I can now check whether resources are redistributed as expected with no downtime.
Re-enabling the AZ
Re-enabling takes a few extra steps, because replacing an association gives each subnet a new NetworkAclAssociationId.
# Get the new networkAclAssociationId for the subnets
for SUBNETID in $(aws ec2 describe-subnets --region ${AZ%?} | jq ".Subnets[] | select(.AvailabilityZone==\"$AZ\")" | jq -r '.SubnetId')
do
aws ec2 describe-network-acls --region ${AZ%?} | jq -r ".[] | .[].Associations[] | select(.SubnetId==\"$SUBNETID\")" | jq -r '.NetworkAclAssociationId' >> NetworkAclAssociationId-restore.tmp
done
# Restore the subnets to the original ACL's
ChangeAcl NetworkAclAssociationId-restore.tmp NetworkAclId-restore.tmp
# delete the dummy ACL's
cat NetworkAclId.tmp | while read deleteNetworkAclId
do
aws ec2 delete-network-acl --network-acl-id $deleteNetworkAclId --region ${AZ%?}
done
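The script appends to its temporary files with `>>`, so files left over from an earlier run would be mixed with the IDs from the next one. A small cleanup step, sketched here with a hypothetical `cleanup_tmp_files` function, removes the four files the script writes:

```shell
# Hypothetical helper: remove the temporary files written by the script.
cleanup_tmp_files() {
    # -f: do not fail if a file is already gone
    rm -f NetworkAclAssociationId.tmp NetworkAclId.tmp \
          NetworkAclId-restore.tmp NetworkAclAssociationId-restore.tmp
}
```

Calling it after the dummy ACLs are deleted, and again before a new run, keeps each test starting from empty files.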
That's it; all traffic should now be restored to its original configuration.