Amazon Elastic Container Service (ECS) is a highly scalable, high-performance container management service that supports Docker containers and lets you easily run applications on a managed cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.
We recently decided to move our deployments to ECS. As Uncle Ben said, "With great power comes great responsibility," so setting up monitoring and alerts for this new implementation was quite interesting. This blog post dives deep into one specific alarm that we configured: how do you get alerted when an ECS task fails during a deployment?
We already had a Microsoft Teams webhook in place, and alerts had to be sent to it whenever a failure or error occurred. In your case, this alert destination can be any other application.
We used Amazon EventBridge rules to raise an alert whenever an ECS task fails. This gives us the opportunity to check what went wrong. Below is a high-level block diagram of the implementation.
And yes, I know what your eyes are looking for: the event rule pattern. Here is the pattern we used.
{
  "detail": {
    "group": ["service:<<name of your service>>"],
    "lastStatus": ["STOPPED"],
    "stoppedReason": [{
      "anything-but": {
        "prefix": "Scaling activity initiated by (deployment"
      }
    }]
  },
  "detail-type": ["ECS Task State Change"],
  "source": ["aws.ecs"]
}
gist : https://gist.github.com/akhil-ghatiki/1251a54da8eaaca3c20f5322f5106319.js
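If you want to try the pattern against events before creating the rule, here is a minimal Python sketch. It is not how EventBridge itself is implemented: `build_pattern`, `matches` and the service name `my-service` are hypothetical names for illustration, and the matcher only covers the two constructs the pattern above uses (lists of exact values and `anything-but` with a `prefix`).

```python
def build_pattern(service_name: str) -> dict:
    """Build the rule pattern from this post for one ECS service."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "group": [f"service:{service_name}"],
            "lastStatus": ["STOPPED"],
            "stoppedReason": [
                {"anything-but": {"prefix": "Scaling activity initiated by (deployment"}}
            ],
        },
    }


def _value_matches(candidate, value) -> bool:
    """Check one event value against one entry of a pattern list."""
    if isinstance(candidate, dict) and "anything-but" in candidate:
        rule = candidate["anything-but"]
        if isinstance(rule, dict) and "prefix" in rule:
            # Matches any string that does NOT start with the given prefix.
            return isinstance(value, str) and not value.startswith(rule["prefix"])
        return False  # other anything-but forms are outside this sketch
    return candidate == value


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local check: every key in the pattern must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object in the pattern: recurse into the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_value_matches(c, actual) for c in expected):
            return False
    return True


def sample_event(reason: str, status: str = "STOPPED") -> dict:
    """Trimmed-down event with only the fields the pattern looks at."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Task State Change",
        "detail": {
            "group": "service:my-service",
            "lastStatus": status,
            "stoppedReason": reason,
        },
    }


pattern = build_pattern("my-service")
failed = matches(pattern, sample_event("CannotPullContainerError: not found"))
rollout = matches(pattern, sample_event("Scaling activity initiated by (deployment ecs-svc/123)"))
```

With these inputs, `failed` should come out true and `rollout` false, which is the behaviour we want: alert on real failures, stay quiet when ECS stops tasks as part of a deployment.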
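You can use any pattern for your rule. The pattern above filters on the group key, whose value is the service name, keeps only tasks whose lastStatus is STOPPED, and skips events whose stoppedReason starts with "Scaling activity initiated by (deployment", that is, tasks that ECS stops itself as part of a deployment. So any other stopped-task event for that service is sent to SNS. Below is a sample event that can help you with other attributes you might need for your rule pattern.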
{
  "version": "0",
  "id": "XXXXXXXXXXXXX",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "XXXXXXXXX",
  "time": "2022-10-10T10:06:29Z",
  "region": "XXXXXXXX",
  "resources": [
    "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:task/XXXXXXXXXX/XXXXXX"
  ],
  "detail": {
    "attachments": [
      {
        "id": "XXXXXXXXXXXXX",
        "type": "sdi",
        "status": "DELETED",
        "details": []
      },
      {
        "id": "XXXXXXXXXXXXX",
        "type": "eni",
        "status": "DELETED",
        "details": [
          {
            "name": "subnetId",
            "value": "XXXXXXXX"
          },
          {
            "name": "networkInterfaceId",
            "value": "XXXXXXXX"
          },
          {
            "name": "macAddress",
            "value": "XXXXXXXX"
          },
          {
            "name": "privateDnsName",
            "value": "XXXXXXXX"
          },
          {
            "name": "privateIPv4Address",
            "value": "XXXXXXXX"
          }
        ]
      },
      {
        "id": "XXXXXXXXXXXXX",
        "type": "elb",
        "status": "DELETED",
        "details": []
      }
    ],
    "attributes": [
      {
        "name": "ecs.cpu-architecture",
        "value": "XXXXXXXX"
      }
    ],
    "availabilityZone": "XXXXXXXXa",
    "clusterArn": "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:cluster/XXXXXXXXXX",
    "connectivity": "CONNECTED",
    "connectivityAt": "2022-10-10T10:05:38.02Z",
    "containers": [
      {
        "containerArn": "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:container/XXXXXXXXXX/XXXXXX/XXXXXXXXX",
        "lastStatus": "STOPPED",
        "name": "XXXXXXX",
        "image": "XXXXXXXX.dkr.ecr.XXXXXXXX.amazonaws.com/XXXXXXXX/XXXXXXX:XXXXXX",
        "runtimeId": "XXXXXX-1979092248",
        "taskArn": "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:task/XXXXXXXXXX/XXXXXX",
        "networkInterfaces": [
          {
            "attachmentId": "XXXXXXXXX",
            "privateIpv4Address": "XXXXXXX"
          }
        ],
        "cpu": "0"
      }
    ],
    "cpu": "512",
    "createdAt": "2022-10-10T10:05:34.301Z",
    "desiredStatus": "STOPPED",
    "enableExecuteCommand": false,
    "ephemeralStorage": {
      "sizeInGiB": 20
    },
    "executionStoppedAt": "2022-10-10T10:05:46.497Z",
    "group": "service:XXXXXXX",
    "launchType": "FARGATE",
    "lastStatus": "STOPPED",
    "memory": "1024",
    "overrides": {
      "containerOverrides": [
        {
          "name": "XXXXXXX"
        }
      ]
    },
    "platformVersion": "1.4.0",
    "startedBy": "ecs-svc/XXXXXX",
    "stoppingAt": "2022-10-10T10:05:56.53Z",
    "stoppedAt": "2022-10-10T10:06:29.152Z",
    "stoppedReason": "CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve ref \"XXXXXXXX.dkr.ecr.XXXXXXXX.amazonaws.com/XXXXXXXX/XXXXXXX:XXXXXX\": XXXXXXXX.dkr.ecr.XXXXXXXX.amazonaws.com/XXXXXXXX/XXXXXXX:XXXXXX: not found",
    "stopCode": "TaskFailedToStart",
    "taskArn": "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:task/XXXXXXXXXX/XXXXXX",
    "taskDefinitionArn": "arn:aws:ecs:XXXXXXXX:XXXXXXXXX:task-definition/XXXXXXX:XXX",
    "updatedAt": "2022-10-10T10:06:29.152Z",
    "version": 4
  }
}
gist : https://gist.github.com/akhil-ghatiki/e2654a551d6989ed0cb652318357f20b.js
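To give an idea of how an event like the one above could end up as a readable alert, here is a rough Python sketch that pulls out the interesting fields and posts them to a webhook. This is only one possible approach and not necessarily what our setup looks like: `build_teams_message` and `post_to_teams` are hypothetical helpers, and the sketch assumes your webhook accepts a simple JSON body with a `text` field, so check what your own webhook expects.

```python
import json
import urllib.request


def build_teams_message(event: dict) -> dict:
    """Turn an ECS Task State Change event into a small text payload."""
    detail = event.get("detail", {})
    lines = [
        "ECS task stopped",
        f"Service: {detail.get('group', 'unknown')}",
        f"Task: {detail.get('taskArn', 'unknown')}",
        f"Stop code: {detail.get('stopCode', 'unknown')}",
        f"Reason: {detail.get('stoppedReason', 'unknown')}",
        f"Time: {event.get('time', 'unknown')}",
    ]
    # Assumption: the webhook renders a plain {"text": ...} body.
    return {"text": "\n\n".join(lines)}


def post_to_teams(webhook_url: str, message: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Trimmed-down version of the sample event, with placeholder values.
example_event = {
    "time": "2022-10-10T10:06:29Z",
    "detail": {
        "group": "service:my-service",
        "taskArn": "arn:aws:ecs:region:account:task/cluster/task-id",
        "stopCode": "TaskFailedToStart",
        "stoppedReason": "CannotPullContainerError: image not found",
    },
}
message = build_teams_message(example_event)
```

Calling `post_to_teams(your_webhook_url, message)` would then send the alert. Missing fields fall back to "unknown", so the message still goes out even if an event is shaped a little differently.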
Godspeed!