As a growing software company, we see unpredictable spikes in platform usage that can come from anywhere in the world. This volatility complicates instance scaling and other core dev infrastructure. If our instances don’t scale properly, our users get a less performant product. If our servers are too large, we waste money.
Hooking into Datadog gave our company newfound and much-needed insights into our AWS infrastructure. Moreover, the alerting functionality built into Datadog’s event monitoring unlocked the ability for us to monitor, investigate, and ultimately resolve issues with our infrastructure. However, although Datadog alerted us to issues, we still had to manually manage our AWS infra. Our goal was to build a platform that could remediate infra challenges in an automated way by leveraging Datadog alerts.
I alluded to this use case in the intro paragraph. With Datadog alerting and WayScript, users can set up alerts for high or low usage of a particular instance. To do this, users set up a Datadog trigger on WayScript.
For example, if we get an alert that an instance is under high traffic load, we can automatically run a script that starts an additional EC2 instance with this type of code:
import boto3
from botocore.exceptions import ClientError

def turn_instance_on(instance_id):
    # build_client() and check_instance_state() are helpers defined elsewhere
    ec2 = build_client()
    current_state = check_instance_state(instance_id)
    if current_state == 'not running':
        try:
            response = ec2.start_instances(InstanceIds=[instance_id], DryRun=False)
            # start_instances returns a 'StartingInstances' list, one entry per instance
            new_state = response['StartingInstances'][0]['CurrentState']['Name']
            return 'Success'
        except ClientError as error:
            return error
    else:
        return 'Instance Already Running'
The same logic can be used to turn instances off when traffic is low.
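A minimal sketch of that reverse path is below. It is an illustration under assumptions, not our production code: the function name `turn_instance_off` is hypothetical, and the EC2 client is passed in as an argument so the logic can be exercised without AWS credentials. The `describe_instances` and `stop_instances` calls and their response shapes are the standard boto3 EC2 client ones.

```python
def turn_instance_off(ec2, instance_id):
    """Stop an EC2 instance if it is currently running.

    `ec2` is assumed to be a boto3 EC2 client (or a stand-in with the
    same describe_instances / stop_instances methods).
    """
    described = ec2.describe_instances(InstanceIds=[instance_id])
    state = described['Reservations'][0]['Instances'][0]['State']['Name']
    if state != 'running':
        # Nothing to do for stopped, stopping, or terminated instances
        return 'Instance Not Running'
    response = ec2.stop_instances(InstanceIds=[instance_id], DryRun=False)
    # stop_instances returns a 'StoppingInstances' list, one entry per instance
    return response['StoppingInstances'][0]['CurrentState']['Name']
```

In practice this would be called with something like `turn_instance_off(boto3.client('ec2'), instance_id)` from the script attached to the low-traffic alert.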
When building a remediation tool, we found ourselves needing automated tasks mixed with ‘human-in-the-loop’ interactions. Specifically, we wanted to design a program that would roll our production server back to the previous version based on a Datadog alert. Once the alert hit our backend, we wanted to send a text message requesting approval from our lead backend development team. Once approved by a text message response, the rollback kicks off via CircleCI.
So how does this work? First, we set up an event alert on Datadog that is linked to our Rollbar incident reporting for deployments. If this incident is marked as a bad_deploy, our trigger fires. Next, a Python script interprets the event and determines if a rollback is necessary:
event = variables['Event']
title = event.get('Title') or ''  # treat a missing title as an empty string
# Flag the deployment as bad only when the title carries the bad_deploy marker
variables['status'] = 'bad' if 'bad_deploy' in title else 'good'
If a rollback is necessary, we use the Twilio API to send a text message to our backend dev team. If a dev responds with ‘approve’, CircleCI rolls production back to the previous working version.
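The approval step can be sketched as a small gate function. This is only an illustration: `handle_reply` and the `trigger_rollback` callable are hypothetical names, and how the SMS reply reaches the script (and how the CircleCI job is started) is left out because it depends on the setup.

```python
def handle_reply(reply_text, trigger_rollback):
    """Start the rollback only when the SMS reply is an explicit approval.

    `trigger_rollback` is a hypothetical zero-argument callable that wraps
    whatever starts the rollback job (for us, a CircleCI pipeline).
    """
    normalized = reply_text.strip().lower()
    if normalized == 'approve':
        trigger_rollback()
        return 'rollback started'
    # Any other reply (or a typo) leaves production untouched
    return 'rollback skipped'
```

Requiring an exact match on ‘approve’ is a deliberate choice in this sketch: an ambiguous reply should never roll production back.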
As we continue to scale, we have run into unanticipated database issues such as deadlocks from long-running queries. This type of event can significantly degrade performance for our user base. We therefore wanted a process that logs the deadlocked process and then kills the query automatically (we determined this is better than user-wide degradation).
To do this, we set up Datadog alerts for high CPU usage (degraded status) or high memory usage on our database. When this alert hits WayScript, it kicks off a couple of processes.
First, we use Python and SQL to grab all currently running queries on our db (RDS on AWS). The first process builds a Pandas DataFrame of the running queries’ information, stores it in a file, and emails the file to our backend dev team. The second process looks for queries that have exceeded an expected time threshold. These queries are passed to a third process, which kills them based on their RDS ID.
Example of Pulling the Running Queries:
import boto3
from botocore.exceptions import ClientError

def build_client():
    # execute_statement belongs to the RDS Data API, exposed by the 'rds-data' client
    rds = boto3.client(
        'rds-data',
        region_name='us-east-2',
        aws_access_key_id=context['key_id'],  # stored in .secrets
        aws_secret_access_key=context['key_secret']  # stored in .secrets
    )
    return rds

rds = build_client()
# The 'string' values below are placeholders to fill in for your own database
response = rds.execute_statement(
    continueAfterTimeout=False,
    database='database-1',
    includeResultMetadata=False,
    resourceArn='aws:rds:us-east-<DB_ID>',
    schema='string',
    secretArn='string',
    sql='string',
    transactionId='string'
)
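The threshold check and the kill step described earlier might look roughly like the sketch below. It makes several assumptions that are not in our original scripts: the running queries are represented as plain dicts with `id` and `time` (seconds) keys rather than a Pandas DataFrame, the function names and the 300-second threshold are hypothetical, and the database is assumed to be MySQL on RDS, where the `mysql.rds_kill_query` stored procedure ends a query by its ID. For another engine the kill statement would differ.

```python
def find_long_running(queries, max_seconds):
    """Return the query rows whose runtime exceeds the threshold."""
    return [q for q in queries if q['time'] > max_seconds]

def build_kill_statements(queries):
    """Build one kill statement per query (assumes MySQL on RDS)."""
    # int() guards against anything other than a numeric ID reaching the SQL
    return [f"CALL mysql.rds_kill_query({int(q['id'])});" for q in queries]

running = [
    {'id': 101, 'time': 12, 'info': 'SELECT 1'},
    {'id': 102, 'time': 950, 'info': 'UPDATE accounts SET ...'},
]
too_slow = find_long_running(running, max_seconds=300)
print(build_kill_statements(too_slow))
# prints ['CALL mysql.rds_kill_query(102);']
```

Each generated statement could then be passed as the `sql` argument of an `execute_statement` call like the one above.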