Warren Parad for Authress Engineering Blog


AWS CloudWatch: How to scale your logging infrastructure

An obvious story you might tell yourself is Logging is easy. Writing to the console or printing out debug messages may seem easy, and when running a service locally it usually is. But as soon as you cross the magical barrier that is the cloud, for some reason it gets really complicated.

So complicated that so, so, so many companies think they can compete on delivering this exact solution. But this isn’t a post about which of those to use, nor is it a marketing ploy for a specific provider. (I’ve used a lot of them, and for some reason they are all terrible. The only thing more terrible than using a SaaS provider for logging was running an open source stack, with ELK being the worst logging infrastructure ever created. Your logging infra should cost at most 10% of your spend and next to 0% of your development time. Yet with any provider it’s more like 50%.)

For something that usually costs around 30% of your total cloud spend, you would expect to get something useful out of logging. And you do, logging is critical for the sustainability of your service and your business. At Rhosys, we frequently need to know not only if our services are working, but how effectively they are working. Dashboards that monitor call counts and latencies are worthless to a business; we need to know exactly what the business-relevant logs look like. Like most security-conscious companies (non-security-conscious companies probably want to ignore what I say next, otherwise it will feel like a bit of holy water burning your internal devil), we have multiple AWS accounts, each with a dedicated purpose and assigned to only one team, and only that one team has access to that specific AWS account. You don’t share accounts.

The Setup

How we set them up is less important, what is important is that each product gets its own AWS account. It just makes sense, and it’s required when a different team owns each one. Since Rhosys has three core products (at the time of writing), we have something like 40 AWS accounts (because AWS of course):

  • 1 AWS account to run Authress
  • 1 AWS account to run Standup & Prosper
  • 1 AWS account to run Modulemancer
  • 1 AWS account for open source and a bunch of our partnerships with AWS
  • 1 account per developer
  • and then tons more because why not

This isn’t a story about security though, it’s about maintenance, and since each of our products is in a separate account, there are some complexities with actually figuring out the core problem of How is our service doing.

Because we are using AWS and lots of serverless technologies, we make heavy use of CloudWatch Logs. CW Logs is great. It’s better than every other SaaS logging tool out there, and it’s fantastic for monitoring as well. (But it’s terrible at alerting.) At this point we still don’t have a great solution for “report this problem to the dev team”, and that’s because CW Logs doesn’t offer a way to send an email or trigger an alert that actually includes what is wrong. That’s because the monitoring solution aggregates data instead of annotating and indexing it, so you’ll need SNS + CW Insights to help you.
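To give a sense of the CW Insights half of that, here’s a minimal sketch of pulling recent errors out of a log group programmatically. The log group, time window, and query string are illustrative; in practice you would tailor the query to your own log format:

const { CloudWatchLogsClient, StartQueryCommand, GetQueryResultsCommand } = require('@aws-sdk/client-cloudwatch-logs');

const logsClient = new CloudWatchLogsClient({});

async function findRecentErrors(logGroupName) {
  // Kick off an Insights query over the last hour of logs
  const { queryId } = await logsClient.send(new StartQueryCommand({
    logGroupName,
    startTime: Math.floor(Date.now() / 1000) - 3600,
    endTime: Math.floor(Date.now() / 1000),
    queryString: 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'
  }));

  // Insights queries are asynchronous, so poll until the results are ready
  let response;
  do {
    await new Promise(resolve => setTimeout(resolve, 1000));
    response = await logsClient.send(new GetQueryResultsCommand({ queryId }));
  } while (response.status === 'Running' || response.status === 'Scheduled');

  return response.results;
}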

The Logs

So back to focusing on our three product accounts. For the most part, and I’m glossing over some finer details, we log directly to CloudWatch Logs, and it’s great. What isn’t great is when you want to see all the logs in one place (which is usually wrong, because different teams can have different solutions). But you might want to see all the alerts, business problems, and critical issues in a digestible format. This doesn’t have to be one place. It’s sufficient to have one CW dashboard per account and easily switch between them.

Another solution is multiple instances of log collection. That is, deploy log aggregation services to every AWS account. The problem is that means we would need to run a worker in every account to handle logging. That’s wrong. Having to deploy an agent for every service, for every region, or even for every AWS account is bad architectural design, and it doesn’t scale. This has to be automated, and it has to require near zero burden on accounts that opt in.

Like the good microservice architects we are, we funnel the relevant business-related logs to a secured logging account. As for how and what we log, there’s a separate article where I speak in depth about our expectations around logging, their purpose, and how to get the most value out of them.

The TL;DR of that article is that we have log statements that look like:

logger.log({
  title: '[Action Required] Failed to automatically handle plan upgrade, review and determine why it failed and how to more gracefully improve this problem in the future.',
  level: 'ERROR',
  details: { 
    accountId, error
  }
});

This gets converted by CloudWatch Logs into a base64 mess that we need a complex handler to disentangle. This is the meat of our log aggregator:

(Note: the awslogsData actually comes in as a list from CW)

for (let logEvent of awslogsData.logEvents) {
  let parsedLogEvent = {
    logStream: awslogsData.logStream,
    logGroup: awslogsData.logGroup,
    region: config.region,
    requestId: logEvent.extractedFields.request_id,
    extractedTimeStamp: logEvent.extractedFields.timestamp
  };

  let event = logEvent.extractedFields.event;

  // Handle timeouts explicitly
  if (event && event.match('Task timed out after')) {
    parsedLogEvent.data = { title: `${event.trim()} (RequestId: ${parsedLogEvent.requestId})`, level: 'ERROR' };

  // Handle everything else
  } else {

    // We want to pull out the JSON object from our logs
    const eventMatcher = event && event.match(/^(INFO|TRACE|ERROR|WARN)\s+(?:[\w+\s]*\s+)?(\{.*\})\s*$/s);
    // Skip log lines that don't match our structured format at all
    if (!eventMatcher) {
      continue;
    }
    const fallbackLevel = eventMatcher[1] || 'INFO';
    const loggedMessage = JSON.parse(eventMatcher[2]) || {};

    // If the message is a special error which has the code === 'ForceRetryExecution' then ignore it, we use this for enabling internal retries
    if (typeof loggedMessage.code === 'string' && loggedMessage.code.match(/^(ForceRetryExecution)$/i)) {
      continue;
    }

    // Normalize a bunch of properties depending on exactly where the real message data is
    const stringOrObjectMessage = loggedMessage.message || loggedMessage;
    parsedLogEvent.data = typeof stringOrObjectMessage !== 'object' ? { message: stringOrObjectMessage } : stringOrObjectMessage;
    parsedLogEvent.data.level = parsedLogEvent.data.level || fallbackLevel;
    parsedLogEvent.data.stack = parsedLogEvent.data.stack || loggedMessage.stack;
    parsedLogEvent.data.reason = parsedLogEvent.data.reason || loggedMessage.reason;
    parsedLogEvent.data.promise = parsedLogEvent.data.promise || loggedMessage.promise;

  }

  // Actually do something with the message
  await handle(parsedLogEvent);
}

So hopefully that abbreviated mess above shows where the value comes from in our structured logging. Since all of our services log with structure, it’s easy for us to parse them and handle them in a unified way. I highly recommend a consistent logging approach, something like this. Using structured logs allows easy debugging of any service you have, without having to relearn a new pattern. Of course this can differ across teams, but then they’ll have their own needs and their own aggregation systems. And when you want to add additional value to every account’s logging source, you can do it in one place, without updating some library that you force every one of your services in every AWS account to update (as if that were even a real strategy).
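As an aside, handle is where the alerting mentioned earlier actually happens. Ours isn’t shown here, but here’s a minimal sketch of what it could do, assuming ERROR-level events get published to an SNS topic that notifies the team. The environment variable and the SDK v3 wiring are assumptions for illustration, not our actual implementation:

const { SNSClient, PublishCommand } = require('@aws-sdk/client-sns');

const snsClient = new SNSClient({});

async function handle(parsedLogEvent) {
  // Only alert on errors; everything else stays queryable in CW Logs
  if (parsedLogEvent.data.level !== 'ERROR') {
    return;
  }

  // The notification contains the actual problem, which is exactly what a bare CW alarm won't give you
  await snsClient.send(new PublishCommand({
    TopicArn: process.env.ALERT_TOPIC_ARN, // hypothetical topic wired to email/chat
    Subject: `[${parsedLogEvent.logGroup}] ${parsedLogEvent.data.title || 'Error'}`.slice(0, 99),
    Message: JSON.stringify(parsedLogEvent, null, 2)
  }));
}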

The Fallacy

The trouble here is that while we have a great way to handle logs and a great way to log data, we have no way to easily port the logs from one AWS account to another. You would think that, with the infinite ability of an IAM system, you would be able to assign a valid resource policy to the lambda function and use it across AWS accounts. Alas, you cannot. It turns out that building a successful authorization framework is a huge challenge, and while AWS has done a great job thus far, we can attest that managing one and solving for every edge case is a Sisyphean burden. (How do we know? We did it with Authress.)

The only way to port logs from one AWS account to another in an automated fashion (remember, we want a full-service solution; we don’t want to deploy a log subscription lambda function to every region for every account) is to use AWS Kinesis OR AWS Kinesis Firehose.

Wait, those are different things you say? YES they are!

For lack of clear documentation from AWS: Kinesis is a shared database and Kinesis Firehose is a transport mechanism. So you can either stick the data from CloudWatch Logs into a specialized shared DB (Kinesis), or you can delegate the work of transporting the data somewhere else (Kinesis Firehose). Since Kinesis Firehose forces you to stream to a DB, your options are Database or Database. And that Database cannot be CloudWatch Logs, nor does Kinesis support calling lambda directly, because hey, WHY NOT!

Since Kinesis is always on, it costs the wrong kind of money. We want full scalability, so we’ve gone with Firehose. Spin up a Firehose in the logging account, and use that with every CloudWatch subscription.

I can just set a resource policy on my Firehose to allow my whole AWS org access to make subscriptions from CW Logs, right? NOPE, you need to create what are known as custom Log Destinations, and enable other accounts to use that. That’s multiple additional AWS resources to manage.

Oh, also Kinesis Firehose isn’t a valid event source for lambda. “WHAT” — you say. That’s right, you need to funnel the data to an S3 bucket, and then use a Lambda trigger to actually hit the Log parsing Lambda Function.

(And for fun, the data that comes into the lambda via S3 from Kinesis isn’t delimited. The records are directly concatenated. Why in the world it does not automatically put delimiters between the records by default is beyond me. But it’s nothing that a simple .replace(/}{/g, '}\n{').split('\n') can’t fix.)
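If it helps, here’s a minimal sketch of that plumbing, assuming the Lambda is triggered by the S3 notification and the delivered object is gzip-compressed. The SDK v3 client usage and helper names are illustrative, not our production code; each awslogsData it produces is what feeds the parsing loop from earlier:

const zlib = require('zlib');
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');

const s3Client = new S3Client({});

exports.handler = async s3Event => {
  for (let record of s3Event.Records) {
    // Fetch the object Firehose just delivered to the logging sink bucket
    const response = await s3Client.send(new GetObjectCommand({
      Bucket: record.s3.bucket.name,
      Key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
    }));
    const compressedBody = Buffer.from(await response.Body.transformToByteArray());

    // Decompress, re-delimit the concatenated JSON payloads, and parse each one
    const concatenatedJson = zlib.gunzipSync(compressedBody).toString('utf8');
    const awslogsDataList = concatenatedJson.replace(/}{/g, '}\n{').split('\n').map(payload => JSON.parse(payload));

    for (let awslogsData of awslogsDataList) {
      // ...run the parsing loop from "The Logs" section over awslogsData.logEvents
    }
  }
};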

The Solution

As a result, there are a number of moving pieces to this which allow us to aggregate the logs in a single account for alerting purposes. Remember, you want to deploy this in every region, not just one. Your logs should stay in the same region they are generated in:


Multiaccount AWS Architecture Diagram

And the relevant CloudFormation Template to generate these resources in the Logging AWS Account:

  • Create the bucket where we temporarily store the logs:
CrossAccountLogBucket: {
    Type: 'AWS::S3::Bucket',
    Properties: {
      AccessControl: 'Private',
      BucketName: { 'Fn::Sub': '${AWS::AccountId}-${AWS::Region}-cross-account-logging-sink' },
      NotificationConfiguration: {
        LambdaConfigurations: [{ Event: 's3:ObjectCreated:*', Function: { Ref: 'LambdaFunctionAlias' } }]
      }
    }
  }
  • Allow it to directly invoke the Lambda Function LambdaFunctionAlias
S3LambdaInvokePermission: {
    Type: 'AWS::Lambda::Permission',
    Properties: {
      FunctionName: { Ref: 'LambdaFunctionAlias' },
      Action: 'lambda:InvokeFunction',
      Principal: 's3.amazonaws.com',
      SourceAccount: { Ref: 'AWS::AccountId' },
      SourceArn: { 'Fn::Sub': 'arn:aws:s3:::${AWS::AccountId}-${AWS::Region}-cross-account-logging-sink' }
    }
  }
  • Create the Kinesis Firehose
LogDeliveryStream: {
    Type: 'AWS::KinesisFirehose::DeliveryStream',
    Properties: {
      DeliveryStreamName: { 'Fn::Sub': '${serviceName}-${AWS::Region}-Log-Sink' },
      ExtendedS3DestinationConfiguration: {
        BucketARN: { 'Fn::Sub': '${CrossAccountLogBucket.Arn}' },
        RoleARN: { 'Fn::GetAtt': ['LogStreamRole', 'Arn'] }
      }
    }
  }
  • Allow the Firehose to write to S3
LogStreamRole: {
    Type: 'AWS::IAM::Role',
    Properties: {
      RoleName: { 'Fn::Sub': '${serviceName}-${AWS::Region}-CrossAccountKinesisLogStream' },
      AssumeRolePolicyDocument: {
        Statement: [{
          Effect: 'Allow',
          Principal: { Service: ['firehose.amazonaws.com'] },
          Action: ['sts:AssumeRole'],
          Condition: { StringEquals: { 'sts:ExternalId': { Ref: 'AWS::AccountId' } } }
        }]
      },
      Policies: [
        {
          PolicyName: 'S3DeliveryAccess', // PolicyName is required for inline policies; the name itself is up to you
          PolicyDocument: {
            Statement: [
              {
                Effect: 'Allow', Action: ['s3:PutObject'],
                Resource: [{ 'Fn::Sub': '${CrossAccountLogBucket.Arn}' }, { 'Fn::Sub': '${CrossAccountLogBucket.Arn}/*' }]
              },
              {
                Effect: 'Allow', Action: ['kinesis:GetRecords'],
                Resource: [{ 'Fn::Sub': 'arn:aws:kinesis:${AWS::Region}:${AWS::AccountId}:stream/${serviceName}-${AWS::Region}-Log-Sink*' }]
              }
            ]
          }
        }
      ]
    }
  }
  • Create the CloudWatch Destination which can write to Firehose
AggregateLogEventsSubscriptionDestination: {
    Type: 'AWS::Logs::Destination',
    Properties: {
      DestinationName: { 'Fn::Sub': '${serviceName}-CrossAccountLogStream' },
      RoleArn: { 'Fn::GetAtt': ['CloudWatchDelegatedRole', 'Arn'] },
      TargetArn: { 'Fn::Sub': '${LogDeliveryStream.Arn}' },
      DestinationPolicy: {
        'Fn::Sub': JSON.stringify({
          Statement: [{
            Effect: 'Allow', Principal: { AWS: '*' },
            Action: 'logs:PutSubscriptionFilter',
            Resource: 'arn:aws:logs:${AWS::Region}:${AWS::AccountId}:destination:${serviceName}-CrossAccountLogStream',
            Condition: {
              StringEquals: { 'aws:PrincipalOrgID': [AWSOrgID] }
            }
          }]
        })
      }
    }
  }
  • And enable it to write to Firehose
CloudWatchDelegatedRole: {
    Type: 'AWS::IAM::Role',
    Properties: {
      RoleName: { 'Fn::Sub': '${serviceName}-${AWS::Region}-CloudWatchCrossAccountAccess' },
      AssumeRolePolicyDocument: {
        Statement: [{
          Effect: 'Allow',
          Principal: { Service: [{ 'Fn::Sub': 'logs.${AWS::Region}.amazonaws.com' }] },
          Action: ['sts:AssumeRole'],
          Condition: {
            StringEquals: { 'aws:PrincipalOrgID': [AWSOrgID] }
          }
        }]
      },
      Policies: [{
        PolicyName: 'FirehoseAccess',
        PolicyDocument: {
          Statement: [{
            Effect: 'Allow', Action: ['firehose:PutRecord'],
            Resource: [{ 'Fn::Sub': '${LogDeliveryStream.Arn}' }]
          }]
        }
      }]
    }
  }

The last step is to create a subscription filter in the log source accounts on the existing CloudWatch Log Groups, and you are done.
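For reference, that subscription filter in each source account is just one more resource pointing at the Destination in the logging account. A minimal sketch in the same style as the template above, assuming a LoggingAccountId parameter, a ServiceLogGroup resource to forward, and the common space-delimited Lambda log filter pattern (which is also what produces the extractedFields used by the parser earlier):

CrossAccountSubscriptionFilter: {
    Type: 'AWS::Logs::SubscriptionFilter',
    Properties: {
      // Whichever existing log group you want to forward to the logging account
      LogGroupName: { Ref: 'ServiceLogGroup' },
      // This pattern populates logEvent.extractedFields (timestamp, request_id, event) in the aggregator
      FilterPattern: '[timestamp=*Z, request_id="*-*", event]',
      DestinationArn: { 'Fn::Sub': 'arn:aws:logs:${AWS::Region}:${LoggingAccountId}:destination:${serviceName}-CrossAccountLogStream' }
    }
  }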
