Sebastian Korfmann

Posted on Jul 25, 2023

A Cloud Development Troubleshooting Treasure Hunt

#terraform #iac #aws #cloud

While building a small example project with the Website SDK resource of Winglang, I stumbled into a troubleshooting quest caused by some unexpected behavior.

In summary, the application I've created is depicted as a Wing diagram, and you can interact with the same version on the Wing playground.

The specifics of what this application does are not our focus today. However, it's crucial to note that the Website resource is composed of a CloudFront Distribution that serves as a proxy to a public S3 Bucket, which has been configured for hosting. Today, these will be our main subjects of discussion.

A Brief Introduction to Wing

Wing integrates both infrastructure and runtime code into a single language. This integration empowers developers to remain in their creative flow, enabling the delivery of superior, faster, and more secure software.

Wing is designed to compile down to various targets. This feature provides the flexibility to execute tests against each of these targets. You have the option to perform swift tests against the Wing simulator by using the command wing test main.w. Alternatively, you can run a fully integrated test against the cloud provider of your choice with wing test --progress -t tf-aws main.w. This command will render the application down to Terraform and JavaScript, allowing you to deploy the entire application and run the test against it, all without modifying a single line of code.

Understanding the Wing Examples Repository

The examples repository leverages GitHub Actions to test each of the examples against both the simulator and the tf-aws target.

The setup employs a GitHub/AWS OIDC configuration, utilizing a finely scoped AWS IAM role to manage the resources of all the examples within the repository. This aspect is vital, not just for security purposes, but also to trace the thought process that will be explained later.

First Act - Building & Testing Locally

One of the great features of Wing is that it allows you to build and test your entire application locally on your own computer. Since the application itself is rather straightforward, I quickly assembled some resources and gave it a go with the wing it command.

The impressive web interface for my small application is also functioning perfectly ;)

After adding a test, I was pleased by the speed and simplicity of the whole process.

Naturally, running the same test against AWS took a bit more time because it involved a CloudFront Distribution. However, the deployment and tests also executed flawlessly on my machine.

Now, let's move on to the next step.

Second Act - Creating the Pull Request & So Begins the Treasure Hunt

I created a pull request to initiate the previously mentioned GitHub Actions workflow. This would run all the tests in the CI setup. When adding earlier examples, I had already run into permission issues with the narrowly scoped CI role, so I wasn't too surprised to see some arise again.

✖ terraform apply
Command failed: terraform apply -auto-approve
╷
│ Error: creating CloudFront Distribution: AccessDenied: User: arn:aws:sts::***:assumed-role/wing-examples/gh-actions-winglang-examples is not authorized to perform: cloudfront:CreateDistribution on resource: arn:aws:cloudfront::***:distribution/* because no identity-based policy allows the cloudfront:CreateDistribution action
- terraform destroy
│     status code: 403, request id: d1dfd897-05fa-4c0c-a69f-f082f9c1af28
│
│   with aws_cloudfront_distribution.env0_cloudWebsite_Distribution_5CC04CFB,
│   on main.tf.json line 133, in resource.aws_cloudfront_distribution.env0_cloudWebsite_Distribution_5CC04CFB:
│  133:       }

To fix this, I proceeded to add the necessary permissions to manage CloudFront distributions for the CI role. I clicked on re-run, only to encounter another strange error related to S3.

Command failed: terraform apply -auto-approve
╷
│ Error: Error putting S3 policy: AccessDenied: Access Denied
│     status code: 403, request id: V1A0DQ4Y85ZJQQ6K, host id: wR65P8tjzKvu+1f7RZMhBJ65aFNWYCBVS90ck43Ji+YNDyHJw140V0dz5j/6XV+AHiDKG7ktrfM=
│
│   with aws_s3_bucket_policy.env0_cloudWebsite_PublicPolicy_67A62A0C,
│   on main.tf.json line 394, in resource.aws_s3_bucket_policy.env0_cloudWebsite_PublicPolicy_67A62A0C:
│  394:       }

Hmm, an Error putting S3 policy: AccessDenied: Access Denied for aws_s3_bucket_policy seemed unusual. It pointed to the s3:PutBucketPolicy action.

Trap Door Number 1 - IAM Role Permissions?

So, what's up with that s3:PutBucketPolicy permission, which is pretty obviously missing? It's working locally with admin rights for the very same AWS account, after all.

I double-checked the permissions for the CI role and even set s3:* and eventually *, but to no avail. Puzzling.

I decided to do some research on Google to see if I could uncover any helpful information. I found several discussions related to recent changes in AWS concerning public buckets. The threads suggested that one now has to explicitly configure the public settings for an S3 bucket. This seemed promising, so I checked the implementation of the Wing SDK bucket, which the website was using. It looked reasonable.

  if (isPublic) {
    const publicAccessBlock = new S3BucketPublicAccessBlock(
      scope,
      "PublicAccessBlock",
      {
        bucket: bucket.bucket,
        blockPublicAcls: false,
        blockPublicPolicy: false,
        ignorePublicAcls: false,
        restrictPublicBuckets: false,
      }
    );
    const policy = {
      Version: "2012-10-17",
      Statement: [
        {
          Effect: "Allow",
          Principal: "*",
          Action: ["s3:GetObject"],
          Resource: [`${bucket.arn}/*`],
        },
      ],
    };
    new S3BucketPolicy(scope, "PublicPolicy", {
      bucket: bucket.bucket,
      policy: JSON.stringify(policy)
    });
  }

However, this path seemed to be a dead end. It was time to go back to the drawing board.

Trap Door Number 2 - What Else Could Cause This Behavior?

To recap, we're dealing with a situation where a Terraform application can be deployed from my local machine, where I have admin rights to the AWS account. However, when deploying the same application to the same AWS account from GitHub Actions, the deployment fails due to permission issues. The GitHub runner uses OIDC for authentication and authorization. Could it be related to this configuration? Or could organization policies disallowing public buckets be at play?

To get more insight, I decided to examine the CloudTrail logs to see if anything unusual might catch my eye. The most conspicuous difference was the presence of webIdFederationData in the sessionContext for the CI role.

"webIdFederationData": {
    "federatedProvider": "arn:aws:iam::<accountId>:oidc-provider/token.actions.githubusercontent.com",
    "attributes": {}
},

It seems unlikely that this should make a difference, but let's confirm this. I assumed the CI role on my local machine and ran the AWS test again. Unsurprisingly, the operation was just as successful as before.
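To make this kind of comparison between local and CI sessions less manual, a small helper could flag which CloudTrail events were made with a federated session. This is only a sketch: the `CloudTrailEvent` type models just the fields quoted above plus `eventName`, and the function names `isFederatedSession` and `federatedEventNames` are hypothetical.

```typescript
// Minimal shape of a CloudTrail record; only the fields used here are modeled.
interface CloudTrailEvent {
  eventName: string;
  userIdentity: {
    sessionContext?: {
      webIdFederationData?: {
        federatedProvider?: string;
        attributes?: { [key: string]: string };
      };
    };
  };
}

// Hypothetical helper: true when the session context names a federated
// provider, as it does for the GitHub OIDC role in the excerpt above.
function isFederatedSession(event: CloudTrailEvent): boolean {
  const context = event.userIdentity.sessionContext;
  if (context === undefined || context.webIdFederationData === undefined) {
    return false;
  }
  return typeof context.webIdFederationData.federatedProvider === "string";
}

// Hypothetical helper: names of all events issued from federated sessions.
function federatedEventNames(events: CloudTrailEvent[]): string[] {
  return events.filter(isFederatedSession).map((event) => event.eventName);
}
```

Running `federatedEventNames` over the events exported for the failing and the successful run would show at a glance which calls came from the OIDC session. As the experiment above suggests, though, the federation data turned out not to be the cause.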

I briefly considered the potential influence of Organizational policies or session policies, only to conclude that it's quite certain we're not dealing with a permissions issue after all. Time to move on.

Trap Door Number 3 - Is It a Terraform Bug?

The next step was to delve deeper into the actual GitHub Actions workflow.

Could the answer lie within the Terraform trace logs? Before proceeding with this, I needed to ensure that this wouldn't expose any sensitive data given that we're working on a public repository. An alternative approach could be to SSH into the GitHub runner instead. Or perhaps use act to simulate the situation locally? However, neither seemed feasible as getting OIDC to work or obtaining a quick shell via the command line wasn't possible. Back to the logging idea then.

The examination of the logs seemed safe enough for public visibility. I enlisted the help of ChatGPT to modify the workflow to print out logs when a failure occurs.

- name: Execute wing test in matrix directory
  env:
    TF_LOG: info
    TF_LOG_PATH: ${{ runner.workspace }}/terraform.log
  run: cd ${{ matrix.example.directory }} && wing test --debug -t tf-aws main.w
- name: Output Terraform log
  if: failure()
  run: cat ${{ runner.workspace }}/terraform.log

Alright, that should do it. Now let's analyze what we've got. As a precaution, I decided to double-check the versions of Terraform and the AWS Provider. Both Terraform 1.5.x and AWS Provider 4.65 looked correct for both local and CI environments, so nothing suspicious there.

I then proceeded to comb through the vast log output - Terraform trace logs are indeed very detailed. In doing so, I came across a particularly intriguing find:

2023-07-19T15:49:49.105Z [WARN]  Provider "provider[\"registry.terraform.io/hashicorp/aws\"]" produced an unexpected new value for aws_s3_bucket_public_access_block.env0_cloudWebsite_PublicAccessBlock_E7BC7F4B, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .block_public_acls: was cty.False, but now cty.True
      - .block_public_policy: was cty.False, but now cty.True
      - .ignore_public_acls: was cty.False, but now cty.True
      - .restrict_public_buckets: was cty.False, but now cty.True

Keen observers may notice that this isn't just a TRACE log; it's actually a WARN statement. Hmm, what's going on here? It indeed concerns the aws_s3_bucket_public_access_block resource, which we previously verified was correctly configured in the Wing SDK implementation. If this is throwing an error, and the aws_s3_bucket_public_access_block values must be false to define a public S3 Bucket, it's no surprise things are going awry. But wait a minute - why is it working locally again? Could it be... perhaps... a bug in Terraform?
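When combing through a log of this size, it can help to pull out just the attribute mismatches that such a warning reports. The following is a minimal sketch, assuming the line format shown in the excerpt above (`- .<attribute>: was cty.<value>, but now cty.<value>`). The `AttributeDrift` type, its `planned`/`actual` labels, and `parseDriftLines` are hypothetical names and not part of Terraform.

```typescript
// One mismatch between the value Terraform expected and the value the provider reported.
interface AttributeDrift {
  attribute: string;
  planned: string; // the "was cty.X" side of the message
  actual: string;  // the "but now cty.Y" side of the message
}

// Hypothetical helper: extracts every "- .attr: was cty.X, but now cty.Y" line from a log.
function parseDriftLines(log: string): AttributeDrift[] {
  const pattern = /^\s*- \.(\w+): was cty\.(\w+), but now cty\.(\w+)\s*$/;
  const drifts: AttributeDrift[] = [];
  log.split("\n").forEach((line) => {
    const match = pattern.exec(line);
    if (match !== null) {
      drifts.push({ attribute: match[1], planned: match[2], actual: match[3] });
    }
  });
  return drifts;
}
```

Applied to the warning above, this would yield four entries, each with `planned` equal to `False` and `actual` equal to `True`, which makes the contradiction with the configured public access block easy to spot.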

Third Act - Unraveling the Mystery

Well, at least we now have a promising lead. Some diligent googling and browsing through GitHub issues in the AWS provider project yielded no directly related findings. However, I did come across a few recent bug reports about the change AWS had recently made to the treatment of public buckets. And interestingly, they described precisely the behavior I was encountering.

Error putting S3 policy: AccessDenied: Access Denied

What's more, within these very issues, I stumbled upon a potential workaround.

resource "aws_s3_bucket_policy" "b" {
  ...
  depends_on = [aws_s3_bucket_public_access_block.b]
}

Thus, the good news: it appears this isn't truly a Terraform bug, although the 'WARN' log from earlier remains mysterious; perhaps it's just an AWS hiccup. Furthermore, it seems like this might simply be an issue with dependencies. However, what still baffles me is why this problem only surfaces in the CI environment and not locally.

Fourth Act - Hooray - We've Found the Treasure

It's time for the grand finale - attempting to replicate the CI behavior locally and identify the simplest example that demonstrates the issue.

Given that it seems to work on macOS, it might be related to some specific behaviors in Linux or Docker. I decided to use the Docker image I created for the Wing GitHub Action - ghcr.io/winglang/wing-github-action:v0.1.0 to start my investigation.

Again, I'll double-check the Terraform versions just to be sure:

Terraform v1.5.0
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v4.65.0

Terraform v1.5.2
on darwin_amd64
+ provider registry.terraform.io/hashicorp/aws v4.65.0

With the setup ready, it was time to initiate a test run within the Docker container.

Success! We were able to reproduce the error. The next step was to trim down the Terraform configuration. After several iterations, it was narrowed down to these three critical components:

aws_s3_bucket
aws_s3_bucket_policy
aws_s3_bucket_public_access_block

Interestingly, as the number of resources decreased, so did the error rate. The full Terraform application failed in about half of the attempts when run in the Docker container on my machine. In contrast, the slimmed-down version only failed in about one-third of the attempts. Strikingly, all builds failed when run on the GitHub Actions runner.

To quantify this, an informal test was performed, applying and destroying the minimal setup ten times in succession:

Linux: 3/10 runs failed
Mac: 0/10 runs failed
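An informal test like this one is easy to script. The sketch below only counts outcomes: `runOnce` is a hypothetical callback standing in for one apply/destroy cycle, and how it actually invokes Terraform is deliberately left open. `measureFlakiness` and `formatReport` are likewise illustrative names.

```typescript
// Result of repeating an apply/destroy cycle several times.
interface FlakinessReport {
  runs: number;
  failures: number;
}

// `runOnce` is a hypothetical callback that performs one apply/destroy cycle
// and returns true on success. A thrown error is counted as a failed run.
function measureFlakiness(
  runs: number,
  runOnce: (attempt: number) => boolean
): FlakinessReport {
  let failures = 0;
  for (let attempt = 1; attempt <= runs; attempt++) {
    let succeeded = false;
    try {
      succeeded = runOnce(attempt);
    } catch {
      succeeded = false;
    }
    if (!succeeded) {
      failures++;
    }
  }
  return { runs, failures };
}

// Formats a report in the same style as the tallies above.
function formatReport(label: string, report: FlakinessReport): string {
  return `${label}: ${report.failures}/${report.runs} runs failed`;
}
```

With only ten runs per platform the numbers remain anecdotal, so a harness like this mainly helps when the run count is increased substantially.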

Applying a fix – an explicit depends_on relation between the aws_s3_bucket_policy and aws_s3_bucket_public_access_block resources – successfully resolved the error on Linux. So, it appears we've found our culprit: a race condition among resources that sometimes allows for success, and other times results in failure.
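Since Wing renders the application down to Terraform JSON, the fix can be pictured as one extra `depends_on` entry on the generated policy resource. The following is a minimal sketch of that output as plain objects and not the actual Wing SDK change; `publicBucketResources`, the resource name parameter, and the `bucket_prefix` value are assumptions made for illustration.

```typescript
// Loosely typed view of Terraform JSON resource blocks.
type ResourceBlock = { [argument: string]: string | boolean | string[] };
type TerraformResources = { [type: string]: { [name: string]: ResourceBlock } };

// Hypothetical generator for the three resources of the minimal reproduction.
// With `withFix` set, the policy explicitly waits for the public access block.
function publicBucketResources(name: string, withFix: boolean): TerraformResources {
  const bucketRef = "${aws_s3_bucket." + name + ".bucket}";
  const policy: ResourceBlock = {
    bucket: bucketRef,
    policy: JSON.stringify({
      Version: "2012-10-17",
      Statement: [
        {
          Effect: "Allow",
          Principal: "*",
          Action: ["s3:GetObject"],
          Resource: ["${aws_s3_bucket." + name + ".arn}/*"],
        },
      ],
    }),
  };
  if (withFix) {
    // In Terraform JSON syntax, depends_on is a list of reference strings.
    policy["depends_on"] = ["aws_s3_bucket_public_access_block." + name];
  }
  return {
    aws_s3_bucket: { [name]: { bucket_prefix: name + "-" } },
    aws_s3_bucket_public_access_block: {
      [name]: {
        bucket: bucketRef,
        block_public_acls: false,
        block_public_policy: false,
        ignore_public_acls: false,
        restrict_public_buckets: false,
      },
    },
    aws_s3_bucket_policy: { [name]: policy },
  };
}
```

Both the policy and the access block reference the bucket, so Terraform already orders them after it. Nothing orders them relative to each other until the `depends_on` entry is present.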

Failed run

aws_s3_bucket.cloudWebsite_WebsiteBucket_EB03D355: Creating...
aws_s3_bucket.cloudWebsite_WebsiteBucket_EB03D355: Creation complete after 2s [id=cloud-website-c8e58765-20230719204937669500000001]
aws_s3_bucket_policy.cloudWebsite_PublicPolicy_44BB71F3: Creating...
aws_s3_bucket_public_access_block.cloudWebsite_PublicAccessBlock_18A70311: Creating...
aws_s3_bucket_public_access_block.cloudWebsite_PublicAccessBlock_18A70311: Creation complete after 1s [id=cloud-website-c8e58765-20230719204937669500000001]
╷
│ Error: Error putting S3 policy: AccessDenied: Access Denied

Successful run

aws_s3_bucket.cloudWebsite_WebsiteBucket_EB03D355: Creating...
aws_s3_bucket.cloudWebsite_WebsiteBucket_EB03D355: Creation complete after 2s [id=cloud-website-c8e58765-20230719205053652900000001]
aws_s3_bucket_public_access_block.cloudWebsite_PublicAccessBlock_18A70311: Creating...
aws_s3_bucket_policy.cloudWebsite_PublicPolicy_44BB71F3: Creating...
aws_s3_bucket_policy.cloudWebsite_PublicPolicy_44BB71F3: Creation complete after 1s [id=cloud-website-c8e58765-20230719205053652900000001]
aws_s3_bucket_public_access_block.cloudWebsite_PublicAccessBlock_18A70311: Creation complete after 1s [id=cloud-website-c8e58765-20230719205053652900000001]

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

The order of resource creation notably varied. For a deeper technical explanation of the fix, feel free to explore the corresponding pull request.
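Why the order can vary becomes clearer when the three resources are viewed as a dependency graph. The sketch below is a deliberately simplified model and not Terraform's actual graph walker: it groups resources into "waves" whose members have no ordering among themselves. The short resource names and the `creationWaves` function are illustrative.

```typescript
// Resource name -> names of the resources it must wait for.
type DependencyGraph = { [resource: string]: string[] };

// Groups resources into waves. Every member of a wave has all of its
// dependencies satisfied by earlier waves, so members of the same wave
// may be created in any order or concurrently.
function creationWaves(graph: DependencyGraph): string[][] {
  const done: string[] = [];
  const waves: string[][] = [];
  let pending = Object.keys(graph).sort();
  while (pending.length > 0) {
    const ready = pending.filter((resource) =>
      graph[resource].every((dependency) => done.indexOf(dependency) !== -1)
    );
    if (ready.length === 0) {
      throw new Error("dependency cycle or unknown dependency");
    }
    waves.push(ready);
    ready.forEach((resource) => done.push(resource));
    pending = pending.filter((resource) => ready.indexOf(resource) === -1);
  }
  return waves;
}

// Without the fix, both the policy and the access block only wait for the bucket.
const withoutFix: DependencyGraph = {
  bucket: [],
  public_access_block: ["bucket"],
  policy: ["bucket"],
};

// With the fix, the policy also waits for the access block.
const withFix: DependencyGraph = {
  bucket: [],
  public_access_block: ["bucket"],
  policy: ["bucket", "public_access_block"],
};
```

Here `creationWaves(withoutFix)` returns `[["bucket"], ["policy", "public_access_block"]]`, so the policy and the access block share a wave and can race. `creationWaves(withFix)` returns `[["bucket"], ["public_access_block"], ["policy"]]`, which forces the policy to be created last.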

In a nutshell

The fix boils down to establishing the dependency outlined above, and it was achieved through a mere 2-line change. As for the discrepancies between operating systems, my guess is that it's tied to the availability of computing resources, although this is purely speculative. It might be illuminating to run the tests hundreds or even thousands of times to see if the error would appear on my local macOS machine. But considering the problem is resolved now, I'm content with leaving the matter as is. You are welcome, though, to offer your speculations in the comments!

Despite the relatively simple resolution, this bug proved to be particularly elusive. It's difficult to pin the blame on any single party in this instance. The Wing resource was likely developed before the changes to S3 permissions and probably on a Mac. Meanwhile, the integration tests are still in their infancy. As for Terraform, it's unclear whether the dependency was mandatory before AWS altered S3 bucket permissions. Nevertheless, helpful pointers in the resource documentation would have been appreciated (there is indeed an issue open for this). Alternatively, a specific "public bucket policy" could be beneficial, though that's maybe a little too far-fetched. All in all, a great example of moving targets in cloud engineering.

Key Lessons

This episode embodied the perfect storm of challenges, with cryptic error messages, slight variations in behavior across different operating systems, and a dollop of my own confirmation bias. Yet, it serves as a striking example of the hurdles one might encounter in cloud development. I have no doubt that this particular issue alone consumed countless hours of troubleshooting time, not to mention the time spent grappling with IAM issues in general. The AWS Terraform provider repository hosts numerous examples of similar dilemmas, and AWS CDK is no stranger to such issues either.

Developing in the cloud (like AWS) can be an absolute delight when everything slots seamlessly into place, freeing you from substantial operational burdens. However, when things go awry, it can feel like an insurmountable barrier. Frequently, we're left with limited error context, inadequate or non-existent logs, ever-changing targets, and outdated documentation or blog posts.

Projects like Wing are making strides towards simplifying the complex task of developing and operating applications in the cloud. The cloud resource SDKs being developed encapsulate the collective hard-earned knowledge and experience, distilling it into reusable components. While the system might not be perfect or bug-free at this stage, there's an incredible team and an active, albeit small, community at the ready to offer assistance.

Finally, if I could make one request of AWS, it would be to offer greater error context surrounding permission issues. If not directly in the API call error message, then at least within CloudTrail. I'm convinced that such a feature would save an unfathomable amount of engineering hours daily.

Perhaps such a feature already exists, and I'm just not aware of it? Please let me know in the comments! I'm also eager to hear about your experiences with similar challenges.
