Introduction
Towards the end of 2019, I said I would look into the Infrastructure as Code setup at my current workplace and try to improve it. Specifically, I promised that I'd get our DevOps capability to a point where I can build our system in full and tear it down once I'm finished, that it'll be easy for developers to use and change, that it'll be reliable and version controlled... basically a whole lot of standard promises that you seem to make as a DevOps/Site Reliability Engineer. I looked into a variety of technologies to achieve this and I've more or less settled now on AWS CDK (at least for the AWS stuff). Having used it for a bit, I decided it's well worth a quick article. I haven't been able to find an article from someone who's really tried to use it in anger, so hopefully this is of interest to some.
In a nutshell: CDK is great. I really think it's a game changer in the DevOps space. In this article I'll explain why I think this, what the motivation was for using it and what sets CDK apart from the other options. It's quickly become my favourite IaC tool on the market, having previously always used things like Terraform, bespoke bash scripts, Azure ARM templates, that kind of thing.
I'm intrigued to see where CDK for Terraform goes as well. However, this article does not touch on it at all and is purely about AWS CDK, which is almost certainly the more mature of the two at the time of writing.
What's not so great about CDK
So let's get the downsides out of the way first. CDK is quirky, sometimes behaves unexpectedly, still has some work to do on documentation (though it's pretty good), still feels a bit immature, even buggy in places, and I wouldn't blame folks for waiting around a bit for it to "stabilise" and "become a bit more production ready". As with any new tool, eventually common problems will either disappear as better solutions are found, or StackOverflow will become flooded with helpful tips/tricks, or more people will write helpful articles about it.
The biggest drawback really for CDK is that it's AWS specific and uses CloudFormation under the hood. I mean that's by design really, but it is kind of the most disappointing thing about it. I think you can write modules in it but there's little to no documentation at present and even having read through some of the code I wouldn't know where to begin (I can't find many examples of people who have done this yet). Terraform has a multitude of providers; CDK doesn't come anywhere near that level of support (and I'm guessing won't any time soon). Even for AWS I've found things missing that Terraform handles fairly well, like posting Swagger to an AWS API Gateway. I have managed to find workarounds (I'll hopefully be posting one soon about using Terraform with CDK and custom resources in Lambdas) but they're not ideal.
CDK is clearly bound by CF (CloudFormation) under the hood, which at times is just a bit ropey. It also means you have to get to grips with both CDK and CF. Sometimes CDK can improve on this, which is great, but it's still using CF under the hood and you can tell. Examples I've come across are:
- I wanted to create two indexes on a DynamoDB table: you can't do this in parallel. If you try to create two indexes in CF it will try to create them in parallel and fail. There is a workaround I found where you create them sequentially (one after the other) in a Lambda and then create a CDK CustomResource that polls that Lambda. It would be much simpler if CF just knew it needs to do this sequentially. In my opinion this is a CF weakness that CDK could potentially get around this way, but it shouldn't really need to.
- CloudFormation is not great at deleting stuff and therefore neither is CDK. Sometimes this is great, when you would otherwise accidentally delete stuff, but sometimes it can be very frustrating and time consuming. When I'm developing I want to be able to destroy a stack and then recreate it; however, with CF I need to go through and destroy the DynamoDB tables, S3 buckets, Route 53 hosted zones, etc., otherwise it will fail when I rerun it because these are already there. This is almost certainly a CF problem but it still affects CDK.
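As a rough sketch of the sequential index workaround described above: the function below creates one index at a time and waits for it to become ACTIVE before starting the next. It is not the code from the workaround itself. The function names and the polling defaults are made up for illustration, and `client` is assumed to behave like a boto3 DynamoDB client (`boto3.client("dynamodb")`), using its `describe_table` and `update_table` calls.

```python
import time
from typing import Any, Dict, List


def index_status(client: Any, table_name: str, index_name: str) -> str:
    """Return the IndexStatus of one index, or "MISSING" if it is not listed."""
    table = client.describe_table(TableName=table_name)["Table"]
    for gsi in table.get("GlobalSecondaryIndexes", []):
        if gsi["IndexName"] == index_name:
            return gsi["IndexStatus"]
    return "MISSING"


def create_indexes_sequentially(
    client: Any,
    table_name: str,
    attribute_definitions: List[Dict[str, str]],
    indexes: List[Dict[str, Any]],
    poll_seconds: float = 10.0,  # hypothetical default, tune to taste
    max_polls: int = 180,        # hypothetical default, tune to taste
) -> List[str]:
    """Create each index in turn, waiting for ACTIVE before starting the next."""
    created = []
    for index in indexes:
        name = index["IndexName"]
        # Only request creation if the index is not already there, so a
        # retried invocation does not fail on an index it created earlier.
        if index_status(client, table_name, name) == "MISSING":
            client.update_table(
                TableName=table_name,
                AttributeDefinitions=attribute_definitions,
                GlobalSecondaryIndexUpdates=[{"Create": index}],
            )
        # Poll until this index is ACTIVE; only then move on to the next one.
        for _ in range(max_polls):
            if index_status(client, table_name, name) == "ACTIVE":
                break
            time.sleep(poll_seconds)
        else:
            raise TimeoutError(f"index {name} did not become ACTIVE in time")
        created.append(name)
    return created
```

Each entry in `indexes` is the same dictionary the DynamoDB API expects under `Create` (an `IndexName`, a `KeySchema` and a `Projection`, plus throughput settings for provisioned tables). Index creation on a large table can easily outlast a single Lambda invocation, which is presumably why the workaround has the custom resource poll rather than wait in one call.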
Also, we recently moved to using nested stacks and now we can no longer get useful diffs: they just tell you that your asset hash changed but not what changed within the nested stack. Worse still, it deploys everything without prompting me when I run a deploy, so I no longer have any idea what it will deploy without trying it. This one was a real doozy, as we ended up rolling back some EC2 instances that were being used by the dev team just by putting a syntax error in something. They were not happy, and had it been prod this would have been a disaster (though yes, you're quite right, we shouldn't be trying stuff out on stuff that's being used; that was our bad and we don't any more).
On a similar note: I've seen it so many times where, if I put a syntax error in my code, it will just deploy whatever it has found so far and destroy everything it hasn't found (if I auto-approve or use nested stacks, anyway). Surely the default should be that if there's a syntax error it does nothing. Maybe this one is JS specific, as a compiler would catch it in Java, say, but surely if I put a syntax error in I should have to fix that before CDK does anything. As it stands, if I put a syntax error in right at the top it will literally tear down everything.
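One way to get some visibility back when the diff only reports a changed asset hash is to compare the templates yourself. The sketch below is an assumption on my part rather than anything CDK provides: `diff_resources` is a made-up helper that takes two CloudFormation templates already loaded as dictionaries (for example the currently deployed nested template and the freshly synthesized one) and compares their top-level `Resources` sections by logical ID.

```python
from typing import Any, Dict, List


def diff_resources(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """List the logical IDs added, removed or changed between two templates."""
    old_res = old.get("Resources", {})
    new_res = new.get("Resources", {})
    shared = set(old_res) & set(new_res)
    return {
        "added": sorted(set(new_res) - set(old_res)),
        "removed": sorted(set(old_res) - set(new_res)),
        # A resource counts as changed if anything in its definition differs.
        "changed": sorted(key for key in shared if old_res[key] != new_res[key]),
    }
```

A non-empty `removed` list is the thing to look for before deploying: it would have flagged the EC2 rollback described above. This only says which resources differ, not what CloudFormation will do about it (update in place or replace).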
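Until the tool refuses to deploy on a syntax error by itself, a small guard in front of the deploy step can do it. The sketch below is for a Python CDK app and uses only the standard library; `find_syntax_errors` and `safe_to_deploy` are hypothetical names, and the idea of wiring it in before `cdk deploy` is an assumption about your pipeline, not a CDK feature. A JS project would need the equivalent check from its own toolchain.

```python
import ast
from pathlib import Path
from typing import Iterable, List


def find_syntax_errors(paths: Iterable[Path]) -> List[str]:
    """Parse each file without running it and collect any syntax errors."""
    errors = []
    for path in paths:
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors.append(f"{path}:{exc.lineno}: {exc.msg}")
    return errors


def safe_to_deploy(source_dir: Path) -> bool:
    """Return True only if every .py file under source_dir parses cleanly."""
    errors = find_syntax_errors(sorted(source_dir.rglob("*.py")))
    for line in errors:
        print(line)
    return not errors
```

In a deploy script this would run first, exiting with a non-zero status when `safe_to_deploy` returns False so that nothing is synthesized or deployed from half-read code.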
Why do I love CDK?
So with all these drawbacks, and some of them are big ones, why do I think it's the best thing since sliced bread? Well, it's got a great helpful community who are happy to answer questions, it's frequently coming out with updates that make it tangibly and noticeably better, it's fast and it's fun to work with.
Fundamentally though, it all comes down to one specific reason: code is better suited to this task than configuration or DSLs. It really is true Infrastructure as Code, and this is what makes it such a game changer for me. It's not complex configs that are unreadable (I've previously seen JSON/YAML files running to thousands of lines that just seem like a nightmare to maintain). It's possible to put logic into it without having to learn a whole extra DSL (yes Terraform, I'm looking at you). I get to choose from TypeScript, Python, Java or any .NET language (I understand more are likely to be added in time). JS/TS is the one you'll get most examples in and most help from the community with, plus some tools exist there that don't in the others (like a testing assertion library, which is really cool). Aside from this, from what I can tell they're all about as good as each other, and I can define my whole AWS infrastructure in an AWS account using my language of choice (NB: I've only tried Python though, and did change to JS in the end for the reasons just listed).
Now I can hear a lot of DevOps people shouting: but Infrastructure as Code should be configs, and if it's not fairly straightforward configs it's just too complex and that should be a huge code smell. I hear you, but I disagree for the following reasons:
- Firstly, code can be much more readable than configs. With configs I have to use JSON or YAML or whatever and have little control over the structure, I'm constantly having to refer to documentation to work out what something does, and in JSON I can't even put comments in explaining why I'm doing something (yes, in an ideal world I shouldn't have to, but we all know it isn't one). With code, I can write it in such a way that it's as clear as I want. Badly written code won't have this of course, but that's on the head of whoever wrote it. I'd much rather see something like the following:
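    # CDK v1 module names; the aws_core alias is assumed from its use below.
    # environment, lambda_role, lambda_config, lambda_function,
    # code_bucket_name and lambda_code are defined earlier (cut for brevity).
    from aws_cdk import aws_iam, aws_lambda, aws_s3, core as aws_core

    # The Lambda function for Service ABC.
    service_abc_lambda = aws_lambda.Function(
        self,
        "LambdaServiceABC",
        runtime=aws_lambda.Runtime.PYTHON_3_6,
        handler=lambda_function.handler,
        code=aws_lambda.S3Code.from_bucket(code_bucket_name, lambda_code),
        environment=environment,
        memory_size=128,
        role=lambda_role,
        timeout=aws_core.Duration.seconds(lambda_config.get("timeout", 30)),
    )

    # A private, encrypted bucket for the service.
    bucket_name = "service-abc-awesome-bucket"
    bucket = aws_s3.Bucket(
        self,
        "BucketForServiceABC",
        bucket_name=bucket_name,
        encryption=aws_s3.BucketEncryption.S3_MANAGED,
        block_public_access=aws_s3.BlockPublicAccess.BLOCK_ALL,
    )

    # Let the Lambda's role read, write and delete objects in the bucket.
    principals = [aws_iam.ArnPrincipal(service_abc_lambda.role.role_arn)]
    bucket.add_to_resource_policy(
        aws_iam.PolicyStatement(
            resources=[bucket.bucket_arn, bucket.bucket_arn + "/*"],
            actions=[
                "s3:GetObject*",
                "s3:ListBucket",
                "s3:PutObject*",
                "s3:DeleteObject",
            ],
            principals=principals,
        )
    )

    # Tell the Lambda which bucket to talk to.
    service_abc_lambda.add_environment("MY_BUCKET", bucket_name)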
I've changed the names to protect the innocent and simplified it, but fundamentally this is pretty close to code I've written, and something like this may be used in our end product. To me this is highly readable: I'm creating a Lambda for something called Service ABC, and it'll be a Python 3.6 function with 128 MB of memory. I've cheated a bit, as environment and lambda_role were defined above in code I've cut out, but you can see clearly it's creating a Lambda function. Then I create a bucket and attach permissions to that bucket that allow the Lambda to perform various read/write operations on it, and finally I add an environment variable to the Lambda so it knows the name of the bucket to interact with.
I could do most of this in Terraform too and it might be readable there as well with modules and the like, but once I get beyond this into ifs and loops, I can just write Python or JS in a much nicer, more readable and more reusable way.
It saves me (and the rest of my team) having to learn a new DSL, and I can leverage existing skills (hopefully, or at least encourage people to learn really useful new skills, which is still great).
When I do want to do something complicated, I can do it. I might end up with just enough rope to hang myself, but then that's on me if I write something horrific and unmaintainable, just as is the case with application development. It should not be on the tool to decide how I write my IaC; it should be on me and my team to make sure that what we have is good, correct, reliable, etc.
If at some point we decide to change what we're using, then assuming whatever we're changing to can be written in JS, I don't have to restart everything. Yeah, I'll potentially have to change a lot, but I won't have to start again. Or I may decide to start again and base my new tool off the old one; in the places where it was good enough, copy and paste can save me time (alright, this isn't always a good thing).
Now I know that this isn't the case for very many tools at the moment; the only other thing I've seen that can do this at present is Pulumi, which I discounted for my use case because of the price. I predict, however, that a lot of things will go this way in time for the reasons I've listed here. Alright, it's more complicated for the writers of the tool to allow people to write Python or TypeScript or Java or whatever, but AWS have proved it's possible here.
If (as was my case) I'm taking over an existing codebase, then assuming that codebase is in one of the language options, I can leverage it. If I decide to go down a Terraform route I've got to start all over again, and I need to rethink all of the existing complexities (that might not be a bad thing, but if it's complex enough it'll be a huge pain and time sink).
If I really want configs, I can write my code to read directly from configs and then just update the configs in the future. I could even have configs that live on another server or in another code repository if I really wanted. It really does give me the best of both worlds, and I've certainly done this in my use of it. I'm not against configs, far from it; I just think code is always going to be better when it comes to something complex.
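To make the "best of both worlds" point concrete, here is a minimal sketch of the config-plus-code approach, using only the standard library. Everything in it is illustrative: the config layout, the `LambdaSpec` class and `load_lambda_specs` are assumptions for the example, not part of CDK or of the project described here. The defaults of 128 MB and 30 seconds simply mirror the Lambda example earlier in the article.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LambdaSpec:
    name: str
    memory_size: int
    timeout_seconds: int
    needs_bucket: bool


def load_lambda_specs(config_text: str) -> List[LambdaSpec]:
    """Turn a JSON config into one spec per enabled service, applying defaults."""
    config = json.loads(config_text)
    defaults = config.get("defaults", {})
    specs = []
    for name, settings in sorted(config["services"].items()):
        # An ordinary if: services switched off in the config are skipped.
        if not settings.get("enabled", True):
            continue
        specs.append(
            LambdaSpec(
                name=name,
                memory_size=settings.get("memory_size", defaults.get("memory_size", 128)),
                timeout_seconds=settings.get("timeout", defaults.get("timeout", 30)),
                needs_bucket=settings.get("bucket", False),
            )
        )
    return specs


# Hypothetical config; in practice this could live in its own file or repo.
EXAMPLE_CONFIG = """
{
  "defaults": {"timeout": 30},
  "services": {
    "service-abc": {"memory_size": 256, "bucket": true},
    "service-def": {"timeout": 60},
    "service-old": {"enabled": false}
  }
}
"""

# Inside a stack, a plain loop would then create the resources, e.g.
# one aws_lambda.Function per spec and a bucket only if spec.needs_bucket.
for spec in load_lambda_specs(EXAMPLE_CONFIG):
    print(spec)
```

The config stays simple enough for anyone to edit, while the defaults, the loop and the conditions live in ordinary Python that can be tested like any other code.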
Experience with CDK
I won't go into too much detail about how the project I've been running has gone, because that's not the point of the article. However, I will say that after a year we're still using it, more people have gotten involved in the project and they like it too, and we're still actively developing solutions with it. It's quite a bold move to use CDK instead of Terraform (given the disadvantages section, and some of them are big), but I still think it's one that will pay off in time.
Would I use CDK again in the future? Well, it depends on the requirements of the project. If they're even remotely thinking about multi-cloud then sadly no. If they're fixed on the AWS route then I'd certainly think about it. For projects that are AWS based I think it will eventually be the go-to tool of choice: AWS put so much effort into this, and I really believe the disadvantages section will shrink over time. Don't be surprised if in a year's time this article is just not relevant any longer (or, if I have the time and inclination, gets a big overhaul).
Conclusion
Fundamentally, when it comes down to it, I'm from a development background and I like writing code, I admit this. I know that there are other folks who come from an ops background who are more used to thinking of infrastructure in terms of UNIX configs and the like, and I get that. My personal opinion is that people trying to find their way in DevOps who are weak in either development or operations are going to get left behind by those who can do both. Part of the DevOps movement, I would hope, is that we're moving to a world where people can do both, and this should be encouraged. I had to learn my way around infrastructure in my transition to a Platform engineer/DevOps engineer/SRE (or whatever you want to call it, I do not wish to get into that debate). I'd strongly encourage those with an ops background to learn Python or something like that; you'll thank yourself for it later, I promise (maybe I should write a "Learning Python for those with an Operations Background" to help, hmm).
I really truly hope other tools go this way, particularly open source tools (if I've missed any please ping them in the comments, I'd love to hear about them). It's a bigger challenge for those writing the tools but I think the benefit is worth it. AWS really seem to have shown us a potentially better way of life here, fingers crossed anyway.
Top comments (2)
HashiCorp, makers of the Terraform IaC tool, now have a version of AWS's CDK that targets multi-cloud, so GCP and Azure too. Check it out: CDK for Terraform.
Yup, be interesting to see if this takes off. I haven't looked at it at all and would be curious if it suffers from similar fates to the above?
NB: from the article (easy to miss these things I know):
I still suspect AWS CDK is more mature than CDK for Terraform.