Danny Reed

Posted on Apr 27, 2022 • Edited on Jul 22, 2022

Self-Provisioning Runtimes & Serverless DX

#serverless #cloud #dx

Introduction

Developer experience (DX) has come a long way since the punch card days, but in the world of serverless development, we're actually approaching a significant regression in DX.

Some of the practices and principles that unlock the power of serverless also make developer experience worse. Self-provisioning runtimes are the solution, and that's only the beginning of their value.

TL;DR:

Using serverless best practices makes for a poor DX
We should care about that because it costs us both money and happiness
We should pursue self-provisioning runtimes for better DX (and many other reasons)

Serverless best practices push us into bad DX

In a very general sense, that heading is wrong. Serverless affords us the luxury of spending very little time on "undifferentiated heavy lifting" and more time writing the "special sauce" code that makes our projects distinct and competitive. That, however, is just one metric that contributes to the overall DX.

Consider these best practices that double as DX detractors:

Infrastructure as Code

IaC is critical to the successful management and evolution of serverless applications. I think the benefits of IaC are generally well understood, so I'll forgo that argument here.

Let's consider CloudFormation. Perhaps you've experienced the very long dev loop involved with writing infrastructure templates. You get an error, make a fix, push to develop, wait, repeat. I think it's fair to say that IaC is hard because of the long dev loop and the challenges of "config languages" like YAML and JSON.

IaC use cases demand a lot from configuration-oriented languages. We aren't just configuring, we are also defining. We want maps, string manipulation, !Ref-ing, and lots of other things that are a stretch for "config." Sometimes even simple things like concatenation can wind up being tricky to read:

!Join
  - ''
  - - 'arn:'
    - !Ref AWS::Partition
    - ':s3:::elasticbeanstalk-*-'
    - !Ref AWS::AccountId
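
For contrast, here is roughly the same concatenation in a general-purpose language. This is only a minimal sketch: buildBeanstalkBucketArn is a hypothetical helper, and the partition and account ID are passed in as plain strings instead of being resolved by CloudFormation at deploy time.

```javascript
// Hypothetical helper: builds the same ARN pattern as the !Join above.
// AWS::Partition and AWS::AccountId become ordinary function arguments here.
function buildBeanstalkBucketArn(partition, accountId) {
  return `arn:${partition}:s3:::elasticbeanstalk-*-${accountId}`;
}

console.log(buildBeanstalkBucketArn('aws', '123456789012'));
// prints "arn:aws:s3:::elasticbeanstalk-*-123456789012"
```

The account ID above is a placeholder value, not a real account.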

IaC templates are slow and difficult to develop, but because we believe in the benefits of IaC, we dutifully shoulder that burden.

(Don't worry, CDK fans. We'll consider CDK in Appendix A.)

Configuration Over Code

"Configuration over code" follows from the maxim "code is a liability": we should prefer configuring a general-purpose service over writing custom code that we then have to maintain.

One example of this is using Step Functions instead of Lambda. Based on our practice of "configuration over code" we should (generally) use Lambda only when Step Functions can't reasonably be used instead.

The depressing ramification of "configuration over code" is that now business logic is infrastructure, too.

If YAML was rough for infrastructure, imagine how bad it is at business logic? Whether you use YAML, JSON, or a domain-specific language (DSL), defining business logic as config is a nightmare:

No debugger
Almost no use of IDE tools like IntelliSense
The available linters aren't great
IaC-defined business logic doesn't read at all like a program

We've reset the DX clock by decades.

Look at this (silly) number classifier written in Amazon States Language:

{
  "States": {
    "ClassifyNumber": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.value",
          "NumericLessThanEquals": 100,
          "Next": "IsSmallNumber"
        },
        {
          "Variable": "$.value",
          "NumericGreaterThan": 100,
          "Next": "IsLargeNumber"
        }
      ]
    },
    "IsSmallNumber": {
      "Type": "Pass",
      "Next": "SuccessState"
    },
    "IsLargeNumber": {
      "Type": "Pass",
      "Next": "SuccessState"
    },
    "SuccessState": {
      "Type": "Succeed"
    }
  },
  "StartAt": "ClassifyNumber"
}

Unless you're an ASL guru, I trust you found that difficult to read and quite verbose. Now compare it to this code:

function classifyNumber(num) {
    if (num > 100) return 'Large';
    return 'Small';
}

I admit that's not a realistic program, but we can see the extra layer of developer effort required to transform simple logical statements into a highly specialized domain-specific language (DSL).
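
To make that translation layer concrete, here is a minimal sketch of the work a developer currently does by hand: turning a list of threshold rules into a state machine definition in the style of the ASL above. The toStateMachine function and the rule shape are hypothetical and exist only for illustration; the output uses ASL's Choice, Pass, and Succeed states with a Default branch.

```javascript
// Hypothetical translator: converts ordered threshold rules into an
// ASL-style state machine definition.
function toStateMachine(variable, rules, fallback) {
  const states = {
    ClassifyNumber: {
      Type: 'Choice',
      // One Choice rule per threshold, evaluated in order.
      Choices: rules.map(rule => ({
        Variable: variable,
        NumericGreaterThan: rule.greaterThan,
        Next: `Is${rule.label}Number`,
      })),
      // Taken when no rule matches.
      Default: `Is${fallback}Number`,
    },
    SuccessState: { Type: 'Succeed' },
  };
  // Every label gets a Pass state that hands off to the success state.
  for (const label of [...rules.map(rule => rule.label), fallback]) {
    states[`Is${label}Number`] = { Type: 'Pass', Next: 'SuccessState' };
  }
  return { StartAt: 'ClassifyNumber', States: states };
}

const definition = toStateMachine(
  '$.value',
  [{ greaterThan: 100, label: 'Large' }],
  'Small'
);
console.log(JSON.stringify(definition, null, 2));
```

A dozen lines of input-to-output mapping for a one-line if statement is the gap in question.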

(Don't worry, Workflow Studio fans. I'll address that in Appendix B.)

The Costs of Bad DX

Gap-Closing

We might think very highly of ourselves for accepting these tremendous burdens in order to achieve a philosophically consistent outcome. This sort of digital asceticism isn't doing us any favors.

One metric that contributes to DX is:

How much distance is there between the developer's intent and their code?

Another way to phrase it is:

How different is the pseudocode from the final code?

I think it's safe to say that nobody has ever pseudocoded something that looked remotely similar to the ASL definition we saw above. So there's distance there -- there's distance between intent and code, and the developer has to close that gap. Doing so is not free:

Gap-Closing Costs

Bugs introduced while transforming intent into final code
Higher maintenance costs due to unintuitive/unreadable codebase
Developer time
Training/ramp-up costs (especially with DSL's)
Developer morale, retention, etc.

Is "developer morale" a stretch? Consider this: I personally dislike JavaScript for various reasons, but it's one of my go-to languages. Why? I'm productive with it. We developers crave productivity -- it boosts our work-satisfaction and perceived effectiveness. When we have to use technology with poor DX, we feel less productive/effective/satisfied and that is a cost we shouldn't ignore.

Recruiting

You will pay more to find and retain a developer who is thoroughly versed in an obscure DSL than you will pay to hire a developer skilled in general-purpose languages like TypeScript, C#, or Python.

If you want to make life easier when it comes to recruiting, don't make your stack special. Make it normal.

Have Your Cake, Eat it Too: Self-Provisioning Runtimes

I first came across the term "Self-Provisioning Runtimes" on this episode of the excellent Serverless Chats podcast by Jeremy Daly. In this episode he interviews Shawn (swyx) Wang and a good portion of their discussion centers on the topic. Not knowing what self-provisioning runtimes (let's call them SPR's, OK?) were, I listened as they began to describe how, essentially, you just write your business logic, and then the system looks at your code, figures out what infrastructure you need, and then runs it.

Swyx poses the idea this way:

If the Platonic ideal of Developer Experience is a world where you “Just Write Business Logic”, the logical endgame is a language+infrastructure combination that figures out everything else.

Jeremy amusingly described his reaction to reading this sentence for the first time:

I put my arms out like this, lights light up, music starts playing, doves fly out from behind me. I'm like, "Yes! Yes! That. Why do more people not get that?"

I had a pretty similar reaction.

This is fake code that doesn't work with any SPR, but it gives you the idea of what we're going for:

const uploadEndpoint = spr.postEndpoint('/upload');

// Upload handler
uploadEndpoint.onUpload(file => {
    spr.store(file)
       .wait(Duration.Days, 14)
       .archive(file)
       .wait(Duration.Years, 1)
       .delete(file);
});

Let's imagine that this snippet gets processed by the SPR, and it spits out an API Gateway, a Step Functions state machine, and an S3 bucket with some lifecycle rules.

The distance between intent and code is extremely short. There's no accompanying infrastructure template to write. We just write business logic, and the SPR figures out how to architect that.
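
How might an SPR "figure out" the architecture? One possible approach, sketched below under heavy assumptions, is for the fluent calls to record intent instead of executing anything, so that a planner can inspect the recorded steps and decide which resources they imply. Workflow, planResources, and the resource names are all hypothetical; no real SPR is being described.

```javascript
// Hypothetical sketch: a fluent API that records intent instead of running it.
class Workflow {
  constructor() {
    this.steps = [];
  }
  record(op, details) {
    this.steps.push({ op, ...details });
    return this; // returning `this` is what makes the calls chainable
  }
  store(file) { return this.record('store', { file }); }
  wait(unit, amount) { return this.record('wait', { unit, amount }); }
  archive(file) { return this.record('archive', { file }); }
  delete(file) { return this.record('delete', { file }); }
}

// Naive planner: maps recorded operations to the kinds of resources they imply.
function planResources(steps) {
  const resources = new Set();
  for (const step of steps) {
    if (['store', 'archive', 'delete'].includes(step.op)) {
      resources.add('object storage bucket');
    }
    if (step.op === 'wait') {
      resources.add('workflow orchestrator');
    }
  }
  return [...resources];
}

const workflow = new Workflow()
  .store('report.pdf')
  .wait('days', 14)
  .archive('report.pdf')
  .wait('years', 1)
  .delete('report.pdf');

console.log(planResources(workflow.steps));
// → ['object storage bucket', 'workflow orchestrator']
```

A real planner would have far more to decide (for example, whether long waits are better served by storage lifecycle rules than by an orchestrator), but the record-then-plan split is the core idea.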

All Your Practices Are Belong to the SPR

Developers no longer need to subject themselves to the pain of writing business logic in config languages, deal with the quirks of YAML, or learn obscure DSL's. The SPR knows all the best practices and will create the best-fit architecture for the business logic at hand.

DX Regained

Now that we're back to writing business logic in programming languages like TypeScript, C#, Python, or whatever you want, you can use your debugger, your IDE tools, your language skills, etc.! This is the DX we want, and it's the DX we should push for.

It's not because we can't handle the pain, but because we are actually worse and more expensive developers when we are in pain.

Rainbows and Unicorns

You're saying, "great, but it's made up, right?" Yes, kind of. Why doesn't this exist? Swyx does a pretty good round-up on the contenders out there in his article. The most mature one is Serverless Cloud, and you should definitely check them out.

I think we need to see more attempts, loftier goals, and more folks throwing energy at this. I think Serverless Cloud is awesome, but I have a mercilessly ambitious wishlist, which I'll share in a future article.

Appendix A: CDK

I think of CDK as a stepping stone on the way to where we want to go.

The CDK is great. I found it long before I heard of SPR's and I thought it was the most epic thing I had ever seen. The reasons I liked it basically boiled down to DX pain-relief.

No more YAML/JSON
Constructs let me reuse stuff, leverage abstractions
IntelliSense works
I can catch many errors before running my long pipeline
Easy infrastructure diffs
Productivity

Ben Kehoe and others have raised concerns with CDK for various valid reasons, but I was so hungry for a better DX I was disposed to try to defend it against all the arguments, no matter how good they seemed.

The CDK does help with DX, and that's good, but it's not solving the whole problem, and whatever does solve the whole problem will make CDK-like tools easy to abandon.

Why is CDK just a stepping stone?

For one thing, you still have to write infrastructure separately from your code, so we haven't achieved swyx's "Platonic ideal". Look at this Lambda definition. It clearly points to where the business logic lives, over in lib/lambda/myLambdaFunction. This is an improvement over doing things in YAML/JSON, but it's just a way to do the same thing with less pain.

const myLambda = new lambda.Function(this, 'my-lambda-name', {
    // See? your business logic lives separately!
    code: lambda.Code.fromAsset('lib/lambda/myLambdaFunction'), 
    runtime: lambda.Runtime.NODEJS_14_X,
    architecture: lambda.Architecture.ARM_64, 
    handler: 'index.handler',
    functionName: 'my-lambda-name',
});

What about CDK-defined step functions?

Let's look at a CDK-defined state machine that I created recently. It runs an Athena query, then uploads the results to an external web service using Lambda.

Here, we did the following:

Used a programming language
Practiced "infrastructure as code"
Practiced "config over code"

Home run? Almost. Here's why it's not quite there:

We aren't describing business logic; we are still describing infrastructure.
The developer still has the burden of translating business logic into "state machine form" and then defining the state machine with CDK. We haven't closed the "intent to code" gap yet.

const artifactStateMachine = new stepFn.StateMachine(this, `StateMachine-${artifact.viewName}`, {
  // Step 1: Start the Athena query
  definition: new tasks.AthenaStartQueryExecution(this, `Run Athena Query: ${artifact.viewName}`, {
    queryString: `SELECT * FROM ${artifact.viewName}`,
    resultConfiguration: {
      outputLocation: {
        bucketName: dataPipelineBucket.bucketName,
        objectKey: artifact.viewName
      },
    },
    queryExecutionContext: {
      databaseName: 'MyAthenaDB'
    },
    workGroup: 'MyAthenaWorkGroup',
    integrationPattern: IntegrationPattern.RUN_JOB,
  // Step 2: Invoke uploader Lambda
  }).next(new tasks.LambdaInvoke(this, `Upload to destination: ${artifact.viewName}`, {
    lambdaFunction: myLambda,
    inputPath: '$.QueryExecution.ResultConfiguration',
  })),
  role: stateMachineRole,
});

Here's an example of what we might do in an ideal world, with an SPR:

const queryResult = await data.query(viewName);
await externalService.upload(queryResult);

The SPR would "compile" that down into something like what we saw in the CDK example above.

Notice that the developer doesn't need to:

Know what a state machine is or how to define one.
Know if a state machine is the best choice for this workload.
Write the Lambda code separately.
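
A side benefit of keeping the logic in plain code is that it can run locally against stand-ins, with a debugger attached. The sketch below assumes the two dependencies are injected; exportView, fakeData, and fakeService are hypothetical names, and the shapes of data and externalService are guesses made only for illustration.

```javascript
// Business logic written against two injected dependencies.
async function exportView(viewName, data, externalService) {
  const queryResult = await data.query(viewName);
  await externalService.upload(queryResult);
  return queryResult.length;
}

// In-memory stand-ins so the logic can run on a laptop with no cloud resources.
const fakeData = {
  query: async viewName => [
    { view: viewName, row: 1 },
    { view: viewName, row: 2 },
  ],
};
const uploaded = [];
const fakeService = {
  upload: async rows => { uploaded.push(...rows); },
};

const done = exportView('daily_orders', fakeData, fakeService).then(count => {
  console.log(`uploaded ${count} rows`);
  return count;
});
// prints "uploaded 2 rows"
```

In this picture, the SPR would swap the stand-ins for whatever managed services it chose to provision.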

Appendix B: Step Functions Workflow Studio

AWS went to the trouble to make us a really nice Workflow Studio for Step Functions, so why doesn't that count as a great DX?

Fair point. Even better when you use the YAML/JSON export features to create your ASL files, and then reference them from your CloudFormation template using the AWS Serverless Application Model (SAM) transform:

Resources:
  MyStepFunction:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: aslDef.yml # <-- behold, the generated ASL file
      Events:
        # [omitted for brevity]

This is an improvement on writing those things by hand, but here are my challenges with this arrangement:

You can use the Workflow Studio to make your ASL file, but you still have to dig into it manually when it comes to merge conflicts or other version control-related tasks.
You can't "read" the logic of the state machine easily in ASL format, so you have to copy/paste it into the Workflow Studio to see what it really does. This means that, as a developer, you can't really understand your code by looking at the code in the repo anymore.
Copying and pasting YAML back and forth from the browser to the IDE seems error-prone at worst, and clunky at best.

Feedback

Thank you for considering my lengthy thoughts on this topic. I'm very excited to see what can be done in the wonderland of SPR's, and I'm hoping to push out some more articles that focus on defining what an ideal SPR looks like, what characteristics it needs to have, and what obstacles are in the way.

Please share your thoughts on the topic, relevant resources, or ideas for where to go next in my exploration of SPR's.

Top comments (6)

Daniel Fyhr • Apr 29 '22

Great post! Thanks for sharing. "If YAML was rough for infrastructure, imagine how bad it is at business logic?" had me laughing. The combination of CloudFormation deployment times and the .$s etc can be really tedious.

One thing that's scary for me with abstractions like this is what if an edge case isn't supported? With the raw language you know you have the full set of features. And the edge cases might not be that rare. What do you do when (if?) you hit that bump?

Danny Reed • Apr 29 '22

Hi Daniel, thanks for sharing your thoughts!

I'm not sure if I'm understanding your question properly, so if my answer sounds like nonsense please help me out :)

Part of the reason we want SPR's is to enjoy the benefit of abstractions, but yeah, it's always scary to embrace a new abstraction (or set of them) because we know we are giving up control, and potentially the ability to account for edge cases.

I like Swyx's comparison to programming languages. We used to have to manage memory ourselves, but now we have garbage-collected languages that abstract away the burden for us. Yeah, we lose some control, and even some efficiency, but in the end, we finally embrace the productivity and safety (and security, sometimes) of the abstraction and accept the minor performance tradeoffs.

Those tradeoffs then become someone else's optimization problem (which is probably best!). Someone over on the C# team gets to be the one to optimize garbage collection to be smarter, and I don't have to do it.

Back to SPR's -- I think the key to successful abstractions will be to provide reasonable, overridable, defaults. An SPR won't be successful (or even usable, in some cases) if it doesn't provide a mechanism to peel back a layer and make some fine adjustments and/or some extensibility model that allows for custom abstractions to cover specialized use cases.

Meanwhile, having provided developers with the tools to do what they need to do, whoever is responsible for developing the SPR becomes responsible for making the abstractions smarter, thereby gradually reducing the need for "special" code to customize the behavior of the available abstractions.

In a future post I want to explore the ideal characteristics of an SPR and how it would address concerns like this one. I'd be happy to hear any other thoughts you have on the topic, including other potential "bumps" that an SPR might face.

Daniel Fyhr • May 1 '22

Hello again!

The comparison with programming languages is good. Abstractions like that are exactly what I would be looking for in an SPR.

I came to think of another comparison. It might not make sense. What if instead of creating a higher-level programming language, we are creating an ORM but for the cloud? Going into this I should say that I am not a big fan of ORMs. Day 1 the productivity gained is great. Fast forward a couple of years and you have slow queries and don't know where they come from. Maybe you should have spent some time reading, learning SQL, and modeling instead.

As I said, I'm not sure if this comparison makes sense, or if it's a fair one to make. Anyway, this is something I would look out for in an SPR.

Danny Reed • May 3 '22

I haven't thought of comparing it to an ORM, but I think that's a fair comparison of one abstraction tool to another.

While ORM's may not always output an optimized query, they usually do provide some mechanism for hinting (like Entity Framework's .Include()). I think abstraction tools that offer that productivity boost but still provide a hinting/customization/override mechanism are the least risky to adopt.

Rak • Sep 14 '22

Nice post Danny, couldn't agree more about the need for an SPR... my team and I have gone through similar realizations with config and IaC.

Would you be interested in checking out our open-source SPR framework and giving us feedback? Would love to hear your thoughts on how to improve it, especially as DX has been a core focus. nitric.io

Danny Reed • Sep 20 '22

Hi Rak, thanks for reading and sharing about your experience. I'm very glad to know about nitric as it is one I haven't run across yet. Will be taking a look for sure!