If you have ever deployed or maintained solutions on any of the largest cloud providers (AWS, GCP or Azure),
you may have read or seen guidelines for how to create a well-architected, performant and flexible cloud-native solution. At the same time, these cloud providers, to varying degrees, almost encourage you to make a bit of a mess.
If you use a major cloud provider, one or more of these statements may be true for you:
1. You or your team have set up and managed cloud resources by clicking around in the cloud provider's web interface. There may not be much structure to it, and you may not have full control over who does what with which resources, or what it all costs.
2. You have heard that cloud solutions can be a well-architected nirvana - everything automated, multiple production updates per day, optimized use of resources that cost a fraction of what they did in the pre-cloud days - and you have embarked on a journey to get there and leave the clicking in the web interface behind.
3. You have reached cloud nirvana, in a sense. The work is far from done, and it never will be.
Most people working with the cloud are probably at or past point 1, and most likely have not reached point 3. It is hard work, and the journey almost always starts at point 1, so one has to undo some of the habits established there.
But why is that? Why can't clicking in the web interface be the first step towards cloud nirvana, and continue to be part of that journey?
I will mostly refer to AWS when I provide examples here, but to varying degrees this applies to other cloud providers as well.
One of the key elements when setting up cloud solutions that are well-architected and well maintained is to put appropriate labels or tags on different resources.
For most resources in AWS, tagging is one of the last views in a long sequence of views, or an area you can optionally expand - you are not required to enter anything.
There is usually no guidance or instruction on what to put there.
Why is such a key element in avoiding a mess not handled better? You should be able to define and store tag/label templates to use and re-use for whatever resource you set up. Am I setting up test resources for project XYZZY? Then I should be able to pick the XYZZY project template, with things pre-defined and pre-filled for the test environment, making it much easier to do the right thing.
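As a sketch of what such a template mechanism could look like - the project name "XYZZY" (borrowed from above), the tag keys and the values are all illustrative, and this is not an existing AWS feature:

```python
# Sketch of a tag-template helper: a pre-defined project template is merged
# with per-environment defaults, so every new resource gets a consistent tag
# set. All names and values below are hypothetical examples.

TAG_TEMPLATES = {
    "XYZZY": {
        "Project": "XYZZY",
        "CostCenter": "1234",   # hypothetical cost center
        "Owner": "team-xyzzy",  # hypothetical owning team
    },
}

ENVIRONMENT_DEFAULTS = {
    "test": {"Environment": "test", "AutoShutdown": "true"},
    "prod": {"Environment": "prod", "AutoShutdown": "false"},
}

def tags_for(project: str, environment: str) -> dict:
    """Combine a project template with environment defaults into one tag set."""
    return {**TAG_TEMPLATES[project], **ENVIRONMENT_DEFAULTS[environment]}

# The resulting dict could then be passed along to whatever call actually
# creates the resource, so the tags are set at creation time.
print(tags_for("XYZZY", "test"))
```

The point is not the code itself, but that picking "project XYZZY, test environment" once should be enough to get the right tags everywhere.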
In fairness to AWS, there is nowadays an option to group existing resources and apply tags/labels to them. But that is far from as effective as setting them directly at resource creation.
It is easier to apply tags/labels to multiple resources if you use CloudFormation to automate resource creation, although there is room for improvement there as well.
With CloudFormation you can automate a lot of cloud resource set-up, and it works well. But it is often not as easy as setting something up through the web interface.
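As a minimal illustration, a CloudFormation template can declare tags directly on a resource, so they are applied at creation time (the tag values here are hypothetical; tags passed to the stack itself at creation are also propagated to supported resources):

```yaml
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      Tags:
        - Key: Project
          Value: XYZZY
        - Key: Environment
          Value: test
```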
In AWS there was, until somewhat recently, no way of incorporating what you had set up via the web interface into CloudFormation or other infrastructure-as-code automation tools. Instead, one had to rely on third-party tools like Console Recorder for AWS, or re-create everything from scratch.
Even with import capabilities in CloudFormation, it is not as easy as it could be, and it is not incorporated into the web interface when you set up various resources.
Various tools make it easier to iterate on solutions for machine learning, analysis and other areas - why not for infrastructure-as-code set-up and automation?
Various wizards make it easy to set up some of the services, but along the way additional resources may also be created - IAM policies and roles, S3 buckets, security groups and other things. It can become tricky to keep track of them all if there is a non-trivial number of resources and not enough planning ahead.
All public cloud providers emphasize that they take security seriously and refer to it as "Security Is Job Zero", among other things.
And AWS, GCP and Azure provide a lot of good things here - it is certainly possible to make things secure. The guiding principle is least privilege: only grant the permissions that are needed.
However, this can be quite hard! Knowing which service API calls should be allowed for what services and under what circumstances can be tricky and time-consuming to set up.
At the same time, there are plenty of pre-defined managed policies (in AWS's case) which are easy to pick - but they often grant more permissions than required.
And how often are these trimmed down to the minimum needed afterwards, once everything is working? Probably very seldom.
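To make the difference concrete, here is a small Python sketch contrasting a broad policy of the kind many pre-defined managed policies resemble with a least-privilege variant, plus a toy lint for wildcards. The policy contents and the bucket name are illustrative examples, not recommendations:

```python
# A broad policy: full access to every S3 action on every resource.
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    ],
}

# A least-privilege variant: only the calls this workload actually makes,
# scoped to one (hypothetical) bucket.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::xyzzy-test-data/*",
        },
    ],
}

def has_wildcards(policy: dict) -> bool:
    """Toy lint: flag statements whose actions end in '*' or whose
    resource is a bare '*'. Real policy analysis is far more involved."""
    for stmt in policy["Statement"]:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        if any(a.endswith("*") for a in actions) or "*" in resources:
            return True
    return False

print(has_wildcards(broad_policy))   # True
print(has_wildcards(scoped_policy))  # False
```

Even a crude check like this could nudge people towards the scoped variant - but today that nudge is largely absent at the point where policies are attached.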
Firewall rules (e.g. security groups in AWS) may fall into this category as well, though possibly to a lesser extent.
Even tutorials and blogs from AWS and others in many cases grant far too many permissions "to keep things simple" - and leave it at least partially to the reader to sort out.
There have at least been some improvements that help cloud users avoid shooting themselves in the foot too much, but these usually do not go beyond setting up a single resource - not a combination of resources.
What about an ability that, for a limited time, tracks exactly which service calls and resources are accessed, and then trims down the IAM policies and firewall rules based on that?
Any tool that takes a whole solution and helps set the minimum permissions required for it to work would be a great asset - ideally incorporated into the cloud provider's web interface.
With public cloud providers, the cost of the resources and services you use tends to follow a pay-only-for-what-you-use model. Quite often there are also multiple pricing dimensions for each and every type of service/resource, and understanding or predicting what things will cost is challenging.
This is also an area where there could be more upfront cost estimates before something is created. GCP has some support for this and can provide cost estimates when you create a resource in the web interface - something that tends to be missing in AWS.
Other services, like Terraform Cloud, also have some level of cost estimation for AWS, GCP and Azure.
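As a toy illustration of why estimation is hard, even a single pay-per-use function is billed along several dimensions at once. All rates below are hypothetical placeholders, not actual prices from any provider:

```python
# Toy cost model for a pay-per-use function with three pricing dimensions.
# All rates are hypothetical placeholders - real pricing differs by provider,
# region and tier (free tiers, volume discounts, etc.).

RATE_PER_MILLION_REQUESTS = 0.20  # hypothetical
RATE_PER_GB_SECOND = 0.0000167    # hypothetical
RATE_PER_GB_EGRESS = 0.09         # hypothetical

def monthly_estimate(requests: int, avg_ms: float, memory_gb: float,
                     egress_gb: float) -> float:
    """Combine request count, compute time and data transfer into one estimate."""
    request_cost = requests / 1_000_000 * RATE_PER_MILLION_REQUESTS
    compute_cost = requests * (avg_ms / 1000) * memory_gb * RATE_PER_GB_SECOND
    egress_cost = egress_gb * RATE_PER_GB_EGRESS
    return request_cost + compute_cost + egress_cost

# 5M requests/month, 120 ms average at 0.5 GB memory, 40 GB of egress:
print(round(monthly_estimate(5_000_000, 120, 0.5, 40), 2))
```

Three dimensions for one service is on the simple end - and with dozens of interconnected services, doing this by hand quickly becomes unmanageable, which is why upfront estimates in the interface would matter.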
I like this quote:
The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks. To the extent that we make it easy to get into trouble we fail.
Rico Mariani, MS Research MindSwap, Oct 2003.
In many ways, public cloud providers have succeeded in making the cloud accessible to most of us. They also provide quite good advice and best practices for building great cloud-native solutions.
But they have still not done enough to make everyone win and to make the winning practices the easy path you can simply fall into.
Instead, there may be a climb uphill to get to that sweet cloud nirvana, and if you are not careful, you may fall from that hill and have to climb up again.
I do not expect it to be easy to create pits of success for most, if not all, cloud customers. But it should certainly be more of a goal than it is today.