Eugene Cheah for Uilicious

How do you manage deployment configs? (Especially large scale cloud agnostic ones)

Context

For the past few weeks, I have been using a crazy mix of the following, all within a git repository (infrastructure as code), as we migrate most of our infrastructure and workload from one provider to another.

  • terraform
  • nodejs
  • bash scripts
  • kubernetes yaml

With lots and lots of json and yaml configuration files, mostly generated from one system to be piped to another (about 10k lines' worth).


terraform, which is commonly sold as a single solution for everything (it isn't), rapidly broke apart for us once we started doing deployments outside the usual AWS / GCP, and once we ran into its poor support for kubernetes (which we use heavily).

So we started patching up the missing orchestration with nodejs + bash scripts.

And now we have a giant soup of scripts updating other scripts' configuration and applying them. Not exactly the most "elegant" solution.


The question: How do you do it?

So wondering out loud, for those who do really large scale deployments, especially with a small team.

Is it normal to always throw in the towel at the end, and code up a custom configuration management script to handle all this chaos? If so, do you normally do it in bash, in a custom application (like Java), or in some other CLI scripting language?

Alternatively, is it normal to just grow a really large sysadmin team, each managing a subset of the system?

It feels like I am reinventing the wheel on these things, yet I somehow feel like there should be a solution out there for this.

Sidetrack: a large part of me just feels like redoing terraform in nodejs out of frustration, to support my use cases.


Clarification on scale

I do believe there are multiple cloud-specific offerings out there. The reason we do not use any of them is that we currently run on the following list of providers.

  • gcp
  • aws
  • digital ocean
  • linode
  • alicloud
  • hetzner
  • bare metal on-premise stuff

Covering the following regions

  • singapore
  • unitedkingdom
  • germany-frankfurt
  • india-bangalore
  • canada-toronto
  • usa-newyork
  • usa-sanfranc
  • netherlands-amsterdam
  • indonesia-jakarta
  • hongkong
  • taiwan
  • +3 other data centers

The complex web of providers comes in part from the need to support regions where another provider either does not exist, or does poorly performance wise.

All to run UI tests at https://uilicious.com !??!

Top comments (21)

David J Eddy

"...terraform was supposed to be a single solution for everything,..." Nothing is a 'fix all the things' solution. Anyone who tries to sell you on that concept (for anything in life) is either lying or ignorant. Terraform is provisioning, Ansible is configuration management, K8s is container orchestration and management. Each tool has a job it is good at.

"Have university degree, write YAML for a living" - DJE, 2019

"...Is it normal to always throw in the towel at the end and code up a custom configuration management script to handle all this chaos?" - No two systems are alike; every system requires some level of customization. That being said, if your application follows established patterns, the vast majority of the processes are implementable with little customization.

"...It feels like I am reinventing the wheel on these things..." You probably are. :D When I get feelings like this is when I start searching. 99.9% of everything you want to do has been done before, organized, and turned into a design pattern. You just have to find the right pattern and apply it.

"...Sidetrack: a large part of me just feels like redoing terraform in nodejs out of frustration, to support my use cases..." You might want to look at pulumi.com/.

"...does not exist, or does poorly performance wise..." Ah, the real meat of the situation. The need for multi-provider / multi-region is performance. Does your organization currently have acceptable performance metrics codified? SLA, SLO, ROI allowances? I #feel# like unless you MUST have 100% real time low latency communication (tele-conference surgery, air traffic control, et al.) a second or two of latency for a 25% reduction in complexity might be worth looking at.

My mantra when developing anything:
Make it work, Make it right, Make it fast, In that order.

I'd love to sit down and discuss your situation in detail and provide some outside feedback. Sometimes it is hard to see the forest when you are stuck in the weeds.

Eugene Cheah • Edited

"Have university degree, write YAML for a living" - DJE, 2019

Laughing out loud at this - yes, that's what I feel at times now.

"...does not exist, or does poorly performance wise..."

It's not so much about performance on the user side, as the line may imply. A huge miswording on my part; it's more akin to not fitting the requirements.

I might be lacking context on this one. So after a night's sleep and tackling the problem again with a fresher mind, it might be better to phrase it this way.

In general, we have 3 major layers in our infrastructure (at least in the context of this discussion):

  • Proxy layer
  • Testing Browsers
  • Everything Else

The layer which causes the most pain, configuration wise, is the proxy layer. On our "pro plus" testing plan, we allow our users to run UI browser test scripts in a country of their choice, so that they can test IP based geo restrictions/behavior of their servers (we call it our "region selection" feature).

When it comes to the size of these servers, as they are just custom configured secure proxies, they are typically the equivalent of AWS micro to small servers (depending on workload for a region).

But it's where configuration hell starts. For example, alicloud is effectively the only major provider for indonesia, and is not supported in the current version of terraform.

GCP (our main cloud provider) is out of the picture, amusingly in part because their network is too optimized. No matter where your servers physically are, the receiving server thinks the traffic is either from the USA or from the same data center it is in, throwing geo detection out of control.

However, as we slowly scale up the number of regions / countries we support on this layer, from 12 to N, it multiplies the configurations needed for the lower layer.

Moving down to the testing browser layer, this generally runs in 1 of our 2 main GCP clusters. Due to the limitation of selenium servers, this ends up as kubernetes yaml to deploy a group of containers per proxy above. We used to update this configuration by hand, until misconfiguration became an increasingly common mistake caught in testing (we test ourselves!).

So now we are transitioning to generating the configuration, based on output from the "proxy layer" given by either terraform or the cloud provider API (eliminating any possible typo in IP addresses).
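
A minimal sketch of that generation step, in Python for brevity. It assumes the proxy layer is exposed as a Terraform output named `proxy_ips` (a hypothetical name) that maps region to IP, as read from `terraform output -json`. The Deployment names, the image and the `PROXY_HOST` variable are placeholders too, not our real setup. Since JSON is valid YAML, the result can be handed to kubectl as-is.

```python
import json


def browser_deployments(terraform_output: dict, replicas: int = 2) -> list[dict]:
    """Build one Kubernetes Deployment per proxy found in the Terraform output.

    `terraform_output` is the parsed result of `terraform output -json`,
    where every output is wrapped as {"value": ...}.
    """
    # "proxy_ips" is a hypothetical output name: {region: ip address}.
    proxy_ips = terraform_output["proxy_ips"]["value"]
    manifests = []
    for region, ip in sorted(proxy_ips.items()):
        name = f"browser-{region}"
        manifests.append({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name, "labels": {"region": region}},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": name}},
                "template": {
                    "metadata": {"labels": {"app": name}},
                    "spec": {
                        "containers": [{
                            "name": "browser",
                            # Placeholder image, swap in the real one.
                            "image": "example/selenium-browser:latest",
                            # The IP comes from the provisioning output, never typed by hand.
                            "env": [{"name": "PROXY_HOST", "value": ip}],
                        }],
                    },
                },
            },
        })
    return manifests


if __name__ == "__main__":
    # Stand-in for the parsed output of `terraform output -json`.
    sample = {"proxy_ips": {"value": {"singapore": "203.0.113.10", "hongkong": "203.0.113.20"}}}
    bundle = {"apiVersion": "v1", "kind": "List", "items": browser_deployments(sample)}
    print(json.dumps(bundle, indent=2))
```

Piping the printed List into `kubectl apply -f -` is one way to apply it; the point is only that the IP addresses flow from one tool's output into the next tool's input without a human retyping them.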


"...It feels like I am reinventing the wheel on these things..." You probably are. :D

Doing a shout out here on dev.to, because it seems like everywhere I looked it's either Ansible or terraform (or DIY).

You might want to look at pulumi.com/.

Definitely will look into this (thank you!)


I'd love to sit down and discuss your situation in detail and provide some outside feedback. Sometimes it is hard to see the forest when you are stuck in the weeds.

Feel free to DM me directly on Twitter - twitter.com/picocreator or on dev.to

David J Eddy

"...Throwing geo detection out of control..." Can you use the location of the requesting browser rather than GeoIP?

"...common mistake caught in testing (we test ourselves!)..." That is awesome to hear! This practice is often called 'dogfooding'. It is where you 'eat' the thing you 'provide'.

dev.to DM coming your way. :) I look forward to talking with you soon.

Eugene Cheah • Edited

"...Throwing geo-detection out of control..." Can you use the location of the requesting browser rather than GeoIP?

Unfortunately that is subject to the "testing website" implementation >=(

It seems that the majority of websites we are helping test use "ip based" detection as opposed to GPS (probably because the browser will prompt for permission).

It also really gave me lots of insight into how heavily optimized GCP networking is on the lowest level possible, including even BGP, when I deep dived into why this is happening (but that's a huge sidetrack).

Marcos Alano

Did you try some tool like Ansible (I really love Ansible)? You could use it to deploy your infrastructure across multiple cloud providers and multiple regions. I know you would need to write some code, but the deployment will become clearer than Terraform with bash scripts.

David J Eddy

I too enjoy Ansible. Does it do infra provisioning as well as software configuration management?

Marcos Alano

Yes, it does infra provisioning and software configuration.

David J Eddy

Interesting. I will need to do some research into this. Do you have any resources you find especially helpful?

Vinay Hegde

This is something I found very useful to begin with on Ansible

Eugene Cheah • Edited

Thanks for the info. A few of the infrastructure folks I spoke to personally seem to echo similar experiences of trying various tools like terraform and going back to Ansible.

Sure, the initial setup is much more work, but it works and scales well, and it doesn't feel like you're fighting the tool.

Personally, I have never dived too deeply into it, and will look into it more.

Vinay Hegde • Edited

While I've not tried Terraform yet, Ansible is something I set up recently.

I agree there are initial hiccups, but once it is wrapped up, these are a few of the benefits:

  • You can adhere to Infrastructure-as-Code
  • Because it needs a control machine relying solely on SSH, it can be configured on virtually any OS
  • It also eliminates installing / upgrading / maintaining any agent services that can become a point of failure
  • YAML files make it easy for everyone to understand

Once you're done with it, you can also look to integrate it with Rundeck for more visibility via UI & more fine grained controls

derek

Sounds like you need to start working on simplifying the problem before you can start simplifying the solution.

I say that 👆🏽 because your title question was "how to manage deployment configs?" with emphasis on managing the configuration files, which in my opinion in itself is an interesting problem.

But as I read further it sounds more like the real question is... How do you perform multi-cloud deployments with various tooling and tech with a small team?

And... if that's the problem at hand then:

  • Complex solution: spinnaker might be a viable option, but it adds yet another tech layer to learn and use. It definitely solves multi-cloud deployments at scale; the tradeoff is that it is probably not the best fit for a "small team" (small being relative 🤷🏽‍♂️. I have no idea whether you mean 3 or 300 people). Spinnaker can be a job in itself: deploying, maintaining, and administering.

  • Quick solution: Hire or promote more people to be SMEs (subject matter experts) for each respective cloud and tech stack deployment. I say quick because I think of this more as a short term solution to either get things stable or to buy more time for business reasons. But in my opinion, the ROI of manpower to productivity is a bad tradeoff. Plus more cooks in the kitchen and all that... which has high potential of creating more problems, especially over time. I think of this more as a bandaid that depends on circumstances of time and resources.

  • Simple solution: Prune cloud support, prune tooling and technologies. I understand this is wayyy easier said than done--incredibly hard to implement/migrate to... but if you don't foresee an increase in manpower (people count), you would see a huge ROI on people to productivity and quality deployments.

Eugene Cheah • Edited

Hmm might be lack of clarity on my part. It is still mainly a "config" issue more than a multi-cloud issue.

The multi-cloud part just multiplies the problem by forcing multiple tools to be used.

X servers, with their IPs and configs (like certificates), need to be passed along to Y servers in another cloud environment. The permutations and combinations of configuration settings across X*Y*Z would still be a problem (albeit greatly simplified) within a single cloud provider on a larger scale.
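
To put a rough shape on that multiplication, here is a small Python sketch that expands every combination of a few layers and counts them. The axis names and values are made up purely for illustration.

```python
from itertools import product


def expand_matrix(axes: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one settings dict per combination of the given axes."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


if __name__ == "__main__":
    # Made-up values; the real lists are longer.
    axes = {
        "proxy_region": ["singapore", "hongkong", "taiwan"],
        "browser": ["chrome", "firefox"],
        "cluster": ["gcp-a", "gcp-b"],
    }
    combos = expand_matrix(axes)
    print(len(combos))  # 3 * 2 * 2 combinations
    print(combos[0])
```

Adding a single region adds one entry to one list, but it adds a whole slice of combinations to the configuration that has to exist somewhere.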

derek

Ah gotcha...

Yes as aforementioned 👇🏽

"with emphasis on managing the configuration files, which in my opinion in itself is an interesting problem."

So far, "generally" speaking, yaml files with interpolation are a safe suggestion. Otherwise there's always machine learning to predict the interpolation 😆
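
As a minimal sketch of what interpolation could look like, here is one way to do it with Python's standard `string.Template`. The template fields and the default port are invented for the example; any templating tool would do the same job.

```python
from string import Template

# Hypothetical template; the ${...} placeholders are filled in per proxy.
PROXY_CONFIG = Template("""\
proxy:
  region: ${region}
  host: ${host}
  port: ${port}
""")


def render(region: str, host: str, port: int = 3128) -> str:
    """Render the YAML snippet for a single proxy."""
    # substitute() raises KeyError on a missing value instead of leaving a silent gap.
    return PROXY_CONFIG.substitute(region=region, host=host, port=port)


if __name__ == "__main__":
    print(render("singapore", "203.0.113.10"))
```

The strictness is the useful part: a missing value fails loudly at render time rather than showing up later as a broken deployment.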

Yaser Al-Najjar • Edited

Firstly, I have not done a large scale deployment, but this is how we deploy our multi-image app to production and staging servers:
Docker images + AWS ElasticBeanstalk.

Can you make the deployment automated (from your local machine)?
Like writing a script in Python or any other lang (preferably Bash)?

If so, you can use gitlab with env vars to control the deployment flow (configure the deployment).

The gitlab runner will run our deployment script which deploys the new docker image into the staging server automatically (with every commit).
And into the production server on demand (when we click on gitlab job "run" button).

You can have multiple jobs that run in a parallel manner (say for different services).
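
As a rough sketch (not our actual script), the part of a deployment script that reads those env vars could look like this. `DEPLOY_TARGET` and the registry URL are hypothetical names chosen for the example; `CI_COMMIT_SHORT_SHA` is one of GitLab's predefined CI variables.

```python
import os


def deployment_plan(env: dict) -> dict:
    """Decide where to deploy, and which image, from CI environment variables."""
    # DEPLOY_TARGET is a hypothetical variable set per job in .gitlab-ci.yml.
    target = env.get("DEPLOY_TARGET", "staging")
    if target not in ("staging", "production"):
        raise ValueError(f"unknown DEPLOY_TARGET: {target!r}")
    # Falls back to "latest" when run outside of GitLab CI.
    tag = env.get("CI_COMMIT_SHORT_SHA", "latest")
    # Placeholder registry and image name.
    return {"target": target, "image": f"registry.example.com/app:{tag}"}


if __name__ == "__main__":
    print(deployment_plan(dict(os.environ)))
```

The staging job would leave `DEPLOY_TARGET` unset, while the manually triggered production job would set it to `production`, so the same script serves both.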

  • Previously, we did it the exact same way (with different scripts) for bare metal server deployments, but we found AWS ElasticBeanstalk to be a much better option.
David J Eddy

Beanstalk is handy indeed. Love that you are able to leverage GitLab runners for parallel execution!

Charles Z.

Terraform + Ansible can handle anything. My suggestion is to write external Terraform providers to cover infrastructure pieces with whatever language your company uses (you said nodejs above), as that's a realistic expectation to have installed everywhere.

Ansible should be used to ease anything that requires an a -> b -> c flow. Doing this with Terraform is possible, yet I find it's much more digestible at a glance with Ansible, which is why I suggest using them side-by-side.

Think of Terraform as your environment builder, and Ansible as your task-runner to run things in said environment.

It's also important to have things structured well. My current position uses a Makefile across all Terraform projects with well-defined plan/apply keywords that are applicable across them all. Terraform should also have a single "module repo" with nested modules accessed via:

source = "git@gitrepo.fqdn:org/project.git//path/to/thing?ref=commit_hash"
(don't forget to pin your modules!!)
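
One way to keep that honest is a small check in CI. The Python sketch below is only an illustration: the function name and the "looks like a commit hash" rule are my own assumptions, not a Terraform feature. It flags module sources whose `ref` is missing or is a branch name.

```python
import re

# A full 40-character or abbreviated (7 or more) lowercase hex commit hash.
COMMIT_HASH = re.compile(r"[0-9a-f]{7,40}")


def is_pinned(source: str) -> bool:
    """Return True when a module source carries a ?ref= that looks like a commit hash."""
    _, _, query = source.partition("?")
    for param in query.split("&"):
        key, _, value = param.partition("=")
        if key == "ref":
            return bool(COMMIT_HASH.fullmatch(value))
    # No ref parameter at all means the module floats with the default branch.
    return False


if __name__ == "__main__":
    base = "git@gitrepo.fqdn:org/project.git//path/to/thing"
    print(is_pinned(base + "?ref=4f2a9c1"))
    print(is_pinned(base + "?ref=master"))
    print(is_pinned(base))
```

Note that this deliberately strict rule also rejects tags such as `v1.2.0`; loosen the pattern if your team treats tags as immutable.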

I find that those two tools fill 95% of use-cases, and the other 5% are better served with one-off tools anyway.

Dennis Ploeger • Edited

Let me throw in one of my babies here: socko. It's a hierarchical generator for basically anything, but its current focus is on configurations. With it, you can apply a hierarchy to templates, which is great to reduce duplicated configs.

There's a containerized version available, so that you can use it in InitContainers for example.
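
To illustrate the general idea of hierarchical configs only (this is not socko's actual API or implementation, just a generic Python sketch with invented keys): more specific layers override more general ones, so shared values are written once.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(*layers: dict) -> dict:
    """Apply layers in order, from most general to most specific."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


if __name__ == "__main__":
    # Invented layers: global defaults, then a provider, then a single region.
    defaults = {"proxy": {"port": 3128, "tls": True}, "replicas": 1}
    provider = {"proxy": {"tls": False}}
    region = {"replicas": 3, "proxy": {"host": "203.0.113.10"}}
    print(resolve(defaults, provider, region))
```

Only the differences live in the provider and region layers, which is where the reduction in duplicated config comes from.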

Juan Carlos

NimScript.

David J Eddy

Clarify please.

Juan Carlos • Edited

Basically things that are explained here: nim-lang.github.io/Nim/nims.html
:)