DEV Community

Harry@StartQuick Tech
Posted on • Originally published at startquicktech.Medium on

Software Upgrade Can Cause Service Outage

A software upgrade or migration can be a headache for an operations team. You need to think through every potential issue that might bring the system down, because that determines how long your maintenance window needs to be.

I have got an interesting topic for you to think about.

Imagine you have many VMs or containers hosting the application, plus a database instance that can be either MySQL or PostgreSQL. Since it is a web app, you also have a load balancer in front to serve the traffic.

Now you are required to upgrade the software on your servers or containers to a new version, and the database is upgraded at the same time. Once this happens, the upgraded database is no longer compatible with the servers or containers still running the old version.

What questions should you ask?

  • Does the software upgrade require the database upgrade?
  • Do we expect an outage?
  • What deployment strategy should we use?

To upgrade the software, we need to run an upgrade script on the containers or servers. This script not only upgrades the software on the server, it also upgrades the database, for example by adding new tables or changing schemas. Once the database is upgraded to the new version, the containers or servers running the previous software version can no longer query it correctly.

This can cause a service outage, right? You may want to pause here and think about your own solutions.
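To make the incompatibility concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the real MySQL or PostgreSQL instance, and the table names, column names, and the upgrade itself are all hypothetical; the point is only that a query written for the old schema stops working once the schema changes underneath it.

```python
import sqlite3

# Stand-in for the shared database (assumption: the real one is MySQL/PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('harry')")


def old_app_query(db):
    """Query as the old software version would issue it (hypothetical)."""
    return db.execute("SELECT name FROM users").fetchall()


def upgrade_schema(db):
    """Hypothetical upgrade: move the data into a new table, drop the old one."""
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, display_name TEXT)")
    db.execute("INSERT INTO accounts (id, display_name) SELECT id, name FROM users")
    db.execute("DROP TABLE users")


before = old_app_query(conn)  # works against the old schema
upgrade_schema(conn)

try:
    old_app_query(conn)
    old_query_failed = False
except sqlite3.OperationalError:
    # The old code still asks for a table that no longer exists.
    old_query_failed = True
```

Any container still running the old code is in the position of `old_app_query` after the upgrade: it is up, but every request that touches the changed schema fails.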

Our real case has multiple containers running on AWS Fargate, and the database we use is Amazon RDS. You may not be familiar with AWS Fargate, so I will try to explain my thoughts in a simple way.

I have two options.

Option 1 — Add new container to run upgrade script first

  • Create a new container with the upgraded software in the cluster and let it run the upgrade script against the database. If the upgrade has already been applied, the script does nothing.
  • Once the upgrade script has finished, the old containers can no longer serve requests because they are not compatible with the upgraded database. With a health check configured on the ALB, the old containers are marked as unhealthy, removed, and rebuilt with the new version of the software, but this takes some time, which causes a service outage.
  • You can manually remove the old containers from the ALB once the new container is healthy, but this still takes time, although it may be shorter than letting the ALB do it automatically.
  • In our case, another interesting thing we noticed during the testing phase is that we cannot create multiple new containers for the upgrade, because the upgrade script then runs in several containers at the same time, which causes errors. So we have to add one new container first and then rebuild the rest.
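Two properties of the upgrade script matter in Option 1: it does nothing when the upgrade is already applied, and it must not run in two containers at once. The sketch below shows one possible way to get both, not the script we actually used. It assumes a hypothetical `schema_version` table and `TARGET_VERSION` constant, and it uses a SQLite file as a stand-in for the database. On PostgreSQL or MySQL, an advisory lock (`pg_advisory_lock`) or a named lock (`GET_LOCK`) could play the role that `BEGIN IMMEDIATE` plays here.

```python
import os
import sqlite3
import tempfile

TARGET_VERSION = 2  # hypothetical schema version shipped with the new software


def run_upgrade(db_path):
    """Apply the upgrade once; return True only if this call did the work."""
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the write lock up front, so a second
        # runner waits here instead of upgrading at the same time.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
        )
        row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
        current = row[0] or 0  # MAX() over an empty table yields NULL
        if current >= TARGET_VERSION:
            conn.execute("ROLLBACK")
            return False  # already upgraded: do nothing
        # Hypothetical schema change standing in for the real upgrade steps.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS audit_log (id INTEGER PRIMARY KEY, message TEXT)"
        )
        conn.execute(
            "INSERT INTO schema_version (version) VALUES (?)", (TARGET_VERSION,)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()


db_file = os.path.join(tempfile.mkdtemp(), "app.db")
first = run_upgrade(db_file)   # performs the upgrade
second = run_upgrade(db_file)  # finds the version already recorded and skips
```

With a guard like this, starting several new containers at once should be harmless in principle: one of them does the upgrade, and the others wait for the lock and then skip it.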

Option 2 — Add new containers to the new database

  • In this method, you need to duplicate the database first.
  • When you create the new containers, point them at the new database. The old containers keep talking to the old database, while the new containers upgrade the new database and talk to it.
  • But you need to be very careful about one thing: you must stop all writes to the database. Otherwise the two databases will drift apart, and anything written to the old database after the copy will be missing from the new one.
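Stopping writes does not have to mean taking the whole site down. As a rough sketch, the application could sit behind a read-only switch that rejects write requests while the database is being duplicated. The `MaintenanceGate` class below is hypothetical, and it assumes writes only ever arrive through the listed HTTP methods, which may not hold for every application.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


class MaintenanceGate:
    """Hypothetical switch that blocks writes while the database is copied."""

    def __init__(self):
        self.read_only = False

    def check(self, method):
        """Return an HTTP status: 200 to proceed, 503 to reject."""
        if self.read_only and method.upper() in WRITE_METHODS:
            return 503
        return 200


gate = MaintenanceGate()
gate.read_only = True   # flip before duplicating the database
statuses = [gate.check(m) for m in ("GET", "POST", "delete")]
gate.read_only = False  # flip back once traffic is on the new containers
after = gate.check("POST")
```

Reads keep working throughout, which is why this option may not look like an outage on a site that is mostly read traffic.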

Both methods above introduce an outage if we follow a common deployment strategy. However, the second one only requires you to stop write actions, which might not amount to an outage for some types of sites or systems.

Do you have any good ideas? Feel free to leave your comments.

Harry@NZ
