Every system administrator, at least once in their career, has had or will have to deal with a server that simply doesn’t boot anymore.
It might happen during a system upgrade that requires a reboot, after a kernel panic when you need to bring the server back, or simply when rebooting a server that has been running for years and nobody knows what to expect.
The recovery strategy used to require physical access to the server. Maybe you have a small bootable Linux distro on your thumb drive or an installation CD-ROM. If you are lucky, your server has a console available over Ethernet (iLO, iDRAC, etc.) and you can mount the bootable ISO file using the virtual drive.
Things got a little simpler with virtual machine environments. Similar to console connections, you can simply log in to your hypervisor, watch the boot process from there, and map your ISO file to boot first.
And then came the VMs running in the Cloud…
On Google Cloud VMs (GCE) there is no access to the bootloader. Most cloud providers were built on the assumption that VMs are disposable and easily replaceable, as long as you have an image template that allows you to just delete the faulty VM and deploy a new working one.
Unfortunately, that is not always the case. Often, as with backups, by the time we realise we need one it is already too late.
I know that because I have also caught myself thinking, many times:
“It’s just a simple upgrade, nothing can go wrong…”
...and we keep staring at that ping output, praying for the server to come back to life.
To help with those situations, GCE Rescue was created to boot Google Cloud VMs (GCE) in rescue mode.
GCE Rescue uses an approach similar to the recovery process for physical servers. It creates a temporary, minimal Linux image disk and adds it at the top of your disk list, ahead of your boot disk.
The tool is written in Python 3 and is available on PyPI. I personally recommend running it in Cloud Shell, from the Google Cloud Console page. If you prefer to install it on your own OS, it is a good idea to use a virtual environment.
$ pip3 install gce-rescue
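If you go the virtual environment route, the install could look something like the sketch below. The helper name install_gce_rescue and the default directory are my own choices for illustration and are not part of the tool.

```shell
# Hypothetical helper: installs gce-rescue into its own virtual environment.
# The default directory name is an arbitrary choice for this sketch.
install_gce_rescue() {
    venv_dir="${1:-$HOME/gce-rescue-venv}"
    # Create the isolated environment with Python's built-in venv module.
    python3 -m venv "$venv_dir" || return 1
    # Use the venv's own pip so nothing is installed system-wide.
    "$venv_dir/bin/pip" install gce-rescue
}
```

After running install_gce_rescue, activate the environment with `. "$HOME/gce-rescue-venv/bin/activate"`, which should put the gce-rescue command on your PATH for that shell session.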
While in rescue mode, you can take the necessary steps to fix your boot disk, which is now attached as a secondary disk.
GCE Rescue will try to detect your faulty boot disk and mount it read-write automatically on /mnt/sysroot. This allows you to chroot into your disk, edit files, run filesystem checks, install or remove packages (including GRUB itself), or take any other action needed to fix the error you have.
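As a rough sketch of the chroot step: tools such as package managers and GRUB usually expect /dev, /proc and /sys to be visible inside the chroot. The helper below bind-mounts them. The name prepare_sysroot is hypothetical, and I am assuming the rescue environment has not already set up these mounts for you, so check with `mount` first.

```shell
# Hypothetical helper: bind-mounts the kernel filesystems a chroot usually needs.
# Must be run as root inside the rescue environment.
prepare_sysroot() {
    sysroot="${1:-/mnt/sysroot}"
    # Stop early if the faulty disk was not mounted where we expect it.
    if [ ! -d "$sysroot" ]; then
        echo "sysroot not found: $sysroot" >&2
        return 1
    fi
    # Expose /dev, /proc and /sys inside the chroot.
    for fs in dev proc sys; do
        mount --bind "/$fs" "$sysroot/$fs" || return 1
    done
}
```

With that in place, `prepare_sysroot && chroot /mnt/sysroot /bin/bash` drops you into a shell on the faulty disk, assuming bash is installed there.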
Your VM will stay in rescue mode until you run GCE Rescue against it one more time. Once you are done, run the tool again: GCE Rescue will detect that the VM is in rescue mode and restore the instance's original configuration.
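Since the same invocation both enters and leaves rescue mode, a tiny wrapper can cover both directions. This is only a sketch: the --zone and --name flags are what I recall from the project README, so treat them as an assumption and confirm with `gce-rescue --help`. The wrapper name toggle_rescue is hypothetical.

```shell
# Hypothetical wrapper: the same command enters rescue mode and later restores the VM.
toggle_rescue() {
    if [ "$#" -ne 2 ]; then
        echo "usage: toggle_rescue ZONE INSTANCE_NAME" >&2
        return 2
    fi
    # --zone and --name are assumed from the project README; confirm with --help.
    gce-rescue --zone "$1" --name "$2"
}
```

You would call it once with your VM's zone and name to boot into rescue mode, and once more with the same arguments after fixing the disk to restore the original configuration.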
I hope this post helps you considerably speed up the process when you need to recover a VM in Google Cloud.
Check out the GitHub page (https://github.com/GoogleCloudPlatform/gce-rescue) to keep up to date with new features and to contribute to development.