April 2020. I'd just begun coding regularly in Python. Built a toxic tweet classifier using LSTM models. Model wasn't spectacular by any standard, but during the whole process, I was introduced to something that would change the way I look at software development completely. It was, the Cloud.
I need to run the model faster so I used Virtual Machine in Google Cloud Platform to run the training on a higher configuration. And that was about it.
Fast forward a few months. August 2020. I was experimenting with random tools of NLP and tested out a chatbot using Rasa. Because of this minor experience, I was connected by a friend to an NGO, Weunlearn which had proposed to build a chatbot for providing Gender, Sex-ed and Mental health content. Needless to say, I signed up to help. Because for me, a chatbot simply meant a bunch of NLU training data and conversation stories.
After basic onboarding, it became quickly apparent that I was woefully under-prepared and under-equipped because we were soon to have a bot but where would it go?
Soon, the NGO had obtained credits in AWS through the Activate Program. Now it was my responsibility to put them to good use.
And I did. Just in a clumsy, haphazard and unprofessional manner.
In the first iteration of the chatbot, technically we faced a lot of complications that I feel require documentation. I believe we should share our stories not as simple tech tutorials or guides but also as journeys because it is part of what separates us from the machines we so passionately work on. What follows are the mistakes, glitches and poor planning decisions that only as software noob would do.
Let's start with in reverse chronological order of hiccups. Git is a version control system(VCS). GitHub is a cloud based service that allows you to host your git repositories. GitHub(or any VCS) allows for easy collaboration and code management across the team. The tech team at the NGO was 2 people, including me and we chose GitHub to store the repos.
GitHub is a place to store code files. Not binaries. Without understanding this, we had stored the Rasa NLP models in the GitHub repo. Which was fine until the day of deployment. A few last minute increases to NLP training data had pushed the model above the GitHub file limit. 2 hours to deployment and first message.
Needing to find alternate methods to store the binary and deploy, we turned to...
Needless to say, as a complete beginner I was unaware of how git-lfs works. I chose to use it in the heat of the moment. Git-LFS works by replacing the file contents with a pointer. If one of the system using the repo does not have git-lfs, the whole thing breaks down. And it did.
I was left with multiple Dockerfiles with no docker related lines but a pointer to a git lfs object we could not reach. What happened after that is a hazy mess as best, but we did manage to SCP the code from a local machine to the VM.
Always learn how to use VCS. Binaries do not go into GitHub. Use storage services like S3(free tier) or Google Cloud Storage(provides a 300$ starting credit for new sign up). If you do want to use a tool, understand the consequences before going ahead.
Don't experiment on the day of deployment
I had abundant credits in AWS and fair knowledge of Docker. So we decided to use a simple EC2 instance to host the application.
Rasa provides minimum system requirements to run their bots and we adhered to that. However, we were unsure of what to do in cases of overload. Although we tested with only a handful of users, inexperience got the best of us.
Failover strategy: To download the code into another VM(in Stopped state) and wait for the application to fail, so that we can manually start the backup instance.
However, I'd not setup a monitoring system to alert anyone for failures. Which essentially meant we had no way of knowing if it failed. It also meant any changes in training data would require manually updating multiple instances and verifying them.
Always try to ensure that your operations overhead is less(more on this in next topic). Use cloud based infrastructure to monitor your application as much as possible.
Continuous Integration/Continuous Delivery is one of the most important practices that a developer would need as they are moving from just coding to developing applications. Application development doesn't always mean cloud and even if you dont have access to cloud resources, CI/CD is something that should be part of your day to day development.
Continuous Integration means whatever changes you make in our code, are easily added to the source code or it's binaries. For example, if your application is executed as a binary, then every time you update the code, the latest binary must be built on its own. This allows you to focus on the code part of it, than building versions of binaries. Git uses branches to version code, any update in master branch may build a stable binary, whereas in the development branch the binary maybe tagged "latest". Allowing for easier understanding of changes and versions of code.
Continuous Delivery is taking the updated artefacts(binaries, docker images) and updating the application servers without manual intervention. Let's say you use Docker images in your application. Every push into the master branch should trigger a CI workflow which build a docker image and stores it securely. CD workflow would replace the outdated docker image in the application and restart application with the new one.
All done automatically.
None of this was a part of our first iteration of the bot. Models were manually built, testing was unheard of and any changes had to manually modified in each server.
Even the smallest of the application benefits from CI/CD practices.
It helps ease workload by automating most of it
As is evident from these, our bot was shabby and clunky technology wise. It made our job a lot harder because we had to deal with problems that arose not from the product, but from the practices that were being followed.
The story however, does have a happy ending. Watch out for Journey through DevOps - Part 2: The Awakening.