
Thierry Enongene
Infrastructure as Code Challenge - Terraform with GitLab CI/CD

Full project link on GitHub: https://github.com/users/tenongene/projects/14

Summary of Challenge and Plan

I recently came across an infrastructure as code challenge on GitHub called "DocNetwork's DevOps Challenge" and decided to star it and give it a try. I am passionate about furthering my skills and knowledge in all things cloud and DevOps, and this challenge seemed like a good one to test my skills.

A description of the challenge:

Challenge Link: https://github.com/docnetwork/Infra-as-code-challenge


PLANNING

-Design and build a quote generator single-page application with NodeJS and Express that fetches a random quote from a free quote API and displays a random inspirational image of nature fetched from a free photo collection API.

-Containerize the application with Docker and push the image to the DockerHub registry.

-Write Terraform manifest files for deployment on AWS via an autoscaling group of EC2 instances, configured with a load balancer.

-Build a CI/CD pipeline with GitLab, configuring a webhook in DockerHub and writing a Lambda function to update the instances.

-Write Terraform manifest files for deployment on AWS via an ECS cluster backed by an autoscaling group running task definitions, configured with a load balancer.

=================================

Initial Quote Generator application design with NodeJS and Express

The challenge stated that an already containerized application had been provided, but that no longer seemed to be available, as it's an old repository. I decided to build my own from scratch using NodeJS, Express, and some free APIs.


Wrote a single-page quote generator application using NodeJS and the Express framework. The page generates random inspiring quotes from a free quote API called ZenQuotes (https://zenquotes.io), and alongside each quote displays a random photograph of nature from a free photo collection API called Pexels (https://www.pexels.com).

A user clicks a "Generate" button, which fetches a new picture along with a new quote on each call to the APIs.

Configured the "Generate" button to hit a "GET" route in Express, which makes sequential calls to the quote API and then the photo API, resolving promises at each step to return the "picture", "quote" and "author" variables for injection into the page.

Masked the API key and port number using the dotenv module for security.

Used a random number function and attached the generated number as a variable to the API query string, so each request returns a random quote/image. The route is sketched below.
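
As a rough illustration, here is a minimal sketch of what that route could look like. The route path, the use of axios, and the environment variable names are my assumptions for illustration, not the actual source:

```javascript
const express = require('express');
const axios = require('axios'); // assumption: could equally be fetch/https
require('dotenv').config();     // masks PEXELS_API_KEY and PORT in a .env file

const app = express();
app.set('view engine', 'ejs');

app.get('/', async (req, res) => {
  // Random number attached to the photo query string so each call
  // lands on a different page of results.
  const rand = Math.floor(Math.random() * 100) + 1;

  // Sequential calls: quote first, then photo.
  const quoteRes = await axios.get('https://zenquotes.io/api/random');
  const { q: quote, a: author } = quoteRes.data[0];

  const photoRes = await axios.get('https://api.pexels.com/v1/search', {
    params: { query: 'nature', per_page: 1, page: rand },
    headers: { Authorization: process.env.PEXELS_API_KEY },
  });
  const picture = photoRes.data.photos[0].src.large;

  // Inject the variables into the EJS template.
  res.render('index', { picture, quote, author });
});

app.listen(process.env.PORT || 7272);
```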


Used the EJS templating engine to pass the "picture", "quote" and "author" variables received from the APIs into the page. I used simple styling from Materialize CSS.
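
The template itself might inject those variables along these lines (a hypothetical fragment; the real markup and Materialize classes were shown only in the screenshot):

```html
<!-- index.ejs: hypothetical fragment showing how the variables are injected -->
<img class="responsive-img" src="<%= picture %>" alt="inspirational nature photo" />
<blockquote><%= quote %></blockquote>
<p class="right-align">- <%= author %></p>
<a class="btn waves-effect" href="/">Generate</a>
```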


Initial page view from localhost:7272 during development.

==============================

Containerization of application with Docker

-Wrote the Dockerfile for building the application image, using a slim Node base image available on Docker Hub.

-Copied the current directory into the image working directory, which I named "app". Ran "npm install" so that all the application dependencies from the package.json file got included in the image.

-Exposed port 7272, which is the application port as written in Express.

-Set the command to "node index.js", which starts the application in the container.

-Built the image by running "docker build -t tenongene/quotegen ." in the current directory, then ran "docker push tenongene/quotegen" to push the image to Docker Hub. A sketch of the Dockerfile follows this list.


Application image now in Docker Hub.

=========================

Writing Terraform manifests for deployment on AWS

Wrote manifest files to deploy the application on AWS via an autoscaling group of EC2 instances.

Configured the VPC using the Terraform provider VPC module.


Entered security group rules for load balancer access, SSH, and metrics export for possible observability.
Created 4 public subnets, one in each of 4 availability zones, each of which will host the application, for high availability. A sketch of the VPC module block is below.
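
A minimal sketch of the VPC configuration, assuming the community terraform-aws-modules VPC module; the CIDR ranges, region, and names are placeholders:

```hcl
# Sketch assuming the community VPC module; CIDRs, region, and names are placeholders.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "quotegen-vpc"
  cidr = "10.0.0.0/16"

  # One public subnet in each of four availability zones
  azs            = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]
  public_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24"]
}
```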


Entered the autoscaling group launch template as required by the module. As this is a simple exercise, I did not go into much detail on launch configurations, scaling rules, and instance lifecycles as may be required in production settings.
Set the desired capacity to 4, with the intent of placing an instance in each subnet per availability zone.

Obtained an AMI ID from the console to use in the manifest.

Registered the autoscaling group with the load balancer target group by referencing the target group ARN.

Entered a base64-encoded user data script that bootstraps the EC2 instances on launch: it installs Docker, logs in to the Docker registry, and runs a container by pulling the image that was built and pushed to Docker Hub. Mapped the application port 7272 to the instance port. A sketch of the user data block is below.
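
Roughly, the user data fragment inside the launch template could look like this. The package commands assume an Amazon Linux AMI, and the credential variables are placeholders (real credentials belong in variables or a secrets store). Note the flush-left heredoc body, which matters, as described in the lessons learned below:

```hcl
# Sketch of the launch template's user data; distro commands and
# credential variables are assumptions.
user_data = base64encode(<<EOF
#!/bin/bash
yum install -y docker
systemctl enable --now docker
docker login -u "${var.dockerhub_user}" -p "${var.dockerhub_token}"
docker run -d -p 7272:7272 tenongene/quotegen
EOF
)
```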


Defined the target groups for subsequent registration of the autoscaling group, associated with the VPC by its ID.
Configured an HTTP listener for the load balancer, with ingress at port 80 and a backend target port of 7272 for the instances/application via the autoscaling group. A sketch is below.
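
A sketch of the target group and listener resources; the resource names are placeholders. Since the deployed load balancer is a network load balancer (shown in the results below), the listener protocol in this sketch is TCP:

```hcl
# Sketch: target group and listener; names are placeholders.
resource "aws_lb_target_group" "quotegen" {
  name     = "quotegen-tg"
  port     = 7272          # backend target port for the application
  protocol = "TCP"
  vpc_id   = module.vpc.vpc_id
}

resource "aws_lb_listener" "front" {
  load_balancer_arn = aws_lb.quotegen.arn
  port              = 80   # ingress port
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.quotegen.arn
  }
}
```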

Initially, the application was reachable from both the instances and the load balancer, because I had configured the security group to allow access to port 7272 from everywhere ("0.0.0.0/0"). I later realized this did not satisfy the requirement of access only via the load balancer, so I created a second security group resource unique to the load balancer, and configured the default security group in the ASG to accept incoming traffic on port 7272 only from the load balancer security group, as sketched below.
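
The resulting pair of security groups could look roughly like this (names are placeholders):

```hcl
# Sketch: public ingress terminates at the load balancer only.
resource "aws_security_group" "lb" {
  name   = "quotegen-lb-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # public access to the load balancer
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "instances" {
  name   = "quotegen-instance-sg"
  vpc_id = module.vpc.vpc_id

  # Application port reachable only from the load balancer security group
  ingress {
    from_port       = 7272
    to_port         = 7272
    protocol        = "tcp"
    security_groups = [aws_security_group.lb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # outbound needed for Docker pulls
  }
}
```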

Wrote an output file to surface identifiers for the important infrastructure resources.


=============================

Infrastructure deployment results

Ran terraform init, plan, and apply, and the infrastructure was deployed.


i)- The 4 EC2 instances launched.


ii)- The security groups (default and load balancer) and associations/rules.


iii)- The deployed network load balancer.


iv)- The target group, showing one instance per availability zone in a subnet.


v)- The deployed autoscaling group.


View from 4 different web browsers (Firefox, Opera, Edge, Chrome) when the application is accessed via the load balancer.



Accessing each instance via its public IP address at port 7272 does not reach the application, thus proving that the application is only accessible via the load balancer.


===============================

Building a CI/CD Pipeline in GitLab, with a DockerHub Webhook Invoking a Lambda Function

I migrated the application code to GitLab as my preferred platform for continuous integration and continuous deployment.


I configured environment variables in the GitLab pipeline with credentials to access the DockerHub registry for automatically pushing any image updates.


Wrote the .gitlab-ci.yml definition for the pipeline, which propagates changes in the application to the image upon a push to the code repository.
I used a Docker-in-Docker runner for the pipeline, as sketched below.
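
A minimal sketch of such a pipeline definition, assuming the registry credentials live in the CI/CD variables configured above (the variable names and image tags are placeholders):

```yaml
# Sketch of a Docker-in-Docker build-and-push job; names are placeholders.
stages:
  - build

build-and-push:
  stage: build
  image: docker:24
  services:
    - docker:24-dind   # Docker-in-Docker service
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker login -u "$DOCKERHUB_USER" -p "$DOCKERHUB_TOKEN"
    - docker build -t tenongene/quotegen .
    - docker push tenongene/quotegen
```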


Ran an initial pipeline with a test change to the application code to verify a successful build of the application Docker image.



I wrote a Lambda function in AWS named "quotegen-update" that updates the application image on the EC2 instances, using the NodeJS SDK for AWS. The function URL is used to configure a webhook in Docker Hub, which sends an HTTP POST request to that URL with each push of a new image to Docker Hub.

The idea is for the function to terminate the currently running instances, at which point the autoscaling group will fire up new instances to replace the terminated ones. Based on the embedded user data script, the new instances will pull the latest application image from Docker Hub when creating their containers.


The Lambda function handler makes use of the DescribeInstancesCommand and TerminateInstancesCommand methods of the EC2 client module.

The DescribeInstancesCommand was used to filter the instances by the resource tags of Name and instance running state. The instance ID was not ideal, because each new instance fired up by the autoscaling group gets a new instance ID and a new public IP address, so it wouldn't be possible to hard-code which instances to stop. The resource tags are the most reliable filter, since they are common to all instances created.

The filtered result provided the running instances, which I transformed into an array of instance IDs to be used in the TerminateInstancesCommand method. A sketch of the handler is below.
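
A minimal sketch of such a handler using the AWS SDK for JavaScript v3; the tag value and region are assumptions:

```javascript
// Sketch of the instance-recycling handler; tag value and region are assumptions.
const {
  EC2Client,
  DescribeInstancesCommand,
  TerminateInstancesCommand,
} = require("@aws-sdk/client-ec2");

const client = new EC2Client({ region: "us-east-1" });

exports.handler = async () => {
  // Filter by the Name tag and the running state, not by instance ID,
  // since IDs change every time the ASG replaces an instance.
  const { Reservations } = await client.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: "tag:Name", Values: ["quotegen"] },
        { Name: "instance-state-name", Values: ["running"] },
      ],
    })
  );

  // Flatten the reservations into an array of instance IDs.
  const instanceIds = Reservations.flatMap((r) =>
    r.Instances.map((i) => i.InstanceId)
  );

  if (instanceIds.length > 0) {
    await client.send(new TerminateInstancesCommand({ InstanceIds: instanceIds }));
  }

  return { statusCode: 200, body: `Terminated: ${instanceIds.join(", ")}` };
};
```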

Created the function with function URL: https://aicu5rzgtfpqmwhaua4z7sijyi0snlsk.lambda-url.us-east-1.on.aws/


I configured a webhook for the application image repository in Docker Hub with the URL of the Lambda function, to which a POST request is sent on each image update.


===========================

Demonstration of application CI/CD with Results

With the complete pipeline now in place, from the GitLab code repository through to the instances in AWS, I made a change to the application by adding a greeting to the HTML page. I then committed the changes and pushed them to the GitLab code repository.


The push to the repository triggered the CI/CD pipeline, which immediately began building the image.


Successful completion of the pipeline resulted in a push of the new image to DockerHub, which triggered the webhook to send a POST request to the Lambda function URL, thus invoking the Lambda function.


The lambda function executed by terminating the running EC2 instances.


The autoscaling group then began initializing new instances to replace the terminated ones, and the new instances initialize by pulling the newly updated image.


The updated application is now available when accessed via the load balancer as before.


============================

Writing Terraform manifests for deployment on AWS via an ECS Cluster backed by an Autoscaling Group

To deploy the application via ECS, I had to modify the autoscaling group to be a capacity provider for a new ECS cluster, and write new resources to create a cluster, a service, and a task definition to run the containers.

The load balancer and target groups had to be modified as well.


I decided to go with an application load balancer this time around, since I had garnered several insights from the first phase of the project. I changed the application to serve on port 80, and added a health check route at "/health"; that is how the load balancer reports instances as healthy. The route is sketched below.
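
In Express, such a health check route can be as small as this (a sketch; the actual handler may differ):

```javascript
// Health check route: the ALB probes this path and marks the
// target healthy on a 200 response.
app.get('/health', (req, res) => res.sendStatus(200));
```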


Created a launch template resource with the required EC2 configuration. The AMI used this time was an ECS-optimized Amazon Linux AMI.
Saved the cluster name to an environment variable and inserted it into the ECS agent config file.
Created an IAM instance profile for the instances for authorization. A sketch of the launch template is below.
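
A rough sketch of the launch template; the AMI variable, instance type, and resource names are placeholders:

```hcl
# Sketch of the ECS launch template; variables and names are placeholders.
resource "aws_launch_template" "ecs" {
  name_prefix   = "quotegen-ecs-"
  image_id      = var.ecs_optimized_ami # ECS-optimized Amazon Linux AMI
  instance_type = "t2.micro"

  iam_instance_profile {
    name = aws_iam_instance_profile.ecs.name
  }

  # Register the instance with the named cluster instead of "default"
  user_data = base64encode(<<EOF
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.quotegen.name} >> /etc/ecs/ecs.config
EOF
  )
}
```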


Created the cluster, capacity provider and attached the capacity provider to the cluster.


Wrote the service definition, referencing the capacity provider strategy for autoscaling, the load balancer, and the task definition (below), and attached an IAM role that I created in the console to be used by the service load balancer.


The task definition holds the container parameters: the container name, the image to pull from Docker Hub, the port mapping, and the execution role created in the console for authorization.
The task uses the "bridge" network mode, as sketched below.
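
A sketch of the task definition resource; the CPU/memory values and the role variable are assumptions:

```hcl
# Sketch of the task definition; sizes and the role variable are assumptions.
resource "aws_ecs_task_definition" "quotegen" {
  family                   = "quotegen"
  network_mode             = "bridge"
  requires_compatibilities = ["EC2"]
  execution_role_arn       = var.ecs_execution_role_arn # role created in the console

  container_definitions = jsonencode([
    {
      name      = "quotegen"
      image     = "tenongene/quotegen"   # pulled from Docker Hub
      cpu       = 256
      memory    = 256
      essential = true
      portMappings = [
        { containerPort = 80, hostPort = 0 } # dynamic host port under bridge mode
      ]
    }
  ])
}
```

With bridge mode, a host port of 0 lets ECS pick an ephemeral port per container, which the ALB target group tracks automatically.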


I created the Lambda function for the CI/CD process, whose URL will be used as a webhook in Docker Hub.


The idea is to use the UpdateServiceCommand of the ECS client in the NodeJS SDK. It accepts a parameter called forceNewDeployment; when set to true, it instructs the service to create a new deployment of the same quantity, using a rolling update strategy in which one container at a time is terminated and replaced. This prevents downtime for the service and creates a smoother transition to any image update; the new containers then pull the newest image from Docker Hub. A sketch is below.
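
A minimal sketch of that handler with the AWS SDK for JavaScript v3; the cluster and service names are placeholders:

```javascript
// Sketch of the rolling-update handler; cluster/service names are placeholders.
const { ECSClient, UpdateServiceCommand } = require("@aws-sdk/client-ecs");

const client = new ECSClient({ region: "us-east-1" });

exports.handler = async () => {
  // forceNewDeployment starts a rolling replacement of the running tasks,
  // so the new containers pull the latest image from Docker Hub.
  await client.send(
    new UpdateServiceCommand({
      cluster: "quotegen-cluster",
      service: "quotegen-service",
      forceNewDeployment: true,
    })
  );
  return { statusCode: 200, body: "Rolling update started" };
};
```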


As before, a function URL was generated for the function and used for the webhook configuration.

=============================

Infrastructure deployment results with ECS autoscaling

The deployed ECS Cluster


The service


Cluster infrastructure with containers in each availability zone


Load balancer and target group


The autoscaling group



Accessing the application


The CI/CD process and making a change to the application.


Updated image in DockerHub


CloudWatch logs showing function invocation by the webhook


Rolling update of the service in the cluster, with new containers being spun up in turn over about 15 minutes.



The updated application is now accessible with the newly updated image.


Project challenges, insights and lessons learned

In the initial phase of the project with EC2 instances, my user data script was not bootstrapping the EC2 instances with Docker in order to pull the image and run the container.
Further research pointed out that there should be no indentation to the right of the user data script in the Terraform manifest file. When I took out the indentation, the EC2 instances bootstrapped correctly with Docker running.
I also learned, only after completing the second phase of the project with ECS, that I had the option of using an Amazon Linux ECS-optimized AMI, which already comes with Docker pre-installed.

In short: do not indent the user data script in the manifest.
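
A sketch of the difference, under my reading of the issue (cloud-init only recognizes a user data script when the shebang starts at the very beginning):

```hcl
# Broken: the heredoc body is indented, so the instance receives
# "    #!/bin/bash" and cloud-init never recognizes it as a script.
user_data = base64encode(<<EOF
    #!/bin/bash
    yum install -y docker
EOF
)

# Working: the body is flush left, so the shebang is the first byte.
user_data = base64encode(<<EOF
#!/bin/bash
yum install -y docker
EOF
)
```

Terraform's indented heredoc form (`<<-EOF`), which strips the common leading whitespace, is another way to keep the manifest readable.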


I initially could not reach the application, although everything seemed to have been configured correctly.
My debugging method was to access the instances via SSH and check the Docker logs. I noticed that Docker could not load the Express module when starting the application. That meant I had forgotten to run "npm install" to install the dependencies from the package.json during the image build step. I added this, rebuilt the image, and the container then ran correctly.



During the ECS deployment phase of the project, although things seemed to be configured correctly at first, the tasks could not start and none of the containers could be deployed for some reason. I couldn't find any issues, and the EC2 instances were running and passing system checks.

Upon further research, I learned I could check the logs of the ECS agent at /var/log/ecs/ecs-agent.log to get a clue of what could be wrong. The logs showed that the ECS agent did not have the proper credentials and permissions to pull images; my understanding was that an instance profile was required. So I created an instance profile, attached it to the ASG definition, and that got resolved. A sketch of the instance profile resources is below.
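
Roughly, the instance profile wiring looks like this (resource names are placeholders; the managed policy is the standard one for ECS container instances):

```hcl
# Sketch of the instance profile; names are placeholders.
resource "aws_iam_role" "ecs_instance" {
  name = "quotegen-ecs-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS-managed policy that lets the ECS agent register and pull images
resource "aws_iam_role_policy_attachment" "ecs_instance" {
  role       = aws_iam_role.ecs_instance.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs" {
  name = "quotegen-ecs-instance-profile"
  role = aws_iam_role.ecs_instance.name
}
```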



After that resolution, the new issue was an error that the containers could not be registered because the cluster was inactive, although I could clearly see in the console that the cluster was active.
Further research, including an answer from Stack Overflow, suggested that the ECS agent only joins the "default" cluster unless the cluster name is specifically set as an environment variable (ECS_CLUSTER) in the ECS agent's configuration. I added it via the user data script (as in the launch template sketch above) and everything got resolved.

