- What are pipelines
- DevOps pipelines
- What CI/CD pipeline software to use?
- Why did we choose GitLab?
- Docker Technology
- gitlab-ci.yml file
- Pipeline lifecycle example
In computing, pipeline refers to the logical queue that is filled with all the instructions for the computer processor to process in parallel. It is the process of storing and queuing tasks and instructions that are executed simultaneously by the processor in an organized way.
This is different from the regular queue or stack data structures in computer science. These data structures work in FIFO or LIFO approaches, respectively. This means literally "First In First Out" or "Last In First Out" principles, no matter if working with elements, instructions, files, or any other arbitrarily listable item.
DevOps is a set of practices that combines software development and IT operations. It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. DevOps is complementary with Agile software development; several DevOps aspects came from the Agile methodology.
So if you're planning on having an 'agile' work environment, you must have good software foundations and automation processes set in place to achieve fast-paced development and results. If everything is done manually, the environment would be rigid, stiff, and slow, thus being the exact opposite of agile.
In software engineering, CI/CD or CICD is the combined practice of continuous integration and either continuous delivery or continuous deployment. CI/CD bridges the gaps between development and operation activities and teams by enforcing automation in the building, testing, and deployment of applications. This would preferably shorten the gap between developers developing the project locally in their IDE and the project being published in their normal production environment, either on public domain for clients or private domain for staging with other branches of the product development lifecycle (backend/frontend/design/QA/testers/...).
Not only that we've shortened the gap between the developer and productional environment, but we've also introduced an assurance that it will work within the productional environment. The main component of CI/CD is that we run install/compilation, build, and test inside environment which mimics production. Therefore we've eliminated the chance of "but it works on my machine" happening.
There are many options available to provide continuous delivery and integration in your software development lifecycle. Some of the most popular CI/CD tools are:
The tool for the job is a personal preference and depends on which of these tools fit your project, budget, requirements, language, technology, etc.
From this point on, we will focus on GitLab's pipeline.
Well, the choice was not a hard one. GitLab is our default VCS and provides a rich set of features for CI/CD pipelines leveraging the Docker technology. The beauty of in-house CI/CD integration with your VCS is that the pipeline itself can be triggered upon various events from the developers. The pipeline itself can be configured in a way that it triggers certain code blocks in all kinds of situations. For example, it can be triggered upon pushing to a certain branch, or by providing some sort of specific trigger key within commit message, it can be triggered upon Merge Requests, upon the success of those merges, etc.
This approach allows the DevOps engineer to configure the pipeline in a way that other software engineers mindlessly continue with their usual workflows and are completely oblivious to what's happening in the background. This is great because now you need only one good engineer to configure and maintain it, whereas the rest of the team/organization does not have to bother with learning the technology inside out to use it.
Alongside being our default VCS and having great flexibility, it can also be hosted on-premise. This way we are not using GitLab's servers (also called runners) but our own. GitLab's pipeline will only charge you for computing time on their servers so we are saving some money. But GitLab does have fast servers which are relatively cheap (10$ for 1000min of computation). That would be a good approach for big companies, it would cost much less in the long run than configuring their cluster of runners.
Docker is a set of 'platform as a service' products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their software, libraries, and configuration files; they can communicate with each other through well-defined channels. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
In raw, the docker allows you to simulate the production environment by replicating it beforehand and then containerizing it and publishing to hub.docker.com. The important thing to consider is that you want to keep your docker image as minimalistic as possible because it does produce a heavy load for the machine using it. You want to containerize only the service which you need to use and test, eg. for our backend stack, we've had to containerize PHP 8 on Ubuntu, and we've added node, npm, and composer to the container. That's it, voila! Now you'd think that it's smart to ship other technologies to the same container, for example, using this container to test the frontend services would be pretty inefficient.
A much better approach is to use a different image for each stack you require because otherwise, you will have unused software within your container.
"Yeah I have it on my computer and server too, so what's the deal?"
The thing is, for every job in the pipeline, once started, the docker fires a fresh container and boots it up. If you want it to be fast, you want it to be as minimal as possible. Meaning that the backend will have only backend software on its image, whereas the frontend will only boot up node.js image, without PHP and composer. Separating services in different containers will help you speed up the pipeline overall.
The boot time is the biggest overhead that you get with additional software, but only if you are using the self-hosted runner. Otherwise, if you're running the pipeline on GitLab servers, you need to download the image (container) at every step of the pipeline. And if you have big images with unneeded software, you will both download and boot longer, and GitLab charges per minute of processing time, meaning that you're twice as inefficient.
YAML is used because it is easier for humans to read and write than other common data formats like XML or JSON. Further, there are libraries available in most programming languages for working with YAML. For a syntactic guidelines of YAML files, Google is your best friend.
This file is the main configuration for your Gitlab pipeline. Whenever you add this file to your project, Gitlab will go through it whenever you change something within your project. Here you will define the flow of your pipeline, jobs, stages, what it will do, when will it execute, etc. By default, it will use shared GitLab runners where you'll have 400 free minutes monthly, by default.
The YAML file has a variety of keywords and control structures so you're able to define the what and when. The content of the job, within the
script tag, are the commands which will execute inside the docker container containing your application. Here are some of the most commoontrol structures:
image- defines which docker container will run for given job/pipeline (pulls from hub.docker.com)
stages- define stages in which you can group your jobs. Stages run serially (one after another), whereas jobs within the same stage run in parallel
only- defines when a job will run (eg. only on merge request)
artifacts- defines which files will be shared between different jobs (because the new container is initialized per job, thus contents are lost unless specified with artifacts)
cache- defines which files will be saved to the server for retrospection
script- defines what commands will be executed within the container (OS-level commands, eg.
composer install xy)
These are some of the main control structures out of many, for more you can check the documentation.
The pipeline starts from the
.gitlab-ci.yml file. Here we'll analyze a simple pipeline configuration and what happens each step of the way. We will consider the following pipeline for a front-end project.
image: node:latest stages: - build - run # Job's name first-job: # Define stage stage: build # What to run on the job. script: - npm install artifacts: paths: - node_modules second-job: stage: run script: - npm run start - node test.js artifacts: paths: - node_modules/ second-job-parallel: stage: run script: - echo "I'm running at the same time as second-job!!!"
As we can see, we have two stages, first stage installs the modules and has only one job. The second stage kicks in if the first one finishes successfully and starts two jobs in parallel.
The first thing that happens once you push your code is that GitLab scans the
.gitlab-ci.yml. If there are no limitations in configuration, the pipeline will be running on every push, merge request and merge result. Gitlab will then contact the 'runner', aka. the server which will be executing your pipeline. Note, if you have a single GitLab runner they will queue in FIFO principle.
Once the server responds, it will start the docker executor with the image you've specified. In our case, it is node:latest. If required, you can specify a different image for each job. If the server has the image cached, it will start using it, otherwise, it will have to download it first.
Then, once you have your container ready and booted up, your project will be downloaded from your repository and once you're placed into the project root directory, it will start executing commands you've provided within the
scripts list. In our case, it will install node modules. Once the job is finished, the artifacts will be uploaded so the next job using can download them back into the container. Then the cleanup kicks in and the container is closed.
The second job is different, but only slightly. It will download the artifacts right after the project is downloaded, so we have everything ready for scripts. Once this job finishes, it will again upload the modified artifacts for whoever might be using them next. Other than that, it is the same as the first job regarding the structure. But the commands are slightly different, and here as an example, we run test.js to test if our application is working. Here are the contents of test.js:
console.log('Hello from Node.js!') console.log(new Date().toUTCString()) console.log('Exiting Node.js...')
And the last job is no different from the others, it just proved to us that two jobs can indeed run in parallel.
To conclude the pipelines, I would say that it is a must for any serious company to have these operations set in stone. Human error is removed from the equation and monotonous tasks are automatized. A good pipeline will also deploy the code to a staging server, the company's internal server for testing, quality assurance, and collaboration; but ultimately all productional deploys should be done manually by setting
when key to
.gitlab-ci.yml file. Many many more things could be done within the pipeline such as benchmarking the app, stress-testing, and others. I might cover them in the next blog, but until then, what features do you think would make an awesome pipeline?