DEV Community

ShandonCodes
Pylint: A day lost

The Intro

It happens at some point in everyone's career: one minute you are making a small change to your software project, and the next you come across an error. It is an error that no one else on your team has seen and, worse, one you have little to no idea how to debug. Strap in as I walk you through an error that took me an entire workday to solve and, most importantly, one that made me rethink the way my team operates.

Let me start by painting the scene. I picked up a maintenance ticket with a simple description: "Migrate Gitlab Runners to new vSphere instance". It sounded simple enough, but I decided to take the time to really improve how we deployed our runners. At the time, creating new runners for our project was a very manual process: an engineer would need to create the VM, install the OS and other software, and remember all of the configuration steps that were used on previous runners. Needless to say, this was not ideal, so I decided to take advantage of tools like Packer to ease the future creation of VMs. Specifically, I used Packer to create a VM template with all the required configurations (I'll talk about that in the future). Once I had the template, all I needed to do was perform a few clicks in vSphere and voilà, a new runner was available to the project.

The Error

Now that our runners were "migrated", I began testing our pipelines on one new runner (I decided to start testing with one more modest runner to simplify debugging). Everything ran just fine until the pipeline reached its linting stage. The linting job failed with no output, just a notice that the job had failed with an exit code of "1". If you are thinking what I was, you might be wondering how zero changes to the project source code could cause a linting error, and you would be right to wonder. As far as the linter was concerned nothing had changed, yet it was finding an issue. I ran the linter locally to double-check for errors and none were shown. I also double-checked the pylint version being run in the pipeline against my local one and confirmed they were exactly the same. While debugging this, two main things stood out:

  1. The version of pylint was a major version behind (we were using 2.13.9, and the current release as of this writing is 3.0.3).

  2. An error was thrown in the pipeline, but no output was displayed.

First, I touched base with my team about the outdated version of pylint. They mentioned that, due to breaking changes in the newest major version, the linter would fail outright on our very large codebase, and getting it to work would require a very large refactor. While this was not ideal, I decided to move on and focus on the more pressing second issue: the pipeline was failing and I was getting no error output. After about an hour of digging, I learned that our pipeline was writing the pylint output to a file, and because the job was failing, the artifacts it created (like the linter report) were not saved. So there was an actual error; I just could not view it.
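The failure mode described here (a report written only to a file that is discarded when the job fails) can be avoided by echoing the linter output to stdout as well as saving it. The following is a minimal sketch, not our actual pipeline code: the function name `run_and_tee`, the `src` target, and the report path are hypothetical, and it assumes pylint is installed in the job's Python environment.

```python
import subprocess
import sys
from pathlib import Path


def run_and_tee(command, report_path):
    """Run a command, print its output, and also save it to report_path.

    Returns the command's exit code so the CI job can still fail on lint errors.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
        text=True,
        check=False,  # a non-zero exit is expected when the linter finds issues
    )
    # Print first: the job log survives even if the artifacts are discarded.
    print(result.stdout, end="")
    Path(report_path).write_text(result.stdout)
    return result.returncode


# Hypothetical invocation ("src" and "pylint-report.txt" are placeholders):
#   exit_code = run_and_tee([sys.executable, "-m", "pylint", "src"], "pylint-report.txt")
#   sys.exit(exit_code)
```

Passing the exit code back to the caller keeps the job failing when the linter complains, while the job log now always contains the reason.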

The Problem

Easily enough, I changed the job to write to stdout, and then I could see the errors. They were related to some relative imports in part of the source. While this was great to know, I was still puzzled about why a linting error had suddenly come out of seemingly nowhere and why I could not replicate it on my workstation. Sure, I could just fix the linter issues found, but how would I or anyone else be able to ensure we would not receive linter errors in the pipeline if we could not test for them locally beforehand? After scratching my head over this for quite a few hours, I realized something (and you may have too). Remember earlier when I mentioned I created one modest runner for testing? Well, the exact specs were as follows:

  • 1 CPU
  • 64 GB RAM
  • 150 GB SSD

Compared to these specs from the now deprecated runners:

  • 4 CPU
  • 64 GB RAM
  • 150 GB HDD

The Solution

Notice the CPU count. While searching for "why are my lint results different from my pipeline" I did not get any exact results, but I stumbled across this bug report. Basically, the report says that when pylint is run with multiple cores, some errors may or may not appear, regardless of whether they appear when running pylint on a single core. Now, in our pipeline and locally we do not pin the number of cores: we run pylint with --jobs=0, which tells pylint to auto-detect the available processors and use AS MANY cores as possible. So in our pipeline the single-CPU runner's pylint instance was effectively using --jobs=1, while locally my workstation (4 cores) was effectively using --jobs=4. To confirm, I manually changed my local pylint invocation to use --jobs=1 and was finally able to reproduce the errors locally! To complete my testing, I updated the runner specs to have 4 CPUs and the pipeline passed with no issues!
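A discrepancy like this can be checked more systematically by running the linter once per job count and comparing what each run reports. The sketch below is illustrative only: `effective_jobs`, `lint_messages`, and `compare_job_counts` are hypothetical helper names, and `effective_jobs` only approximates how pylint resolves --jobs=0 (pylint's own detection may also take CPU affinity or container limits into account).

```python
import os
import subprocess
import sys


def effective_jobs(requested, cpu_count=None):
    """Approximate the worker count pylint ends up with for a --jobs value."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # --jobs=0 means "auto-detect"; any positive value is used as given.
    return cpu_count if requested == 0 else requested


def lint_messages(target, jobs, run=subprocess.run):
    """Run pylint on target with the given --jobs value.

    Returns the non-empty output lines as a set. `run` is injectable so the
    helper can be exercised without pylint installed.
    """
    result = run(
        [sys.executable, "-m", "pylint", f"--jobs={jobs}", target],
        capture_output=True,
        text=True,
        check=False,  # pylint exits non-zero whenever it reports messages
    )
    return {line for line in result.stdout.splitlines() if line.strip()}


def compare_job_counts(target, jobs_a, jobs_b, run=subprocess.run):
    """Return (only_in_a, only_in_b): lines reported under one job count only."""
    messages_a = lint_messages(target, jobs_a, run)
    messages_b = lint_messages(target, jobs_b, run)
    return messages_a - messages_b, messages_b - messages_a
```

For example, `compare_job_counts("src", 1, 4)` (with `src` as a placeholder target) would return the lines seen only with one worker and only with four. Summary lines such as the score may also differ between runs, so the result is a starting point for investigation rather than a verdict.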

To recap, the issue was caused by a known bug in an outdated version of pylint, but it was only found by chance when I began working on a completely unrelated ticket. If I had used more CPUs on my test runner to begin with, I might never have found this issue at all (especially if we had upgraded pylint in the near future).

The Path Forward

So how did I wrap everything up? I created two more runners, and thanks to the higher performance specs available on the new vSphere cluster, our release pipeline is now down from 4 hours to 2, a major win! Also, I did not actually fix the original linter error. I felt that it did not matter, as the error was never thrown on multi-core systems and our runners and workstations always have multiple cores. I did, however, cite this issue in a ticket to update pylint.

I learned that efforts that seem to add little to no benefit in the present may save a great deal of time in the future. If we as a team had prioritized updating our pylint version, then I may very well not have spent a full workday chasing bugs and learning more about an outdated version of pylint than I needed to.
