Avoid alert fatigue in your CI/CD

Jorge Alberto Díaz Orozco (Akiel) (jadolg) ・4 min read

TL;DR

My team kept getting interrupted by build notifications that carried no useful information, so we built a GitHub Action to fix it.

A story

All those essential meetings are over. You finally have time to focus and do some coding. One of your teammates volunteers to pair with you, and together you start building that new feature everyone is so eager to get. Ten minutes into analyzing the task, a chime sounds and a Slack notification shows up on your screen. It is the alerts channel for your project's builds. Other teammates have just pushed code for their task and it has built correctly. Their code is now deployed and everything seems fine. You close the notification and continue with your task, but the rest of the team is also moving, and before you notice, you have checked the notifications channel five times in the last hour only to see green builds.

You think: let's turn on notifications only for the broken builds, so I don't waste my time checking things that are not broken. After some days with this approach, though, you notice two problems. You don't find out that the builds are green again unless you check manually, and it now feels sad to open the channel and see only the red statuses of broken builds. You also want to know when there has been a recovery. The problem: there is no simple way of doing that with GitHub Actions, which is what the team uses for CI/CD.

Does this story ring a bell?

A solution

Lucky for us, GitHub Actions is a very flexible platform and the GitHub API helps a lot in finding information about your projects and builds. At this point, we decided to make our own action to tell us if we should notify Slack about the current build status.

This is the strategy we wanted to apply:

  1. If the current build failed, then it will recommend sending the message
  2. If the current build succeeded, and the previous one failed, then it will recommend sending the message
  3. If the current build succeeded, and the previous one succeeded, then it will recommend NOT sending the message
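
The three rules above can be sketched as a small function. This is only an illustration of the strategy, not the action's actual source; the function name and its boolean parameters are made up for the example, and it does not cover the case where no previous build exists.

```python
def should_send_message(current_failed: bool, previous_failed: bool) -> bool:
    """Apply the three notification rules (hypothetical helper)."""
    if current_failed:
        # Rule 1: a failing build is always worth a message.
        return True
    # Rule 2: current succeeded after a failure -> announce the recovery.
    # Rule 3: current succeeded after a success -> stay quiet.
    return previous_failed


# Map the boolean to the 'yes'/'no' strings that a workflow condition can compare.
print("yes" if should_send_message(False, True) else "no")
```

Only one of the four combinations stays silent: a green build that follows another green build.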

With this stated, we can move on to the next step: finding the right GitHub tools to build our action. Actions don't provide information about previous builds out of the box, but we can always query the GitHub API, and so we did. There is an API endpoint that returns the status of the runs for a specific workflow. To use it, we need the current workflow id (which for some reason is not injected into the context by GitHub), but we can obtain it from the response we get when querying the endpoint for the current workflow run.
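
As a rough sketch of those two lookups, the helpers below build the REST URLs and pick the previous run's conclusion out of an already-fetched response. The helper names are hypothetical, the HTTP call itself is left out, and the code assumes the run list comes back newest first; it is not taken from the action's source.

```python
from typing import Optional
from urllib.parse import urlencode

API = "https://api.github.com"


def current_run_url(repo: str, run_id: int) -> str:
    """URL of a single workflow run; its JSON response includes `workflow_id`."""
    return f"{API}/repos/{repo}/actions/runs/{run_id}"


def workflow_runs_url(repo: str, workflow_id: int, branch: str) -> str:
    """URL listing the completed runs of one workflow on one branch."""
    query = urlencode({"branch": branch, "status": "completed"})
    return f"{API}/repos/{repo}/actions/workflows/{workflow_id}/runs?{query}"


def previous_conclusion(runs_payload: dict, current_run_id: int) -> Optional[str]:
    """Return the conclusion of the first run that is not the current one.

    Assumes `workflow_runs` is ordered newest first.
    """
    for run in runs_payload.get("workflow_runs", []):
        if run["id"] != current_run_id:
            return run.get("conclusion")
    return None  # no earlier run found, e.g. the very first build


# Hand-written payload shaped like the API response, for illustration only.
payload = {"workflow_runs": [{"id": 3, "conclusion": None},
                             {"id": 2, "conclusion": "failure"},
                             {"id": 1, "conclusion": "success"}]}
print(previous_conclusion(payload, current_run_id=3))
```

Inside a workflow, the repository and run id would typically come from the `GITHUB_REPOSITORY` and `GITHUB_RUN_ID` environment variables.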

With the previous build information in hand, we only need to check whether any of the jobs the build depends on has failed. For this, we use the needs property in our job: we pass the resulting needs context to our action, which looks for a failed job in it. If any of the needed jobs failed, we need to send a notification.
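
One way to perform that check, sketched here under the assumption that the context arrives as the JSON string produced by `toJson(needs)` and that a failed job carries `"result": "failure"`. The function name is hypothetical, and the published action may inspect the context differently.

```python
import json


def any_needed_job_failed(needs_context: str) -> bool:
    """Return True if any job in the serialized `needs` context failed."""
    needs = json.loads(needs_context)
    # Each entry looks like {"result": "...", "outputs": {...}}.
    return any(job.get("result") == "failure" for job in needs.values())


# Hand-written example of a serialized needs context.
example = '{"job1": {"result": "success", "outputs": {}}, "job2": {"result": "failure", "outputs": {}}}'
print(any_needed_job_failed(example))
```

Parsing the JSON rather than searching the raw string avoids false positives when, say, a job output happens to contain the word being searched for.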

Show me how to use it already!

Here is what we ended up writing: https://github.com/marketplace/actions/should-i-notify-slack. We decided to publish it and make it open source, so you can use it in your own workflows too.

In the following example, we have two jobs in our workflow and we need to send notifications according to the conditions discussed above. To accomplish this, we:

  1. Add a third job that sends the notification (slack-workflow-status)
  2. Add a condition to run this job even when the other jobs fail (if: always())
  3. Add job1 and job2 to the needs of the slack-workflow-status job
  4. Use our action to determine whether to send the update to Slack, by adding the condition if: steps.should_notify.outputs.should_send_message == 'yes' to the actual notification step, which in our case uses the Gamesight/slack-workflow-status action

Note that the example only sends notifications for the main branch:

job1:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - name: Run the script
      run: ./script1.sh
job2:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - name: Run the script
      run: ./script2.sh

slack-workflow-status:
  name: Post workflow status To Slack
  needs:
    - job1
    - job2
  if: always() && github.ref == 'refs/heads/main'
  runs-on: ubuntu-latest
  steps:
    - name: Determine if we need to notify
      uses: Jimdo/should-i-notify-action@main
      id: should_notify
      with:
        branch: main
        needs_context: ${{ toJson(needs) }}
        github_token: ${{ secrets.GITHUB_TOKEN }}

    - name: Slack workflow notification
      if: steps.should_notify.outputs.should_send_message == 'yes'
      uses: Gamesight/slack-workflow-status@master
      with:
        repo_token: ${{secrets.GITHUB_TOKEN}}
        slack_webhook_url: 'https://hooks.slack.com/services/...'
        channel: 'notifications-slack-channel'
        name: 'Your great build bot'

Final words

It might seem like meaningless notifications are not such a big problem, but they are. When most notifications carry no information, you and your team start ignoring them all, and you notice real problems later than you should. Cutting notifications and alerts down to only what is important is crucial for handling problems effectively when they occur and for protecting your focus time from distractions.

Cover image from https://www.recordedfuture.com/security-operations-alert-fatigue/
