DEV Community

Cover image for The Story of a Bug: Discovery and Unraveling Through CI/CD and Unit Testing
Artak Avetyan
Artak Avetyan

Posted on

The Story of a Bug: Discovery and Unraveling Through CI/CD and Unit Testing

TL;DR

In the world of rapid software development, encountering a bug is as common as implementing a new feature. A bug is a piece of code with its logic, but it doesn't work as expected. Fixing a bug sometimes involves a detective-like process, where we carefully analyze its behavior, trace it to its roots, and eventually, craft a solution. While some bugs are quick to fix, others can hide in the shadows, requiring days of analysis before we can finally catch and fix them. The duration of bug fixing, without proper analysis, can be unpredictable, making it a challenging task.

Let the story begin


The Project: AREG Framework

At GitHub, my playground is an OSS project named AREG (Automated Real-time Event Grid). This asynchronous ORPC (Object Remote Procedure Call) framework simplifies the complexities of multithreading and multiprocessing programming by blurring borders between processes and threads, and treating remote objects as if they coexist in the same thread. While I won't dive into too many details, just mention that it has enough complexity and interesting features. Although the other developers periodically join the project and I appreciate their help, as it is a learning opportunity for me as well, most of the time, it is a solo project, which often results in a heavy load of feature implementation.

Months ago, I integrated unit tests (GTest) in the project to test the objects and catch issues. Yet, I didn't use them much. Most of the time, I tested things by creating examples, thinking I’m hitting two targets with one shot. So far, it worked fine. At the moment the project has 21 examples, ranging from basic feature demonstration to meshes in IPC (Inter-process Communication). These examples have helped me find and fix a bunch of bugs.


The Latest Task: A Logging Service

Currently, I'm focused on developing a logging service called Log Collector, a feature aims at enabling developers to manage logging from multiple processes during runtime without requiring a restart. Additionally, the service allows each process to save the actual logging state in a configuration file, ensuring that applications can restore their states upon the next start.

In recent days, I was busy developing communication protocols and optimizing the saving of log configuration. Since the Log Collector is still in progress, this time I've chosen a different approach and instead of writing an example I’ve decided to check the log state saving functionality in unit tests. So far, I've prepared several unit tests, ensuring everything aligns with my plans, and pushed these changes to GitHub.


GitHub's Surprise: Embracing CI/CD

I find the GitHub action workflow is invaluable. My favorite is CMake cross-compile action to compile code with various options and compilers across both Windows and Linux environments. This setup also allows me to compile and execute unit tests seamlessly.

Surprise awaited me

However, a surprise awaited me. Shortly after, the MSVC (Microsoft Visual C++) runs broke, while all the Linux and Cygwin builds and runs went smoothly. This puzzled me, as my primary development IDE is MSVS (Microsoft Visual Studio) and I test a lot. OK, probably I forgot something.


A Cryptic Message: The Bug

The logs displayed a cryptic message during unit tests, referring to non-native code:

3/7 Test #3: LogScopeTest.StartAndStopLogging ............Exit code 0xc0000409
***Exception:   0.83 sec
Running main() from D:\a\areg-sdk\areg-sdk\thirdparty\googletest\googletest\src\gtest_main.cc
Note: Google Test filter = LogScopeTest.StartAndStopLogging
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LogScopeTest
[ RUN      ] LogScopeTest.StartAndStopLogging
minkernel\crts\ucrt\src\appcrt\stdio\output.cpp(282) : Assertion failed: ("Buffer too small", 0)
Enter fullscreen mode Exit fullscreen mode

WOW! What is the meaning of “Buffer too small” here? I searched the internet, but the discussions in other projects didn't clarify the real meaning of the assertion. The situation was frustrating.


Solving the Mystery: Experimenting

What confused me was that the code worked flawlessly on my local machine, but failed on GitHub. I created build artifacts, downloaded them to my local environment, and ran the tests – everything was fine, no errors, no warnings. The lack of a clear answer troubled me. Could it be the operating system? I'm still using Windows 10. Then why hasn't anyone with Windows 11 reported this bug? Or can it be the GitHub security limiting file system access? Numerous questions with no satisfying answers.

I have so many questions

When facing such a dilemma, experimentation becomes a necessity. Without the ability to debug on GitHub and having no logs, I had to experiment, starting with simple tests and progressively moving towards more complex scenarios.


Narrowing the region: Eliminating Possibilities

When the origin of a bug is elusive, it's essential to start narrowing down the possibilities by excluding segments of code that are unlikely to be causing the problem. This technique of narrowing is known as the Elimination Method.

Narrowing problem to a few options

First, I designed a few unit tests to read and write files in the file system, ensuring that accessing the configuration file was functioning as expected. Next, I created a test to initiate logging, and when I enabled logging, the same exception surfaced. Bingo! At least now I have a specific procedure that was failing, giving me a clear target for my efforts. The unique aspect of starting logging is that each process generates a log file using a ./logs/%appname%_%timestamp%.log masking pattern, where the %timestamp% is replaced with the local time's timestamp.

Subsequent unit tests confirmed that the failure indeed lay within the timestamp mechanism. How was this possible? This feature was probably tested the most; practically every process in the examples generated kilobytes of logs, each with its timestamp. However, I accepted the truth – a fact that couldn't be ignored. I decided to increase the buffer size for generating timestamp strings, ran the tests again, and again failed, but surprise! This time, the error message changed and again referred to non-native code:

5/19 Test #15: LogScopeTest.StartAndStopLogging ......................Exit code 0xc0000409
***Exception:   0.58 sec
Running main() from D:\a\areg-sdk\areg-sdk\thirdparty\googletest\googletest\src\gtest_main.cc
Note: Google Test filter = LogScopeTest.StartAndStopLogging
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LogScopeTest
[ RUN      ] LogScopeTest.StartAndStopLogging
minkernel\crts\ucrt\src\appcrt\time\wcsftime.cpp(971) : Assertion failed: timeptr->tm_year >= -1900 && timeptr->tm_year <= 8099
Enter fullscreen mode Exit fullscreen mode

"What am I doing wrong to have an incorrect year? What does timeptr->tm_year >= -1900 && timeptr->tm_year <= 8099 assertion mean?" I wondered.

What does it mean


Unveiling the Solution: Bugs Hate Traps

With some progress under my belt, I continued my quest staying the course. Instead of testing just the DateTime object, I expanded my efforts to directly call the Win32 API wrapper functions, keeping the same sequence of calls. The next surprise was a trap – ASSERT(false) that triggered in the code responsible for converting UTC time to Local time. The Win32 API's GetTimeZoneInformation() method returned an unknown timezone:

TIME_ZONE_INFORMATION tzi{0};
if ( TIME_ZONE_ID_UNKNOWN != GetTimeZoneInformation(&tzi) )
{
    SYSTEMTIME local;
    if ( SystemTimeToTzSpecificLocalTime(&tzi, &utc, &local) )
    {
        _convWinSysTime2AregSysTime(local, aregLocalTime);
        result = true;
    }
}
else
{
    ASSERT(false);
}

Enter fullscreen mode Exit fullscreen mode

Eureka! This discovery seemed to make sense of everything. On local machines where the timezone is set, the method returns a valid value. However, on GitHub, the Windows images might lack timezone information, leading to the TIME_ZONE_ID_UNKNOWN return value. As a result, the fields of the aregLocalTime object remained zero, and during timestamp computation, it converted signed 0 - 1900 value to unsigned 4294965396. Additionally, the month 0 - 1 was set as unsigned 4294967295. The combined effect potentially triggered the original "Buffer too small" problem.


Triumph

Triumph Over Trouble: Fixing the Bug

I found you, you elusive Bug! Bugs are lines of code that must be eliminated, or else they will haunt us endlessly. It took me three days of hunting this bug, which seemed to lurk only on Windows machines without a timezone setting. Yet, the solution was straightforward and took a minute to fix. First and foremost, I must express my gratitude to GitHub actions for helping unearth this hidden bug, a ticking time bomb in my codebase. Furthermore, the importance of unit testing has been solidified in my mind; it's a powerful tool that deserves consistent usage.


Conclusion

In conclusion, the key takeaways from this story are as follows:

  1. Prioritize Unit Testing: Unit tests are essential and should be integrated into the development process from the beginning. Create a dedicated task for writing unit tests, monitor progress, and give developers the time they need for testing. Allocating time for writing unit tests, even in resource-constrained projects, is crucial to prevent potential crashes and ensure code stability.
  2. Test Thoroughly: A comprehensive approach to unit testing is essential. Test various scenarios, not just a limited number of cases. If you identify a bug, better create a unit test to reproduce it, helping to safeguard against future regressions.
  3. Quality over Quantity: While high code coverage may seem impressive, it is more important to focus on the quality of unit tests. Aim to catch crucial software issues rather than just chasing a high coverage number.
  4. Use Assertions: Incorporating assertions strategically can act as safety nets, detecting unexpected issues early in the development process. These assertions can be invaluable in preventing serious bugs from going unnoticed.
  5. Leverage CI/CD: Implementing Continuous Integration and Continuous Deployment (CI/CD) processes can greatly enhance your development workflow. Tools like GitHub Actions enable you to set up various development and testing environments, helping identify bugs that might otherwise remain hidden.

Call To Action

Postscript: Call To Action

I'm excited to expand the unit test coverage for the AREG project, and I invite you to join me. If you are new to unit testing but eager to learn, there is a place for you in the project. We have many open issues related to unit tests, and even a single test you contribute can make a significant impact.

If you are inclined to contribute to AREG, please don't hesitate to join us. We are in need of assistance, as our plans for the project are ambitious. AREG is at the beginning of a promising journey, and there is no shortage of tasks to solve.

Even if you can't contribute directly to development, your support is invaluable. Consider starring ⭐ the repo. Increased popularity boosts our chances of attracting more contributors, making the AREG project stronger and more vibrant.

What's your own bug-fixing story? We'd love to hear it!

Tell me your story


Top comments (0)