Testing Firmware is not obvious: production monitoring

#firmware #qualityassurance #production #c

This post shares part of a production monitoring process that I originally published on Medium at https://medium.com/equisense/quality-assurance-for-firmware-production-monitoring-68cd5fcf038d

You can test your software as much as you want: the product you are working on may work great at the office, but it will undoubtedly run into plenty of unexpected issues in the hands of end users. Firmware is hard to debug in its real environment, and developers have to admit that they tend to overlook issues occurring in the field for the sake of convenience 🙄. The few solutions I want to share in this post made me realize how many flaws our trackers had in production… I now regret not having implemented these ideas sooner.

Keep it light

First, to catch bugs in production, I wanted to log usage data. Logging any type of data on an embedded device is obviously harder than on a personal computer, as there is barely enough space on the target to store the program itself. Plus, the wireless link used to pass the information along, Bluetooth Low Energy for example, has limited throughput. So imagine having to keep many human-readable sentences in the firmware, to be sent wirelessly to the remote peer: that’s way too heavy.

With that in mind, I implemented a lightweight solution: each log has to fit in 18 bytes maximum (BLE 4.0 allows 20 usable bytes per packet). First, an error code uses one byte. Then, a short human-readable string gives more information about the error: its origin, its causes, or whatever you need to understand the issue. For example, here are some error codes for storage issues:

#define WARN_STORAGE_CORRUPT        (WARN_STORAGE + 0)
#define WARN_STORAGE_LOW            (WARN_STORAGE + 1)
#define WARN_STORAGE_FAILURE        (WARN_STORAGE + 2)

Along with the WARN_STORAGE_CORRUPT code, I could pass the type of the corrupted data or the page number affected by that bug.
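
To make the size budget concrete, here is a minimal sketch of how such a log could be packed, assuming the 18 bytes are split into one code byte and up to 17 bytes of text. The warning_t type, the warning_pack function and the WARN_STORAGE base value are hypothetical names and values chosen for illustration; they are not the ones used in the actual firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WARN_MAX_LEN   18u                  /* maximum log size stated above */
#define WARN_INFO_LEN  (WARN_MAX_LEN - 1u)  /* bytes left after the 1-byte code */

/* Hypothetical base value: the real one is not given in this post. */
#define WARN_STORAGE          0x20u
#define WARN_STORAGE_CORRUPT  (WARN_STORAGE + 0)

typedef struct {
    uint8_t code;                 /* one-byte error code */
    char    info[WARN_INFO_LEN];  /* short text, not necessarily NUL-terminated */
} warning_t;

/* Fill `w` and return the number of bytes to transmit (1 to 18).
 * Text longer than 17 bytes is truncated. */
uint8_t warning_pack(warning_t *w, uint8_t code, const char *info)
{
    assert(w != NULL && info != NULL);

    size_t len = strlen(info);
    if (len > WARN_INFO_LEN) {
        len = WARN_INFO_LEN;
    }

    memset(w->info, 0, sizeof w->info);
    memcpy(w->info, info, len);
    w->code = code;

    return (uint8_t) (1u + len);
}
```

With this layout, warning_pack(&w, WARN_STORAGE_CORRUPT, "page 12") produces an 8-byte message: the code followed by the seven characters of the text.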

All warnings are sent to the remote over Bluetooth as they happen, or queued for later transmission if the device is not connected. Then, the remote app sends that data to our database, along with the user email address, phone model, firmware version, etc., for further analysis… (keep reading 😉)
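
The queueing part can be sketched as a small ring buffer. This is only an illustration under my own assumptions: the warn_queue_t type, the queue depth of 8 and the choice to drop the oldest warning when the queue is full are hypothetical, not a description of the actual firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WARN_MAX_LEN    18u  /* size of one warning, as described above */
#define WARN_QUEUE_SIZE 8u   /* hypothetical queue depth */

typedef struct {
    uint8_t data[WARN_QUEUE_SIZE][WARN_MAX_LEN];
    uint8_t len[WARN_QUEUE_SIZE];
    uint8_t head;   /* index of the oldest entry */
    uint8_t count;  /* number of stored entries */
} warn_queue_t;

/* Store a warning while disconnected. When the queue is full, the oldest
 * entry is dropped. Returns false if `len` is out of range. */
bool warn_queue_push(warn_queue_t *q, const uint8_t *buf, uint8_t len)
{
    assert(q != NULL && buf != NULL);

    if (len == 0u || len > WARN_MAX_LEN) {
        return false;
    }
    if (q->count == WARN_QUEUE_SIZE) {
        q->head = (uint8_t) ((q->head + 1u) % WARN_QUEUE_SIZE);
        q->count--;
    }

    uint8_t tail = (uint8_t) ((q->head + q->count) % WARN_QUEUE_SIZE);
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    return true;
}

/* Pop the oldest warning into `out` (at least WARN_MAX_LEN bytes).
 * Returns its length, or 0 when the queue is empty. */
uint8_t warn_queue_pop(warn_queue_t *q, uint8_t *out)
{
    assert(q != NULL && out != NULL);

    if (q->count == 0u) {
        return 0u;
    }

    uint8_t len = q->len[q->head];
    memcpy(out, q->data[q->head], len);
    q->head = (uint8_t) ((q->head + 1u) % WARN_QUEUE_SIZE);
    q->count--;
    return len;
}
```

Once the connection is established, the firmware would call warn_queue_pop in a loop and send each entry until it returns 0. Dropping the oldest entry keeps the most recent context; keeping the oldest instead would be an equally defensible choice.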

Best use case: catching failed assertions

So now that I am able to send warnings, I have to figure out what to send.

My code is populated with assertions that are verified here and there (hopefully yours is too). When an assertion fails while running in the “debug” configuration, I have a handler that can log the file, line and error code to the serial output or RTT (see Segger RTT). Here is an example from the Flash driver, which is not able to write chunks bigger than FLASH_PAGE_SIZE (512); the assertion sits at line 123 of the file:

#define FLASH_PAGE_SIZE 512

void flash_write_page(uint32_t addr, uint8_t* data, uint32_t length)
{
    APP_ERROR_CHECK_BOOL(length <= FLASH_PAGE_SIZE);
    [...]
}

If I call flash_write_page with a length greater than 512 bytes, the serial log prints the line below and the device resets:

code: 0x0, line: 123, file: src/flash.c

This is very useful when debugging.

Obviously, I wanted to have that same feature on released firmware. Storing and sending the full file name in the 18 available bytes was too heavy, so I decided to build a lookup table mapping each source file name to a 4-byte hash generated at compile time, which can be used in the warning message. For each compilation unit (.o file), I generate a hash that is compiled into the unit and appended to a CSV file, FILENAME_HASHTABLE_OUTPUT. Here is the interesting part of the Makefile:

FILE_CHKSUM = $(word 1,$(shell echo $(1) | cksum))
FILE_CHKSUM_HEX = $(shell echo "obase=16; $(call FILE_CHKSUM,$(1))" | bc)
# $1 command
# $2 flags
# $3 message
define run
$(info $(call PROGRESS,$(3) file: $(notdir $($@)))) \
$(NO_ECHO)$(1) -MP -MD -c -o $@ $(call get_path,$($@)) $(2) $(INC_PATHS) -DFILE_CHKSUM='((uint32_t) 0x$(call FILE_CHKSUM_HEX,$@))'
endef

# Create object files from C source files
# Append the filename checksum to the CSV file if it is not already listed
%.c.o:
	$(call run,$(CC) -std=c99,$(CFLAGS),Compiling)
	@grep -s -q -F "0x$(call FILE_CHKSUM_HEX,$@),$@" ${FILENAME_HASHTABLE_OUTPUT} || echo "0x$(call FILE_CHKSUM_HEX,$@),$@" >> ${FILENAME_HASHTABLE_OUTPUT}
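
On the C side, the FILE_CHKSUM define can then replace the file name in a release-mode assertion. The sketch below is only one possible way to do it: RELEASE_ASSERT, release_assert_handler, assert_info_t and the 7-byte little-endian payload layout are hypothetical names and choices of mine, and the fallback value of FILE_CHKSUM is a placeholder so that the snippet compiles without the Makefile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Normally injected per compilation unit by the Makefile (-DFILE_CHKSUM=...).
 * The fallback below is a placeholder so this sketch compiles on its own. */
#ifndef FILE_CHKSUM
#define FILE_CHKSUM ((uint32_t) 0xDEADBEEFu)
#endif

#define ASSERT_PAYLOAD_LEN 7u  /* code (1) + file hash (4) + line (2) */

typedef struct {
    uint32_t file_hash;  /* 4-byte hash standing in for the file name */
    uint16_t line;       /* line of the failed assertion */
    uint8_t  code;       /* error code, 0 for a plain boolean check */
} assert_info_t;

/* Last failed assertion recorded by the handler. */
assert_info_t g_last_assert;

/* Hypothetical handler: records the failure. Real firmware would
 * typically reset the device afterwards. */
void release_assert_handler(uint32_t file_hash, uint16_t line, uint8_t code)
{
    g_last_assert.file_hash = file_hash;
    g_last_assert.line = line;
    g_last_assert.code = code;
}

#define RELEASE_ASSERT(cond)                                              \
    do {                                                                  \
        if (!(cond)) {                                                    \
            release_assert_handler(FILE_CHKSUM, (uint16_t) __LINE__, 0u); \
        }                                                                 \
    } while (0)

/* Serialize the record (little-endian) into a 7-byte warning payload. */
void assert_info_encode(const assert_info_t *info, uint8_t out[ASSERT_PAYLOAD_LEN])
{
    assert(info != NULL && out != NULL);  /* host-side sanity check only */

    out[0] = info->code;
    out[1] = (uint8_t) (info->file_hash);
    out[2] = (uint8_t) (info->file_hash >> 8);
    out[3] = (uint8_t) (info->file_hash >> 16);
    out[4] = (uint8_t) (info->file_hash >> 24);
    out[5] = (uint8_t) (info->line);
    out[6] = (uint8_t) (info->line >> 8);
}
```

Seven bytes fit comfortably in the 18-byte budget, and the receiving side can turn the hash back into a file name by looking it up in the generated CSV file.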

Another concern I had to deal with is that when an assertion fails, the device resets, meaning the warning message is not sent. So I set up a RAM region that is not initialized at startup. Its content is kept across resets, so I can store several variables in it, along with a CRC to ensure data integrity. I use that region to store failed assertion values (file hash, line and code). After the reset, I can now send the error to the remote peer once connected.
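
Here is a minimal sketch of such a retained region. It assumes a GCC toolchain and a linker script that provides a .noinit section left untouched by the startup code; the retained_t layout, the function names and the choice of CRC-32 are hypothetical, since the only requirements stated above are that the data survives a reset and is protected by a CRC. On a host build the variable is an ordinary global so that the logic can be tried out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t file_hash;  /* 4-byte hash of the source file name */
    uint16_t line;       /* line of the failed assertion */
    uint8_t  code;       /* error code */
    uint8_t  pending;    /* 1 while the record still has to be sent */
    uint32_t crc;        /* CRC-32 of all the fields above */
} retained_t;

#if defined(__GNUC__) && defined(__arm__)
/* Assumption: the linker script defines a .noinit section that the
 * startup code neither zeroes nor overwrites. */
__attribute__((section(".noinit"))) retained_t g_retained;
#else
retained_t g_retained;  /* host build: ordinary global */
#endif

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t retained_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

static uint32_t retained_compute_crc(void)
{
    return retained_crc32((const uint8_t *) &g_retained, offsetof(retained_t, crc));
}

/* Called from the assertion handler, just before the reset. */
void retained_store(uint32_t file_hash, uint16_t line, uint8_t code)
{
    g_retained.file_hash = file_hash;
    g_retained.line = line;
    g_retained.code = code;
    g_retained.pending = 1u;
    g_retained.crc = retained_compute_crc();
}

/* Called after boot. Copies a valid pending record to `out` and clears it.
 * Returns false on a CRC mismatch (e.g. random RAM content after a cold
 * boot) or when nothing is pending. */
bool retained_take(retained_t *out)
{
    assert(out != NULL);

    if (g_retained.crc != retained_compute_crc() || g_retained.pending != 1u) {
        return false;
    }

    *out = g_retained;
    g_retained.pending = 0u;
    g_retained.crc = retained_compute_crc();
    return true;
}
```

The CRC is what distinguishes a record written before a reset from whatever the RAM happens to contain after a power cycle, so the firmware never reports a failed assertion that did not happen.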

Using the RAM region, I can also track any HardFault error or watchdog timeout and send the Program Counter or the task that was being executed when the fatal error happened 🐛.

This feature is really useful to see which critical issues are actually happening on released firmware. A few days after the feature was released, I had many entries of error codes and descriptions in the database. That was already very useful, but as the amount of information grew larger and larger, I quickly realized that I needed to build tools around that giant table.

Code quality over time

The next step was to set up a dashboard to track code quality over time. Every day, I can now check which errors occurred the most, along with the number of people affected, the firmware version they are running, etc. For your information, all the warnings are sent to BigQuery and then linked to Data Studio. These tools are pretty handy and fully cover my needs. I can share my dashboards with my workmates and track user bugs more easily by adding some filters and displaying beautiful graphs 🤩.

I have to say that I discovered new defects that even customers had never noticed. To my mind, assertions were only used while debugging, and I didn’t expect so many crashes to be caused by failed assertions, for example:

The plotted values above are absolute and don’t take into account the number of trackers with the latest firmware version installed (the one that keeps track of errors), nor the total usage duration. But still, it is really helpful to see which parts are failing. I implemented a few stronger indicators to extract valuable insights about the code quality. One of the new KPIs is the number of warnings per recorded session, for example. I also added several levels of criticality for each bug to separate harmless warnings from critical ones. It’s now getting really interesting to assess Firmware quality 📈.

Even if integration tests have the advantage of finding bugs before a production release, they would have been much harder to implement on my own, and I now have a great understanding of the glitches occurring on our trackers. Today, I feel that these simple QA steps have the best return on investment. So far, I haven’t found an easy way to implement integration tests involving Bluetooth commands and low-level drivers. There are tools like automated tests for nRF-Connect for example, but I think the best way would be to use our mobile application to record training sessions in a loop.

Now, I'd like to know how you handle QA for Firmware in your projects 😃