Software failures are nothing new in embedded, though with the continuing prevalence of fields like the internet of things (IoT) and cyber-physical systems (CPS), they ought to become more concerning. We're talking autonomous cars and robots, drones, and more all connected to the cloud (hello security!). What can possibly go wrong?!
Not worried yet? it might be an eye-opener to check out case studies about well-known historical software accidents and errors. You can find a good list here.
Is there hope to reduce the worry? Read on to find out.
I come from an industry background mostly in the field of automotive embedded electronics. Additionally, later in my career, and also through my education, I got involved with what is called functional safety. If you are not familiar with functional safety, then in simple terms it's about a system being able to deliver predictable or "correct" output based on the provided inputs it is fed. This is all to reduce the risk of mishaps leading to injury or damage to the health of humans.
During my time in the automotive industry, I served as a technical lead on several projects where part of the responsibilities involved dealing with customer returns. Returns would be electronic units/modules that experienced field issues where our team had to determine the root cause for the issue. If the source of the issue turned out to be a mistake in the module, a fix would be proposed and planned to integrate into a future update/release. Through that experience, I and the teams I worked with encountered a fair share of module returns that we had to deal with. At times, we spent days, if not weeks to figure out what had gone wrong to eventually report back to customers what happened and propose a fix.
I recall my first experience when I was still an intern doing module testing. We had several returned modules that were experiencing inconsistent behavior in the field. Some were working correctly, others weren't. After some significant time working with the software team to help debug, we figured out the source of the issue, it turned out to be a single uninitialized variable! At the time, I was surprised that this type of thing can go unchecked. As I grew in the industry, further experiences were only more complex and daunting in nature. At times we would get a single module among thousands that were shipped that experienced strange behavior. Typically the first step would be to try to replicate the behavior so that the source of the issue can be traced with special debug tools. That by itself would be a nightmare at times. It took days if not weeks to replicate certain behavior, and sometimes even longer because there would be a very specific combination of events that needs to happen to replicate the issue. In most cases, the combination of events was a trigger to some sort of unchecked race condition.
The majority of the issues in returns were software-oriented in nature. Interestingly enough this happened even in companies where quality was of paramount importance (many process checks and balances) and where most software engineers were really experienced. Essentially thorough testing, code reviews, code standard implementations (Ex. MISRA-C) would have been already been done on these same modules that experience the field issues.
What I found even more interesting is my experience when moving into functional safety. In functional safety, standards like the industrial IEC-61508 and the automotive ISO-26262 started gaining momentum at a certain point in the automotive industry. The purpose of following the standards was to more or less minimize the risk of failure of a product depending on how or where it was used. In the standards, more checks and balances in the development process were introduced such that a product can be qualified to a certain level of safety.
Looking at the ISO-26262 as an example, the standard provided a categorization for the types of failures that can occur in both software and hardware. Failures in hardware were categorized into two types; random hardware failures and systematic failures. On the software end, there is only one type which is systematic failures. Per ISO-26262, a random hardware failure is defined as:
Failure that can occur unpredictably during the lifetime of a hardware element and that follows a probability distribution.
The standard also adds that:
Random hardware failure rates can be predicted with reasonable accuracy.
Whereas a systematic failure is defined as:
Failure related in a deterministic way to a certain cause, that can only be eliminated by a change of the design or of the manufacturing process, operational procedures, documentation, or other relevant factors.
This essentially means that systematic failures more or less stem from human error which implies that they aren't really predictable. As such, if we look at fixing software issues, it means more added checks and balances (processes) are needed. Certainly, there are also approaches in application software that can be added to have it check itself (Ex. N-version programming), though these approaches are added to dynamic application code which leads to overhead that consumes more memory space. From a static perspective, there were several measures as well. Though it did not mean changing the programming language itself.
One would think that given what I had mentioned earlier, many would welcome such changes. On the contrary, it was the opposite, most engineers were opposed to integrating functional safety processes. As a matter of fact, many of the individuals managing safety development faced the challenge of getting engineers' and teams' buy-in. The push typically had to come from the top of the chain by establishing a "safety culture". I personally think that one part was due to the somewhat significant change in processes and additional documentation/deliverables that were introduced, not necessarily because individuals opposed the idea of safety itself. I guess if there is something an engineer dislikes, then it's writing documentation :)
Another challenge was that often teams had to be convinced that products needed to be created/designed with safety in mind from the start. Meaning, you typically are not supposed to take the path of grabbing an existing product and start applying features to it to make it meet safety requirements. In other words, patching up existing products to meet safety requirements. This was mainly stressed for hardware and application software. Though what was puzzling to me at times is why the same case was not stressed as much from a compiler/programming language perspective. Meaning that the languages used along with their compilers were not designed with safety in mind from the start. Instead, they are modified according to standard guidelines for usage in functional safety applications. This included approaches like creating "subsets" of the language, where what were considered unsafe parts were removed. That meant excluding certain programming language constructs that can be problematic. Which in a way felt like "patching up" the language and its compiler to qualify it for safety usage, opposing the "safety in mind from start" approach.
For those not familiar, the programming language most prevalent in automotive is C. From what I experienced firsthand, and I'm sure many might disagree, I personally don't see C/C++ as languages that can carry us through upcoming application trends. In fact, sticking with C/C++ in the long run really worries me. No matter how many standards are introduced. There is no question that C and C++ are powerful languages, though that power might be at the expense of more systematic failures. Current applications are going to be considered excessively simple compared to what is coming. Additionally, as the industry grows, there will be many software engineers that aren't all at the same level of experience. Thus there will be an increasing possibility for systematic error regardless of how rigid the processes are.
At this point, it begs the question, that with all of the years of experience and best practices, and patches, incorporated in languages like C/C++, if similar types of issues are still occurring due to systematic failures isn’t it time to at least consider the possibility of switching to a different language that was designed with safety in mind? Or even a more modern language for that matter. Again, as indicated by the definition of a systematic failure:
can only be eliminated by a change of the design
I personally found a possible path in Rust for embedded. It seems to fit the missing part of the puzzle that I experienced in my years in the industry. Rust was designed at heart in a way to be a safe language. Sure, switching an industry over to a new programming language is almost never convenient. Obviously, one of the deterrents is the well-established expansive ecosystem of toolchains and libraries with C/C++ in embedded environments like automotive. However, plans can be put in place to switch over gracefully.
In non-embedded environments, Rust has gained quite some popularity and investment from companies including Amazon, Discord, Dropbox, Facebook, Google, and Microsoft. In fact, Microsoft and Google already vouch for Rust helping eliminate up to 70% of their security issues in certain areas. Read here and here more about it.
Regarding embedded, thus far, there has been some interesting movement from certain entities. There is the Rust embedded working group that is working within the community to bridge the gap with the Rust teams and also grow the embedded ecosystem. It is quite impressive how fast the group has been growing the ecosystem. Highlights of the working group achievements can be found here.
Another entity is Ferrous Systems which has been also doing great work in supporting the Rust ecosystem. Ferrous led significant efforts in creating different tools and extensions for Rust and also are working on qualifying a Rust compiler toolchain for ISO26262 under the ferrocene project.
Interestingly enough, at the time of my writing this post, AUTOSAR (AUTomotive Open System ARchitecture) also announced a new working group for Rust in the automotive context. More details are found here.
(Side note: I am not affiliated nor have I been involved with any of the aforementioned entities)
Market direction shows that we’re probably nearing the time to initiate a shift from C/C++ to a safer, more modern, compiled programming language. Maybe not for all applications, but for ones where C/C++ can present themselves as problematic. The Rust programming language although fairly young seems to be the most primed to fit the bill. It probably would be a good strategic decision to jump on the Rust programming wagon for companies and individuals alike to give themselves an edge in the future. For individuals, even if a language like Rust is never adopted, although I'm doubtful that is the case, it will at least give the individual a whole new perspective of how to become a better C/C++ developer. Do you agree/disagree with my views? Share your thoughts in the comments 👇. If you found this useful, make sure you subscribe to the newsletter here to stay informed about new blog posts. Also, make sure to check out our social channels here.