Neven Miculinic

Posted on Mar 28, 2019

Golang race detection

#algorithms

Race conditions are pretty nasty bugs. Hard to trace, reproduce and isolate; often occurring under unusual circumstances and often lead to hard to debug issues. Thus it's best to prevent & detect them early. Thankfully, there's a production-ready algorithm to tackle the challenge - ThreadSanitizer v2, a battle proven library for compiler support in Go, Rust, C & C++.

First, let's display a typical race condition. We have a simple global count variable:

var Cnt int

and we count the number of events happening, whether we're counting HTTP requests, the number of likes or tinder matches is irrelevant. What's relevant it how we do it. We call function Run:

fun Run(amount int) {
  for i := 0; i < amount; i++ {
    Cnt++
  }
}

But that's not all. We're calling it from multiple goroutines:

func main() {
    wg := &sync.WaitGroup{}
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
      defer wg.Done()
            Run(1000)
        }()
    }
  wg.Wait(); 
  fmt.Println(Cnt)
}

And we'd expect our result to be 10000. Yet it's improbably this shall happen on a multicore multithreaded system. You're most likely get results like 5234 one run, 1243 second run, or even 1521 last run without any determinism or repeatability. This is a typical data race!

Let's not stop on this toy example. Instead, let's evaluate some famous real-world race conditions with serious consequences.

Famous examples

DirtyCOW

This is a race condition in the Linux kernel. It involves a tricky memory pages interplay, mmap and madvise system call which allow for privilege escalation exploit. That is, you could mmap root owned file Copy-on-write, which is a valid operation (every write to this mmap-ed region shall not be written to an underlying file), yet under certain conditions write would propagate to underlying file, despite we're an unprivileged user.

Further info

Therac-25

Another famous example is the Therac-25 radiotherapy machine. It had a race condition between machine settings and display settings. If the user typed instructions too quickly and changed them quickly the machine could end up in maximum radiation dosage while displaying false information to the operator. This led to multiple accidents and death cases.
Further info

Definitions

Before continuing let's briefly iterate over required definitions:

Race condition - A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly

Due to general race condition scope, and requiring domain knowledge understanding what is correct behavior, for the rest of this post we're focusing on data race, much more narrow and objective definition.

Data race - Concurrent access to a memory location, one of which is a write

Concurrent access - is one where event ordering isn’t known. There’s no happens before relation

Happens before relation requires some further explaining. Each goroutine has its own logical clock. It is incremented on each operation. Within each goroutine there's strict ordering, event happen sequentially (or at least we observe them as if they happened sequentially, various compiler optimization/Out-of-order execution might interfere). Between goroutines, there's no order unless there's synchronization between them.

In the following image, you can see it visually depicted. happens before relations are highlighted using arrows. I'd like to remind that happens before relation is transitive.

Common mistakes

Let's observe a few common mistakes in go programs. The examples have a similar equivalent in other major languages, it's in Go since I quite like it.

Race on the loop counter

func main() {
    var wg sync.WaitGroup
    wg.Add(5)
    for i := 0; i < 5; i++ {
        go func() {
            fmt.Println(i) // Not the 'i' you are looking for.
            wg.Done()
        }()
    }
    wg.Wait()
}

Try it out. You might get 0 4 4 4 4 as output depending on the event ordering. Certainly not the output you'd expect. The fix is quite simple:

func main() {
    var wg sync.WaitGroup
    wg.Add(5)
    for i := 0; i < 5; i++ {
        go func(j int) {
            fmt.Println(j)
            wg.Done()
        }(i)
    }
    wg.Wait()
}

Unprotected global

This one is similar to the introductory example:

var Cnt int
func Inc(amount int) {
    Cnt += amount
}

func main() {
    go Inc(10)
    go Int(100)
}

Solutions are to sync access to global Cnt variable since increment operations aren't atomic (unless from sync/atomic package).

Thus we use a mutex:

var Cnt int
var m = &sync.Mutex{}
func Inc(amount int) {
    m.Lock()
    defer m.Unlock()
    Cnt += amount
}

or atomic variable:

var Cnt int64
func Inc(amount int64) {
    atomic.AddInt64(&Cnt, amount)
}

Violating go memory model

Go memory model specifies what guarantees you have from the compiler. If you breach that contract, you're in for a bad time. The compiler is free to optimize away your code or do unpredictable things with it.

var a string
var done bool

func setup() {
    a = "hello, world"
    done = true
}
func main() {
    go setup()
    for !done {
    }
    fmt.Print(a)
}

In this example go's memory model doesn't guarantee write to done in one goroutine is visible to other goroutines since there was no synchronization between them. The compiler is also free to optimize away:

for !done {}

into a simpler construct, not loading done variable each iteration. Furthermore, there are no guarantees memory buffers between CPU cores and L1 caches shall be flushed between concurrent reads for the done variable. This is all due to nasty complexity of out-of-order execution and various memory guarantees across architectures. E.g. x86 has stronger memory guaranties on assembly code then ARM architecture.

Synchronization primitives

In Go we have following synchronization primitives:

channels, that is sending and receiving is a synchronization point
sync package
- Mutex
- atomics

For further details look those up, they are not a concept unique to go nor race detection. The important point is they bring happens before ordering into the program code.

Detecting race conditions

In Go, this is simply done with -race compiler flag. We can even run our tests with it and fail on any race condition:

go install -race
go build -race
go test -race

As said in the beginning it uses ThreadSanitizer library under the hood. You should expect runtime slowdown of 2-20x and increased memory consumption 5-10x. Other requirements are CGO enabled and 64bit operating system. It detects race conditions reliably, without false positives. During its usage is uncovered 1000+bugs in chrome, 100+ in go's standard library and more in other projects.

What -race flag does is instrument your code with additional instructions. For example, this simple function:

func Inc(x *int) {
*x++
}

Get's compiled into:

movq "".x+8(SP), AX
incq (AX)

However upon turning on -race you get bunch of assembly code, interesting bits being:

...
call runtime.raceread(SB)
...
call runtime.racewrite(SB)
...

That is compiler adds read and write barriers for each concurrently reachable memory location. The compiler is smart enough not to instrument local variables since this incurs a quite performance penalty.

Algorithm

First the bad news:

Determining if an arbitrary program contains potential data races is NP-hard.

Netzer&Miller 1990

Therefore our algorithm shall have some tradeoffs. ˍ

First how & when do we collect our data. Our choices are either dynamic or static analysis. The main static analysis drawback is the time required for proper source code annotation. This dynamic approach is used since it requires no additional programmer intervention except turning it on. Data could be collected on the fly or dumped somewhere for post mortem analysis. ThreadSanitizer uses on the fly approach due to performance consideration. Otherwise, we could pile up huge amounts of unnecessary data.

There are multiple approaches for dynamic race detection, pure happens before based, lockset based and hybrid models. ThreadSanitizer uses pure happens before. Now we'll go over 3 algorithms, each pure happens before dynamic race detection, each improving upon the last. We'll see how ThreadSanitizer evolved and understand how it works from its humble origins:

Vector Clocks

First, let's explain the concept of vector clocks. Instead of each goroutine remembering only its logical clock, we remember the logical clock of the last time we hear from another goroutine.

Vector clocks are partially ordered set, if two events have strictly greater or less than relation between them there's happens before relation between them. Otherwise, they are concurrent.

DJIT+

DJIT+ is an application of vector clocks for pure happens before data race detector.

We remember vector clocks for:

Each lock $m$ release $$L_m= (t_0, \ldots , t_n)$$
Last memory read on location x $$R_x = (t_0,\ldots, t_n)$$
Last memory write on location x $$W_x = (t_0, \ldots, t_n)$$
Goroutine vector clock, $$C_x = (t_0, \ldots, t_n)$$

Let's see an example where there are no races:

Each row represents a single event as our race detector sees it. First, we write to location $x$ from goroutine 0, and it's remembered in $W_x$. Afterward, we release the lock $m$ and goroutine 0 field is updated in $L_m$, that is our lock release vector clock. By acquiring the lock, we max by elements vectors clocks of $L_m$ and $C_1$, because we know lock's lock happens after lock's release. We perform the write-in goroutine 1 and check whether our own vector clock is concurrent to $W_x$ and $R_x$. It's strictly ordered, thus everything is good.

And now the same example, but without goroutine synchronization. There are no lock's lock and release. When comparing goroutine 1 vector clock, $(0, 8)$ is concurrent to the last write $W_x = (4,0)$. Thus we have detected the data race. This is the most important concept to understand, rest is optimizations.

FastTrack

The Fast track introduces the first optimization. For full details I recommend reading original paper, it's quite well written!

We observe the following property. If there are no data races on writes, each write is sequential. That is, it's sufficient to remember last write's goroutine and logical clock for that goroutine. Thus we create the shadow word, representing last memory write, logical clock and goroutine id. Format <logical clock>@<goroutine id>

For reads, it's a bit different story. It's perfectly valid having multiple concurrent reads as long as reads and write and strictly ordered. Thus we utilize the shadow word read representation as long as a single goroutine reads the data, and fallback expensive full vector clock in concurrent read scenario:

ThreadSanitizer v2

Thread sanitizer further improved on the FastTrack. Instead of having separate $R_x$ and $W_x$, we keep a fixed sized shadow word pool for each 8-byte quadword. That is, we approximate the full vector clock with partial shadow words. Upon full shadow word pool we randomly evict an entry.

By introducing this trade-off, we trace false negatives for speed and performance. Our fixed shadow word pool is memory mapped, thus allowing cheap access with additions and byte shifts compared to more expensive hashmaps or variable sized array access.

Let's go over an example. Each shadow word is 8 bytes wide and consists of goroutine logical clock, goroutine id, write/read flag and position in the 8byte word we're reading/writing.

Everything's good so far. Let's make another operation.

Now let's introduce a data race. The goroutine 3 vector clock for goroutine 1 is 5, that is smaller than 10@1 shadow word written in the pool. Thus we're certain a data race happened. (We assume events for each goroutine arrive in order, otherwise, we couldn't be sure whether 10@1 entry for goroutine 3. is bigger or smaller than the current vector clock's entry for goroutine 3. This is enforced by algorithm design and implementation.)

Summary

To summarize this article, we covered the basics of logical and vector clocks, how they are used in happens before data race detector and we build it up to the ThreadSanitizer v2 which is used in go's -race.

We observed the tradeoffs in the algorithm design, it traded higher false negative rate for speed, however, it forbids false positive. This property builds trust, and with that trust, we know certainly, we have a race if it screams at us, not some weird edge case in the algorithm. No flaky race tests, only true positives are reported.

Though, keep in my this is only data race detector; it's easy to circumvent it. For example, the following code shall pass the data race detector despite being horribly wrong in the concurrent setting:

var Cnt int64

func Run(amount int) {
    for i := 0; i < amount; i++ {
        val := atomic.LoadInt64(&Cnt)
        val ++  // incrementing Cnt isn't atomic, we just load and store the value atomically.
        atomic.StoreInt64(&Cnt, val)
    }
}

References

Slides
https://golang.org/doc/articles/race_detector.html
Race detector options
https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm
The Go memory model
""go test -race" Under the Hood" by Kavya Joshi talk
https://blog.golang.org/race-detector
https://danluu.com/concurrency-bugs/
ThreadSanitizer – data race detection in practice google paper
FastTrack: Efficient and Precise Dynamic Race Detection
AddressSanitizer/ThreadSanitizer for Linux Kernel and userspace.
DJIT+ algo MultiRace: efficient on-the-fly data race detection in multithreaded C++ programs
On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions (Netzer, Miller, 1990)

DEV Community