2012-10-20

What is a "data race" and when is a race not a data race?

I've been meaning to write about this since my post series about Rust (in particular here, where I wrote "while data races are no longer possible, race conditions in general still are" about the RWARC library). In general, Rust statically guarantees freedom from data races, though not freedom from all races. But what does that mean?

A data race is when multiple threads concurrently access the same memory location where at least one access is a write. "Concurrently" here could mean either literally at the same time (threads run on different CPUs) or abstractly at the same time (threads interleave with each other on the same CPU); i.e., no synchronisation primitive enforces that one thread's access completes before the other begins.

    Thread 1                        Thread 2
    if (p != NULL)
                                    p = NULL;
    output(p->data);

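To make that concrete, here's a minimal C/pthreads sketch of the same interleaving. The global p and the output() helper are placeholder names of my own, and whether the crash actually happens on a given run depends entirely on how the threads get scheduled:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int data; };
    struct node *p;  /* shared, with no synchronisation at all */

    void output(int x) { printf("%d\n", x); }

    void *thread1(void *arg) {
        if (p != NULL)        /* check... */
            output(p->data);  /* ...then use: p may have been nulled in between */
        return NULL;
    }

    void *thread2(void *arg) {
        p = NULL;             /* concurrent write: this is the data race */
        return NULL;
    }

    int main(void) {
        p = malloc(sizeof *p);
        p->data = 42;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

A detector like Helgrind will typically flag the unsynchronised accesses to p here whether or not the unlucky interleaving ever actually occurs on that run.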

Data race detectors, such as Eraser and Helgrind, analyse threads' mutual-exclusion and happens-before relationships to identify unsafe concurrent accesses like these. But it's possible to stop accesses from being concurrent without enforcing correct behaviour:

    Thread 1                        Thread 2
    mutex_lock(m);
    bool ok = p != NULL;
    mutex_unlock(m);
                                    mutex_lock(m);
                                    p = NULL;
                                    mutex_unlock(m);
    mutex_lock(m);
    if (ok) output(p->data);
    mutex_unlock(m);


Now the data race is gone, but the bug has simply become a higher-level race condition. Most literature calls this an "atomicity violation" (and some literature even uses "race" to mean exclusively data races).

You might think this code looks silly, but if you're working in a project with many layers of abstraction and function/module boundaries, this kind of mistake can be all too easy to make, and data race detectors are powerless to find it.
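
For contrast, the correct fix is to widen the critical section so that the check and the use are atomic with respect to the write. Continuing the C sketch from above (the lock m is another placeholder of mine):

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Thread 1: the lock now covers both the check and the use, so the
     * write can no longer interleave between them. */
    void *thread1_fixed(void *arg) {
        pthread_mutex_lock(&m);
        if (p != NULL)
            output(p->data);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    /* Thread 2: the write takes the same lock. */
    void *thread2_fixed(void *arg) {
        pthread_mutex_lock(&m);
        p = NULL;
        pthread_mutex_unlock(&m);
        return NULL;
    }

The unit of atomicity has to match the unit of reasoning: checking p and dereferencing p form one logical operation, so they belong inside one critical section.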

Consider this real-world example. When I started at Mozilla this summer, Rust 0.2 had recently shipped, and its release notes mentioned that it was "Helgrind-clean" (meaning no data races existed). Yet the Rust runtime contained this code:

bool rust_task::blocked() {
    mutex_lock(this->lifecycle_lock);
    bool is_blocked = this->state == task_state_blocked;
    mutex_unlock(this->lifecycle_lock);
    return is_blocked;
}

Sure, accessing the field was safely protected by the mutex, but once it dropped the lock and returned, all bets were off as to whether the returned value was still accurate or not. (I fixed several bugs related to this, and removed this function entirely.)
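
One way to repair an interface like blocked() is to stop handing out snapshots altogether and instead run the caller's state-dependent action while the lock is still held (which is also the shape the RWARC interface from my Rust series takes). Here's a hypothetical C sketch of that pattern, not the actual runtime code:

    #include <pthread.h>

    /* Hypothetical stand-ins for the runtime's task state. */
    enum task_state { task_state_running, task_state_blocked };

    struct task {
        pthread_mutex_t lifecycle_lock;
        enum task_state state;
    };

    /* Instead of returning a possibly-stale boolean, run the caller's action
     * while the lock is held, so the state cannot change underneath it. */
    void if_blocked(struct task *t, void (*action)(struct task *)) {
        pthread_mutex_lock(&t->lifecycle_lock);
        if (t->state == task_state_blocked)
            action(t);
        pthread_mutex_unlock(&t->lifecycle_lock);
    }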

In a similar vein, Rust's type system guarantees that concurrent tasks cannot share state but instead must use message-passing to communicate, which precludes the possibility of data races completely by enforcing happens-before relationships on all data accesses (or, in the case of the RWARC library mentioned above, by enforcing mutual-exclusion relationships). Yet it's still possible to write nondeterministic programs in Rust (using select2, failure propagation, etc.), and so race conditions are still possible.

The moral of this story is that data races are only one of many types of races. Many tools exist for finding them, but just because a tool guarantees the absence of data races does not mean your code is completely concurrency-safe. That's not to say these tools aren't useful, but they often fail where more sophisticated race-finding techniques could succeed, and even then, no automated race-finding tool can substitute for a careful human brain when reasoning about concurrency.

What exactly is a "race condition", anyway?

A friend of mine is taking the operating systems class at UMD, in which the second project is to implement inter-process signals. He noted a peculiarity in the specification: processes are not woken up immediately if they receive a signal while blocked (e.g. waiting on child processes, or on keyboard/disk input). As a result, it could be completely random whether a process receiving a signal gets killed immediately or sleeps forever.

He discussed this with the professor, and they disagreed over whether this nondeterminism constituted a "race condition" or not. After all, the specification allows for signals to fail to wake up processes under certain circumstances, so there's nothing wrong with implementing it that way. On the other hand, a kernel whose signalling mechanism always wakes up processes in bounded time (i.e., finitely long -- whereas waiting for keyboard input could take forever) could provide stronger guarantees about inter-process communication.

In my interpretation, neither argument tells the entire story. For starters, race conditions don't necessarily entail wrong behaviour; I've seen plenty of "benign" race conditions with comments along the lines of "if X and Y race, Z will happen, and this is OK". Benign races aside, though, "race condition" to me means "unexpected behaviour occurs nondeterministically". So, if you want to be precise, it's important to talk about race conditions with respect to certain expectations.

Someone writing a userspace program for this kernel who didn't realise that signals might never get taken (and hence produced code that sometimes accidentally sleeps forever) could say they were bitten by a race in the kernel. But if they'd read the spec carefully, they might've written code that handled the nondeterminism more robustly. They could say the spec's nondeterminism made it less useful than other possible specs, but it wouldn't be fair to blame the particular implementation of this spec for being buggy.

In short, I would say the specification itself has a race condition in it, but implementations thereof don't. What's important is who holds the expectations and who nondeterministically breaks them.

2012-09-26

Rust (0): Index and Conclusion

This four-post series on Rust is intended to introduce you to the language, to teach you about Rust's cool language features, and to give a debriefing of what I contributed to it this summer.

These posts are targeted at an audience with some knowledge of programming language design principles. You should be lightly familiar both with systems programming languages such as C++ and with functional languages such as Haskell or ML, and preferably strongly skilled in at least one of the two domains.

Do feel free to skip ahead, if you're already familiar with parts of the language, or to bail out early, if you're not interested in an involved tour of concurrency primitives. All the same, I hope you get something out of some or all of these posts.
  1. Primer - an introduction to the language's syntax, memory model, and concurrency model
  2. Linked Task Failure - advanced parallel programming and error handling with tasks (my first project)
  3. Typesafe Shared State - an overview of the region system and a parallelism library that makes heavy use of it
  4. Typesafe Shared Mutable State - using trickery with Rust's type system to achieve a completely safe interface for common concurrency idioms (my second project)
I'd like to close with an argument for why I think Rust is the "language of the future" for systems programming.
  • Rust's strong static type system relieves programmers of many types of errors they should never have to worry about. NULL pointer crashes, memory management errors, surprising implicit type coercions, and dynamic cast exceptions don't exist anymore. Meanwhile, features like closures and higher-order functions (missing in C++ until very recent versions), algebraic datatypes and parametric polymorphism (both missing in Go), and traits (existential types; a combination of Haskell-style typeclasses and OO-style interfaces) allow you to concisely express ideas that would otherwise involve a lot of legwork in certain "conventional" languages.
  • Unlike other functional languages, however, Rust has a heavy focus on performance as well. Stack-allocated data often lets you avoid dynamic allocation overhead and garbage collection (even closures can sometimes live entirely on the stack). The region system and borrow checker allow for type- and memory-safe aliasing of arbitrary data with no runtime overhead. Explicit copyability as part of the type system lets you be aware of when expensive copies might occur.
  • Finally (and this is the big one, for me), Rust's type system includes a concurrency-aware memory model. Forbidding unprotected shared state and using message-passing over pipes as the main communication mechanism means programmers no longer have to worry about data races, and is also friendly to massively-parallel applications where cache-line contention is a serious worry. The use of noncopyable types means the message-passing library can safely assume all communication will be one-to-one, which allows for a blazing fast implementation under the hood. Noncopyable types also give other strong guarantees, such as the safety of ARCs and the fact that two tasks cannot deadlock when communicating over a single pipe.
Hopefully I've gotten you excited about using Rust for safe + performant parallel programming (or about using it several months from now, when its features and syntax are more stable). And to the Rust community: Thanks, it's been a blast.


2012-09-25

Rust (4): Typesafe Shared Mutable State

This post is a continuation of shared immutable state. Before I introduce how we do safe shared mutable state, I'll take a moment to show why unprotected shared mutable state is dangerous.

Dangers of Shared State


If you're a functional programmer, you're probably used to a language in which nested data structures are allocated in several heap cells, each of which is garbage-collected, so multiple users can freely alias into the same data, implicitly copy to make changes, and so on.

Rust's approach is somewhat different: it focuses on stack-allocation, avoiding expensive implicit copies, and predictable performance. In fact, heap-allocation only occurs when you write the @ or ~ sigil; and, absent @-pointers, Rust's representation semantics don't involve garbage collection at all. Instead:
  1. Data types are represented with interior types, meaning data types are embedded directly within one another rather than using pointer indirection. You can, of course, create borrowed pointers to such types and pass them between functions.
  2. Stack-allocated and ~-allocated values are owned data, which get eagerly freed/deinitialised immediately upon going out of scope or being overwritten.
  3. Rustic data structures can have in-place mutability, indicated with the mut keyword. While also supported by many other functional languages, in Rust it presents new difficulties with aliasing pointers because of point #2 above.

With such a C/C++-like representation model, the prospect of sharing mutable state among multiple actors is a lot more dangerous. To show why, let's say we added a data-race-enabling function to ARC's interface:

    fn get_mut<T: Const Send>(arc: &a/ARC<T>) -> &a/mut T

Then we can commit badness like:

    let arc: ARC<Option<~int>> = ARC(Some(~31337));
    let arc2 = clone(&arc);
    do task::spawn |move arc2| {
        // Might print "Some(~31337)". Might print "None". Might segfault.
        io::println(fmt!("%?", *get(&arc2)));
    }
    // Frees and deinitialises the owned pointer inside the ARC.
    *get_mut(&arc) = None;
    // (But what if this runs after the other task determines the data
    //  is Some, but before it dereferences the contained pointer??)

With sufficient cleverness, this can even be harnessed to implement arbitrary type coercion. (See my solution here.)

Reader-Writer ARCs


The ARC already existed when I arrived at Mozilla, but there was no similarly safe solution for mutable state. I created the RWARC, with a reader-writer lock inside, to fill this gap.

You create them just like you create ARCs:

    fn RWARC<T: Const Send>(data: T) -> RWARC<T>
    fn clone<T: Const Send>(arc: &RWARC<T>) -> RWARC<T>

But when using them, instead of getting an unlimited-use reference to the data inside, you give the interface a closure to run on the data, and it runs the closure for you with the rwlock held in the correct mode.

    fn read <T: Const Send>(arc: &RWARC<T>, blk: fn(&T))
    fn write<T: Const Send>(arc: &RWARC<T>, blk: fn(&mut T))

The key difference is that the region associated with the data pointer is the region of the closure, rather than some arbitrary region defined by the caller. This allows read() and write() to enforce that the contained reader-writer lock is always held in the correct mode when references to the data exist.

Now we can fix the example from before.

    let arc = RWARC(Some(~31337));
    for 5.times {
        let arc2 = clone(&arc);
        do task::spawn |move arc2| {
            do read(&arc2) |state: &Option<~int>| {
                // Long-running reads on state still happen in parallel.
                io::println(fmt!("%?", *state));
            }
        }
    }
    do write(&arc) |state: &mut Option<~int>| {
        // Exclusive write access. No other aliases to state can exist concurrently.
        *state = None;
    }

Note that while data races are no longer possible, race conditions in general still are. (I mentioned earlier that shared mutable state introduces nondeterminism.) Here, anywhere between zero and five "None"s will be printed.

The compiler will, of course, reject code that tries to cheat the interface:

    let escaped_state;
    do write(&arc) |state| {
        escaped_state = state; // ERROR: reference not valid outside of its lifetime
    }

A brief informal justification of safety:
  • The Const restriction still enforces that readers only see deeply immutable state. Also, even with mutable state, it still prevents cycles from being created, because the RWARC itself does not have the Const kind.
  • References to the shared state cannot escape the closure called by read() or write(). In effect, the region system statically enforces that the lock must be held in order to access the state.

The Concurrency Primitives You Know and Love


Condition Variables

The RWARC also comes with some other features to remind you of home (if "home" to you means old C-style concurrency primitives you fought off race conditions with back in the day). We have condition variables:

    fn write_cond<T: Const Send>(arc: &RWARC<T>, blk: fn(&mut T, &Condvar))

    fn wait(cond: &Condvar)
    fn signal(cond: &Condvar) -> bool
    fn broadcast(cond: &Condvar) -> uint

These work as you might expect. Like the &mut T reference, the Condvar reference can only be used inside the closure (i.e., while the lock is held).

    let arc = RWARC(~[]);
    let arc2 = clone(&arc);
    do task::spawn |move arc2| {
        do write_cond(&arc2) |state,cond| {
            // Poor man's message-passing. Of course, pipes are much
            // faster; rwarcs and condvars are built on top of pipes.
            vec::push(state, ~"hello there!");
            signal(cond);
        }
    }
    do write_cond(&arc) |state,cond| {
        while state.len() == 0 {
            wait(cond);
        }
        io::println(vec::pop(state));
    }

(The more seasoned concurrency hackers among you might now be wondering: what if you want to associate multiple conditions with the same state? That can be done too -- the gritty details are in the docs.)


Downgrade (or, Now You're Just Showing Off with the Region System)

(Do feel free to zone out for this section.)

If you're used to being able to atomically "downgrade" write access into read access without letting other writers through in the meantime, you can do that here too. (I'm presenting this feature mostly just to show off more stuff you can do by combining the region system with noncopyable types.)

    // Calls a closure which will write, then downgrade, then read.
    fn write_downgrade<T: Const Send>(arc: &RWARC<T>, blk: fn(RWWriteMode/&a<T>))
    // Converts a "write permission" token to a "read permission" token.
    fn downgrade<T: Const Send>(token: RWWriteMode/&a<T>) -> RWReadMode/&a<T>

    fn write<T: Const Send>(token: &RWWriteMode<T>, blk: fn(&mut T))
    fn read <T: Const Send>(token: &RWReadMode <T>, blk: fn(&T))

Here, the RWWriteMode and RWReadMode are noncopyable "permission tokens" that allow the user to write or read, and downgrade() is a function that consumes the write token and wakes up any readers waiting on the rwlock. Since the tokens are noncopyable, the caller cannot still have write permissions after calling downgrade() (which would, of course, result in data races).

The "RWWriteMode/&a" syntax indicates an opaque data structure with region pointers inside. While the write mode token is passed by ownership (so that it can in turn be surrendered to downgrade()), its scope is still constrained by the associated region, which means it can't escape from the closure passed to write_downgrade(). And downgrade() converts a write mode token to a read mode token with the same region, so the latter can't escape either.

Complex as the above functions may seem, using the interface simply looks like this:

    do write_downgrade(&arc) |token| {
        do write(&token) |mutable_state| {
            ...
        }
        let token = downgrade(move token);
        do read(&token) |immutable_state| {
            ...
        }
    }


Unwrap

Finally, RWARCs (ARCs too) also now have a mechanism to get your data back out again.

    fn unwrap<T: Const Send>(arc: RWARC<T>) -> T

Of course, it wouldn't be valid to reclaim ownership of the data while other tasks might still have aliases to it. Instead, unwrap() blocks the calling task until its reference is the only reference alive, and then takes ownership of the data instead of freeing it. (To avoid deadlock, subsequent callers to unwrap() on the same ARC immediately fail.)

This adds expressivity in two ways: it relieves you from having to deeply-copy the shared data if you need to own it (which would be extra problematic if it had noncopyables inside), and it automatically synchronises with the ARC's other users. You could use this to implement a fork-join pattern, like so:

    let arc = RWARC(some_data);
    for num_cpus().times {
        let arc2 = clone(&arc);
        do task::spawn |move arc2| {
            process_data(arc2); // might read, write, whatever
        }
    }
    let modified_data = unwrap(move arc); // blocks on all child tasks at once
    // do more of the algorithm, etc.

All this without ever once copying the data.

This about wraps up the contributions I made this summer at Mozilla. In my next post I'll conclude the series with a summary of why I like Rust so much.

2012-09-22

Rust (3): Typesafe Shared State

Previously I introduced Rust, talking about syntax, pointer types, and light-weight parallelism and message-passing. I also wrote about my own summer project, flexible failure propagation between tasks, talking about some more advanced programming techniques with Rustic tasks.

Through it all you might have been wondering, "No shared state?! I see the value in eliminating data races, but isn't it sometimes what you want?" Yes! That's what this post is for.

Consider: When spawning a bunch of tasks to process a large data structure in parallel, it would be a shame to have to deeply copy the whole thing and send one copy over a pipe to each task (expensive in both space and time). You'd want each task to be able to alias the same data instead.

Shared Immutable State


Rust's standard library includes the ARC, which stands for Atomically Reference-Counted object. The ARC serves as a wrapper-handle to some data you wish to share; rather than copying the data itself, you copy just the handle, which only involves atomically incrementing a reference count for the contained data.

To create an ARC:

    // Given ownership of some data, wraps it in an ARC.
    fn ARC<T: Const Send>(data: T) -> ARC<T>

The polymorphic type T is constrained by the Send kind (which I mentioned in my primer post), so it can only be used with data of types that you could also otherwise send over pipes, and also by the Const kind, which means the data can have no mutable interior fields (the type has to be deeply immutable to guarantee no data races).

Like pipe endpoints, the ARC is a noncopyable type. New handles to the same ARC cannot be freely created (for that would bypass the reference counting mechanism); they must be made using the rest of the interface. (ARC also uses destructors internally, so the moment an ARC handle leaves scope, the reference count gets dropped. When the count hits zero, the data will be freed as well.)

And to use an ARC:

    // Creates a new handle to the ARC.
    fn clone<T: Const Send>(arc: &ARC<T>) -> ARC<T>
    // Get an immutable pointer to the underlying data.
    fn get<T: Const Send>(arc: &a/ARC<T>) -> &a/T

You'll notice the use of &-pointers (borrowed pointers) in this interface. In clone(), this means the argument ARC is passed by-reference rather than by-ownership to create the new handle. The interface of get() introduces some new syntax, &a/T; to explain it, I'll need to introduce regions.

As I hinted at in my primer post, borrowed pointers are statically analysed to ensure they don't outlive the data they were borrowed from. This is done by associating a region with the borrowed pointer to denote its lifetime (which is tied to some lexical scope or inherited from other data's lifetime).

Mostly, regions exist behind-the-scenes, since the compiler can infer them when needed. Sometimes it is useful, though, to explicitly write that two regions will be the same -- the &a/T syntax denotes a borrowed pointer to a T with some lifetime a. Because the same region variable is used to borrow the ARC itself ("&a/ARC<T>"), the compiler knows to enforce in get()'s caller that the returned pointer cannot outlive the associated ARC handle. get()  is said to be region-parametric; that is, the region variable a can be instantiated with whatever region is appropriate at each call-site.

Examples


Here's a code snippet that demonstrates basic ARC usage. I create an ARC with a BigDataStructure inside, clone a second handle, and then get references into the shared data from two parallel tasks.

    fn init() -> BigDataStructure   { ... }
    fn access(x: &BigDataStructure) { ... }

    fn main() {
        let arc1 = ARC(init());   // refcount == 1
        let arc2 = clone(&arc1);  // refcount == 2
        do task::spawn |move arc2| {  // gives child ownership of 2nd handle
            let x2: &BigDataStructure = get(&arc2);
            access(x2);  // in parallel with the below
            // arc2 gets dropped. BigDataStructure might get freed here.....
            // (note: x2 can no longer be accessed)
        }
        let x1: &BigDataStructure = get(&arc1);
        access(x1);  // in parallel with the above
        // arc1 gets dropped. .....or it might get freed here.
        // (note: x1 can no longer be accessed)
    }


Here are some examples of ways the type system prevents unsafe usage.
  • First, the compiler won't let me bypass the reference-counting mechanism:

        let arc1 = ARC(init());  // refcount == 1
        let arc2 = arc1;         // ERROR: copying a noncopyable value
        // double free :(


    If ARC handles were copyable, two destructors would run here and the reference count would get decremented too many times.
  • The compiler will also stop me from using the reference from get() after the associated ARC handle went out of scope (which is legal in a language like C++, and would result in a use-after-free):

        fn broken_get(arc: ARC<BigDataStructure>) -> &a/BigDataStructure {
            // note the unconstrained region variable ^
            let x = get(&arc);
            return x;  // ERROR: reference not valid outside of its lifetime
            // note: the arc handle would get dropped here(??)
        }
        access(broken_get(ARC(init())));  // use after free :(

  • Finally, I will try to surrender ownership of my ARC handle by sending it over a pipe (perhaps to another task), while still holding on to a pointer I borrowed from it with get().

        let (sender,receiver) = pipes::stream();
        let arc = ARC(init());
        let x = get(&arc);      // NOTE: loan of local variable granted here
        sender.send(move arc);  // ERROR: moving out of local variable
                                //        prohibited due to outstanding loan
        access(x);  // unknown whether arc is still alive(??)

    But the compiler's borrow checker stopped me, because the "loan" I had created earlier was still in scope.

Safety


Because Rust intentionally has no language features to support shared state, the ARC library provides it by using unsafe code internally. Given that unsafe code "shifts the burden of proof from the compiler to the programmer", how can we know the interface is right?

While we are working on a proof of the region system's correctness in general, we don't have a proof for this interface in particular (though I'd be curious how one would look!). Nevertheless, we can be quite confident in the ARC's safety because of the guarantees that Rust's language features provide:
  1. The Const kind restriction and the immutable pointer returned by get() ensure that once inside an ARC, data can never be modified. This makes data races impossible, and also precludes the possibility of constructing a cyclic reference among ARCs. (Reference counting is a safe memory management strategy only in absence of cycles.)
  2. The use of noncopyable ("linear") types for the ARC handles ensures that the reference count exactly matches the number of handles, and therefore the associated data will only be freed when all handles have left scope.
  3. The regioned type signature of get() ensures that a reference to the contained data must be outlived by its associated handle (and hence, by #2, outlived also by the contained data itself).
Stay tuned for a follow-up post explaining a still more advanced interface I created for safely sharing mutable state between tasks.

2012-09-18

Rust (2): Linked Task Failure

In my last post, I gave an introduction to Rust's syntax and memory/concurrency model. None of that stuff was anything I contributed -- my own work is what I'll talk about in this post.

Rust has a built-in mechanism for failure, a sort of light-weight exception that can be thrown but not caught. It is written "fail" (or "fail "reason"", or sometimes "assert expr"), and it causes the task to unwind its stack, running destructors and freeing owned memory along the way, and then exit.

There are library convenience wrappers for handling failure on the other side of the task boundary, so:

    let result = do task::try {  // spawns and waits for a task
        fail "oops!";
    };
    assert result.is_err();

(There is talk of extending failure to support throwing values of an "any" type and catching them, but that will take development effort.)

But not all failure is created equal. In some cases you might need to abort the entire program (perhaps you're writing an assert which, if it trips, indicates an unrecoverable logic error); in other cases you might want to contain the failure at a certain boundary (perhaps a small piece of input from the outside world, which you happen to be processing in parallel, is malformed and its processing task can't proceed).

Hence the need for different linked failure spawn modes, which was my main project at Mozilla this summer. One of the main motivations for configurable failure propagation is Servo, a parallel web browser being written in Rust (again from Mozilla Research), so along with the code examples below I'll also include a web-browser-style use case for each failure mode.

Linked Task Failure


By default, task failure is bidirectionally linked, which means if either task dies, it kills the other one.

    do task::spawn {
        do task::spawn {
            fail;  // All three tasks will die.
        }
        sleep_forever();  // will get woken up by force
    }
    sleep_forever();  // will get woken up by force

There are plans for Servo to have parallel HTML/CSS parsing and lexing, so the parse phase can start before lexing finishes. If an error happens during either phase, though, the other one should stop immediately -- an application for bidirectionally linked failure.


Supervised Task Failure


If you want parent tasks to kill their children, but not for a child task's failure to kill the parent, you can call task::spawn_supervised for unidirectionally linked failure.

The function task::try uses spawn_supervised internally, with additional logic to wait for the child task to finish before returning. Hence:

    let (receiver,sender) = pipes::stream();
    do task::spawn {  // bidirectionally linked
        // Wait for the supervised child task to exist.
        let message = receiver.recv();
        // Kill both it and the parent task.
        assert message != 42;
    }
    do task::try {  // unidirectionally linked
        sender.send(42);
        sleep_forever();  // will get woken up by force
    }
    // Flow never reaches here -- parent task was killed too.

Supervised failure is useful in any situation where one task manages multiple child tasks, such as a parent tab task with several image-rendering child tasks, each of which could fail due to corrupted image data. This failure mode was inspired by Erlang.


This mode of failure propagation was also the hardest to fully support, because parent task failure must propagate across multiple generations even if an intermediate generation has already exited:

    do task::spawn_supervised {
        do task::spawn_supervised {
            sleep_forever();  // should get woken up by force
        }
        // Intermediate task immediately exits.
    }
    wait_for_a_while();
    fail;  // must kill grandchild even if child is gone

Unlinked Task Failure


Finally, tasks can be configured to not propagate failure to each other at all, using task::spawn_unlinked for isolated failure.

    let (time1, time2) = (random(), random());
    do task::spawn_unlinked {
        sleep_for(time2);  // won't get forced awake
        fail;
    }
    sleep_for(time1);  // won't get forced awake
    fail;
    // It will take MAX(time1,time2) for the program to finish.

If you're a Firefox user, you're probably familiar with this screen. Using tasks with isolated failure would prevent the entire browser from crashing if one particular tab crashed.


Wrap-Up


I'd also like to note that asynchronous failure is one of the few sources of nondeterminism in Rust. This code, for example, is dependent on task scheduling patterns:

    fn random_bit() -> bool {
        let result = do task::try {  // supervised
            do task::spawn { fail; }  // linked
            // Might get through here ok; might get killed.
        };
        return result.is_success();
    }

The fact that Rust has no shared state between tasks makes it difficult to trip over inherent randomness in scheduling patterns.

Other sources of nondeterminism include (1) a certain library for shared state, which I'll talk about in my next post; (2) the ability to select on multiple pipes at once; (3) the ability to detect when a pipe endpoint was closed before the message was received (called "try_send()"); and of course (4) system I/O (which includes random number generation). Eric Holk and I believe that in the absence of these five things, Rust code (including one-to-one pipe communication) is deterministic.

If you're interested, the slide deck I used for my end-of-internship presentation on linked failure (with more of the same pictures) is here.

2012-09-04

Rust (1): Primer

I spent my summer at Mozilla Research working on Rust. There were several interesting things I did that I'll write about in subsequent posts; this one is an introduction/primer.

Rust is an experimental, still-in-development language that is geared towards parallelism and performance while at the same time providing a strong static type system. (You can read the other buzzwords on the website.)

Syntax Primer


On the front page of Rust's website, there is a code snippet:
 
    fn main() {
        for 5.times {
            println("Here's some Rust!");
        }
    }

This looks sort of cutesy and imperative, but actually there is some syntax sugar going on which facilitates a more functional-programming idiom. The above code is equivalent to:
 
    fn main() {
        times(5, || { println("Here's some Rust!"); true });
    }

where "|args*| { stmt* }" is the lambda/closure syntax (like in Ruby), and "times" is a core library function implemented as:
 
    fn times(count: uint, blk: fn() -> bool) {  // 'blk' is a stack-allocated closure
        if count > 0 {
            if blk() {  // Only continue looping if blk succeeds
                times(count-1, blk);  // Iterate until count hits 0
            }
        }
    }

The long and short of this is that idiomatic Rust typically has a lot of curly-brace "control flow blocks" that are actually closures, and higher-order functions are commonplace.

Concurrency


So, when I was giving my end-of-internship talk (which I'll link in my next post), I showed how easy it is to add parallelism to your Rust program.
 
    fn main() {
        for 5.times {
            do task::spawn { // create 5 tasks to print a message in parallel
                println("Here's some Rust!");
            }
        }
    }

'task::spawn' has the signature "fn spawn(child: ~fn())" and is implemented with magic (unsafe code and runtime calls) internally. The 'do' syntax is similar to the 'for' syntax, but doesn't use the "iteration protocol" in which the closure returns bool.

(That code is equivalent to "times(5, || { task::spawn(|| { println("..."); }); true });".)

The Memory Model


If you've a sharp eye, you're wondering what that "~" is that I snuck in on the type of the closure for the child task. That's actually a pointer type, of which Rust has three (none of which can be null, by the way):
  • ~T is a unique pointer to a T. It points to memory allocated in the send heap, which means data inside of unique pointers can be sent between tasks. You can copy unique pointers, but only by deeply copying (otherwise they wouldn't be unique!) (and by default, they are "non-implicitly-copyable", so the compiler will issue warnings if you copy them without writing the "copy" keyword).
  • @T is a managed pointer to a T. Currently, these are reference-counted and cycle-collected (they may be full-on GCed in the future). Copying one increments the reference count, so multiple managed pointers can point to the same data. These are allocated on a per-task private heap, and cannot be sent between tasks.
  • &T is a borrowed pointer to a T. It can point to the inside of arbitrary data structures - on the stack, inside ~ or @ pointers, etc. Rust has a static analysis, called the "borrow checker", that ensures borrowed pointers do not outlive the scope of the pointed-to data (i.e., it is impossible for Rust programs to have a use-after-free).

    Behind this analysis is a sophisticated region system, developed by Niko Matsakis, which you can read about in this tutorial on his blog. I'll also talk a bit more about these in a later post.
The end result here is that in Rust there can be no shared state between tasks; tasks may only communicate by message-passing or by moving unique values into unique closures. More technically said, there is an inherent "send" kind that denotes whether a type may be sent to another task. ~T is sendable if T is sendable; @T and &T are never sendable; structs (conjunctive types) and enums (disjunctive types) are sendable if their contents are sendable; primitive types are always sendable.

Communication


Tasks can pass messages between each other using pipes, which is Rust's communication primitive. Pipes consist of a send endpoint and a receive endpoint, each of which is a noncopyable type (or "linear type", by correspondence with linear logic).

Pipes' noncopyability ensures that communication is one-to-one (i.e., multiple tasks cannot send or receive on the same pipe), which allows their internal synchronisation implementation to be much simpler than N-to-N might require, and hence also be blazing fast. The other benefit of noncopyability is it allows for pipe protocols, statically-enforced send/receive state machines that ensure you can't send/receive values of the "wrong" type, or (for example) try to receive when the other endpoint is also receiving.

I was working closely this summer with Eric Holk, the one responsible for pipes. You can read more about them (some examples, some performance, some type theory) on his blog.

Conclusion


I've got several more posts coming up to talk about the two cool things I personally worked on this summer. Hopefully this post has gotten you enough up to speed on what's going on in Rust to follow along with what I did.

Hopefully also I've gotten you excited about using Rust to write parallel programs that are both safe and performant. I know I am.

2012-07-01

Linux's leap-second deadlocks

Intro

The leap second is an extra second we insert irregularly at midnight at the end of certain months as determined by astronomers. UTC clocks render this as 23:59:60.

Yesterday at that time, Linux servers around the world became wedged or experienced huge CPU spikes due to deadlock bugs in the leap second code. This post was linked on Hacker News today, and has a good summary of some of the bugs in the comments. Here I'll discuss the leap second deadlocks from a concurrency researcher's perspective.

Five Bugs

I did a bit of digging to see where in Linux's code things were actually going wrong. It turns out there have actually been five different bugs related to leap second management. It also turns out that we've seen Linux deadlock at the leap second before, in 2007. (Race conditions can exist unnoticed for a very long time in infrequently-tested code paths. Who knew?!)

Each of these bugs results from some interaction with a spin-lock called "xtime_lock". Take a look at this code (trimmed and approximated a bit), which has moved and changed between different functions over the years but currently lives in "ntp_leap_second" in kernel/time/ntp.c.
        write_seqlock(&xtime_lock);
        switch (time_state) {
        case TIME_INS:
                timekeeping_leap_insert(-1);
                time_state = TIME_OOP;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: inserting leap second 23:59:60 UTC\n");
                break;
        case TIME_DEL:
                timekeeping_leap_insert(1);
                time_state = TIME_WAIT;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: deleting leap second 23:59:59 UTC\n");
                break;
        // (more cases omitted ...)
        }
        write_sequnlock(&xtime_lock);
I will list the bugs in chronological order.
  1. One deadlock is described here (fixed by same, in 2007). The function clock_was_set() calls smp_call_function() to "retrigger CPU local events" in the high-resolution timer subsystem. Unfortunately, it's forbidden to call smp_call_function() in "atomic context", which this code is in because it holds a spinlock (and moreover, is running in the timer interrupt handler). This was "fixed" in this commit - they simply removed the call.
  2. Another deadlock bug is shown in this post (it was fixed in 2008, according to this - but not all machines' Linux versions had that fix in time for the 2008-2009 new year's leap second). The above code's call to printk() actually needs to schedule the logging daemon kthread in order to print. Linux has a complicated "completely fair" scheduling algorithm which, under enough system load, needs to check the timer to determine what scheduling pattern to use. Thus, only under heavy load, the call to printk() while holding xtime_lock would attempt to acquire xtime_lock again, causing deadlock. (A minimal userspace sketch of this self-deadlock pattern appears just after this list.)
  3. A third deadlock bug, linked from the serverfault post I linked to in the intro, is shown and fixed in this commit, dated a month and a half ago. Nine days prior, the "ntp_lock" spinlock had been split out from "xtime_lock" for finer locking granularity, but the split went wrong here because it created a circular lock ordering. Only kernels built from this nine-day window would have this bug.
  4. Despite all these bugs' fixes, yesterday saw Linux servers having problems around the world. Many people reported huge CPU spikes, which turned out to result from futex misbehaviour. (Literally while writing this post, the systems hacker friend who told me yesterday about the bug had the same futex problem on his own server.)
    It turned out that in bug #1 above, removing clock_was_set() was wrong after all. Technical details about this are here (also, major props to John Stultz, the same guy who fixed bug #3, for being on the case promptly last night, and for offering clarification when I emailed him). In short, the bug happened because the missing call caused sub-second high-resolution timers to always return immediately, which caused userspace applications that use them in loops to spin in tight loops, eating up CPU. The popular-seeming fix was to run 'date -s "`date`"', which calls settimeofday(), which calls clock_was_set(), replacing the missing call (reference).
  5. And yet, there is still another bug lurking whose cause nobody seems to have discovered yet. In the serverfault post I linked in the intro, there was also an error message "[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358". This message is printed when the kernel detects a spinlock has been held for a second or more, indicating (you guessed it) deadlock. The Linux folks don't appear to have figured this one out yet (not to disparage them - it's only been a day).
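
As promised in bug #2's description above, here is a minimal userspace C sketch of that self-deadlock pattern: a code path that reacquires a lock the current thread already holds. The names are stand-ins of my own, and I use an error-checking pthread mutex so the program reports the deadlock instead of hanging (a kernel spinlock extends no such courtesy):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static pthread_mutex_t xtime_lock;  /* stand-in for the kernel's xtime_lock */

    /* Stand-in for printk(): under "load", logging wants the same lock. */
    static void log_message(const char *msg) {
        int err = pthread_mutex_lock(&xtime_lock);  /* second acquisition! */
        if (err != 0) {
            fprintf(stderr, "would deadlock: %s\n", strerror(err));
            return;
        }
        puts(msg);
        pthread_mutex_unlock(&xtime_lock);
    }

    int main(void) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* ERRORCHECK makes the self-deadlock visible; a normal mutex (or a
         * kernel spinlock) would simply hang here. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&xtime_lock, &attr);

        pthread_mutex_lock(&xtime_lock);    /* like write_seqlock(&xtime_lock) */
        log_message("Clock: inserting leap second 23:59:60 UTC");
        pthread_mutex_unlock(&xtime_lock);
        return 0;
    }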

Research Applicability

An obvious zeroth-order conclusion: There were/are disproportionately many disastrous race conditions in the leap second code simply because it's such an infrequently-executed codepath. Hence, there's a call for some extra form of verification apart from "run the code as it is" - whether it's code-coverage-based stress tests, symbolic execution, or static analysis. I think static analysis would do best here.

One interesting thing to note is that all of the locks involved here are global - the xtime_lock and the ntp_lock, and also the "interrupt handling context", which is a global property of the code. Projects like RacerX (paper pdf) would be well capable of finding deadlocks #2 and #3 by simple callgraph analysis.
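
Concretely, the lock-ordering shape such a tool hunts for is the classic "ABBA" pattern behind bug #3: one code path takes lock A then lock B, while another takes B then A. A minimal C/pthreads sketch with stand-in names of my own (run it enough times and it will eventually hang):

    #include <pthread.h>
    #include <stdio.h>

    /* Stand-ins for xtime_lock and the split-out ntp_lock. */
    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *path_one(void *arg) {  /* acquires A, then B */
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
    }

    static void *path_two(void *arg) {  /* acquires B, then A: circular ordering */
        pthread_mutex_lock(&lock_b);
        pthread_mutex_lock(&lock_a);
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, path_one, NULL);
        pthread_create(&t2, NULL, path_two, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("got lucky with the schedule this time");
        return 0;
    }

A static analysis never needs to hit the unlucky schedule: the two acquisition orders are visible in the callgraph alone, which is exactly why this style of checking suits rarely-executed paths like the leap second code.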

Bug #1 is especially interesting, though. This "can't call scheduler code while in interrupt context" is a very simple property, for which there's currently a runtime check (in kernel/smp.c):
    /* Can deadlock when called with interrupts disabled. */
    WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() && !oops_in_progress);
But I think this property can be ensured at compile-time - so that code with this bug will never build, let alone ship to production systems. I'd thought about this before, and last fall I wrote a primitive checker for this as a class project, which I called Atomic All-Nighters, and which would have been directly capable of finding bug #1 without ever even compiling or running the code. Here's the writeup pdf.

My high-minded idealist goal for the Atomic All-Nighters project is to get these static properties of the code representable by constructs in the kernel programming language itself. That way, the compiler would refuse to build any code that had this sort of bug. I'm going to work more on that project when I get back to CMU to start on my Ph.D. in the fall. I'm excited for it.