2012-09-22

Rust (3): Typesafe Shared State

Previously I introduced Rust, talking about syntax, pointer types, and light-weight parallelism and message-passing. I also wrote about my own summer project, flexible failure propagation between tasks, talking about some more advanced programming techniques with Rustic tasks.

Through it all you might have been wondering, "No shared state?! I see the value in eliminating data races, but isn't it sometimes what you want?" Yes! That's what this post is for.

Consider: When spawning a bunch of tasks to parallelly process a large data structure, it would be a shame to have to deeply copy the whole thing and send one copy over a pipe to each task (expensive in both space and time). You'd want each task to be able to alias the same data instead.

Shared Immutable State


Rust's standard library includes the ARC, which stands for Atomically Reference-Counted object. The ARC serves as a wrapper-handle to some data you wish to share; rather than copying the data itself, you instead copy just the handle, which just involves atomically incrementing a reference count for the contained data.

To create an ARC:

    // Given ownership of some data, wraps it in an ARC.
    fn ARC<T: Const Send>(data: T) -> ARC<T>

The polymorphic type T is constrained by the Send kind (which I mentioned in my primer post), so it can only be used with data of types that you could also otherwise send over pipes, and also by the Const kind, which means the data can have no mutable interior fields (the type has to be deeply immutable to guarantee no data races).

Like pipe endpoints, the ARC is a noncopyable type. New handles to the same ARC cannot be freely created (for that would bypass the reference counting mechanism); they must be made using the rest of the interface. (ARC also uses destructors internally, so the moment an ARC handle leaves scope, the reference count gets dropped. When the count hits zero, the data will be freed as well.)

And to use an ARC:

    // Creates a new handle to the ARC.
    fn clone<T: Const Send>(arc: &ARC<T>) -> ARC<T>
    // Get an immutable pointer to the underlying data.
    fn get<T: Const Send>(arc: &a/ARC<T>) -> &a/T

You'll notice the use of &-pointers (borrowed pointers) in this interface. In clone(), this means the argument ARC is passed by-reference rather than by-ownership to create the new handle. The interface of get() introduces some new syntax, &a/T, which to explain I'll need to introduce regions.

As I hinted at in my primer post, borrowed pointers are statically analysed to ensure they don't outlive the data they were borrowed from. This is done by associating a region with the borrowed pointer to denote its lifetime (which is tied to some lexical scope or inherited from other data's lifetime).

Mostly, regions exist behind-the-scenes, since the compiler can infer them when needed. Sometimes it is useful, though, to explicitly write that two regions will be the same -- the &a/T syntax denotes a borrowed pointer to a T with some lifetime a. Because the same region variable is used to borrow the ARC itself ("&a/ARC<T>"), the compiler knows to enforce in get()'s caller that the returned pointer cannot outlive the associated ARC handle. get()  is said to be region-parametric; that is, the region variable a can be instantiated with whatever region is appropriate at each call-site.

Examples


Here's a code snippet that demonstrates basic ARC usage. I create an ARC with a BigDataStructure inside, clone a second handle, and then in two parallel tasks get references into them.

   
fn init() -> BigDataStructure   { ... }
   
fn access(x: &BigDataStructure) { ... }

    fn main() {
        let arc1 = ARC(init());   // refcount == 1
       
let arc2 = clone(&arc1);  // refcount == 2
        do task::spawn |move arc2| {  // gives child ownership of 2nd handle
           
let x2: &BigDataStructure = get(&arc2);
            access(x2);  // in parallel with the below
            // arc2 gets dropped. BigDataStructure might get freed here.....
            // (note: x2 can no longer be accessed)
        }
       
let x1: &BigDataStructure = get(&arc1);
        access(x1);  // in parallel with the above
        // arc1 gets dropped. .....or it might get freed here.
        // (note: x1 can no longer be accessed)
    }


Here are some examples of ways the type system prevents unsafe usage.
  • First, the compiler won't let me bypass the reference-counting mechanism:

        let arc1 = ARC(init());  // refcount == 1
       
    let arc2 = arc1;         // ERROR: copying a noncopyable value
        // double free :(


    If ARC handles were copyable, two destructors would run here and the reference count would get decremented too many times.
  • The compiler will also stop me from using the reference from get() after the associated ARC handle went out of scope (which is legal in a language like C++, and would result in a use-after-free):

        fn broken_get(arc: ARC<BigDataStructure>) -> &a/BigDataStructure {
           
    // note the unconstrained region variable ^
            let x = get(&arc);
            r
    eturn x;  // ERROR: reference not valid outside of its lifetime
            // note: the arc handle would get dropped here(??)
        }
        access(broken_get(ARC(init())));  // use after free :(

  • Finally, I will try to surrender ownership of my ARC handle by sending it over a pipe (perhaps to another task), while still holding on to a pointer I borrowed from it with get().

        let (sender,receiver) = pipes::stream();
       
    let arc = ARC(init());
       
    let x = get(&arc);      // NOTE: loan of local variable granted here
        sender.send(move arc);  // ERROR: moving out of local variable
                                //        prohibited due to outstanding loan
        access(x);  // unknown whether arc is still alive(??
    )

    But the compiler's borrow checker stopped me, because the "loan" I had created earlier was still in scope.

Safety


Because Rust intentionally has no language features to support shared state, the ARC library provides it by using unsafe code internally. Given that unsafe code "shifts the burden of proof from the compiler to the programmer", how can we know the interface is right?

While we are working on a proof of the region system's correctness in general, we don't have a proof for this interface in particular (though I'd be curious how one would look!). Nevertheless, we can be quite confident in the ARC's safety because of the guarantees that Rust's language features provide:
  1. The Const kind restriction and the immutable pointer returned by get() ensure that once inside an ARC, data can never be modified. This makes data races impossible, and also precludes the possibility of constructing a cyclic reference among ARCs. (Reference counting is a safe memory management strategy only in absence of cycles.)
  2. The use of noncopyable ("linear") types for the ARC handles ensures that the reference count exactly matches the number of handles, and therefore the associated data will only be freed when all handles have left scope.
  3. The regioned type signature of get() ensures that a reference to the contained data must be outlived by its associated handle (and hence, by #2, outlived also by the contained data itself).
Stay tuned for a follow-up post explaining a still more advanced interface I created for safely sharing mutable state between tasks.

No comments:

Post a Comment