2011-11-10

vanish_vanish

a few weeks ago i hit the big milestone of being able to actually explore the decision tree. in addition to the meta-scheduler, this required me to build the following pieces of infrastructure:

- tree explorer, to find the next unexplored branch. (71 lines of code)
- save-and-restore, to travel back in time, as previously discussed (310 lines of code)
- pretty-printing / diagnostics-reporting "found a bug" routine (42 lines of code)
- deadlock detection (trivial, given the agent-tracking code already in place)

i've been using vanish_vanish, a small userspace program that simply forks a child, which vanishes, then vanishes itself, to test the infrastructure as it comes into existence. the reason this is interesting is because i modified my student kernel so its vanish implementation does approximately as follows:

void NORETURN vanish()
{
    task_lock(self);
    ...
    for_each_child(self) {
        task_lock(child);  // A
        child->parent = &init_process;
        task_unlock(child);
    }
    ...
    task_lock(self->parent);  // B
    notify_waiter(self->parent);
    task_unlock(self->parent);
    ...
}

the astute programmer will note that a concurrently vanishing parent and child could deadlock if the parent tries to execute line A while the child tries to execute line B, but this is not guaranteed. the extra nice thing about this test case is that to find the bug, one only needs to consider calls to task_lock as potential choice points (actually, only even such calls from within vanish, which is what i'm really doing here). (i'm going to add shared memory accesses soon, but that will make the tree explode, which would be bad for finding whether my infrastructure is correct.)

using vanish_vanish as the test case and the naively-implemented guest kernel, landslide can find the deadlock in a little bit over 4 minutes:

...
[SCHEDULE]      about to switch threads 3 -> 4
[SCHEDULE]      about to switch threads 4 -> 3
[SCHEDULE]      DEADLOCK! (3 -> 4 -> 3)
[BUG!]          ****    A bug was found!   ****
[BUG!]          **** Choice trace follows. ****
[BUG!]          Choice 1:       at eip 0x00105ae5, trigger_count 1144575, TID 1
[BUG!]          Choice 2:       at eip 0x00105ae5, trigger_count 1158049, TID 2
[BUG!]          Choice 3:       at eip 0x00106930, trigger_count 1677311, TID 3
[BUG!]          Choice 4:       at eip 0x00106930, trigger_count 1677854, TID 4
[BUG!]          Choice 5:       at eip 0x00106930, trigger_count 1712596, TID 3
[BUG!]          Choice 6:       at eip 0x00106930, trigger_count 1747558, TID 4
[BUG!]          Choice 7:       at eip 0x00106930, trigger_count 1747805, TID 3
[BUG!]          Choice 8:       at eip 0x00106930, trigger_count 1748273, TID 2
[BUG!]          Choice 9:       at eip 0x00106930, trigger_count 1749356, TID 2
[BUG!]          Choice 10:      at eip 0x00106930, trigger_count 1750372, TID 4
[BUG!]          Current eip 0x00106957, trigger_count 1750826, total triggers 61208725
[BUG!]          Total choice points 2705, total backtracks 1378, depths 18934

apart from looking impressive, this output is a good prompt to talk about where the project is headed to next.

to conclude that landslide can find deadlocks such as this in under 5 minutes is not an honest statistic, because the decision tree will get a lot bigger when i enrich the set of decision points, which is a necessary next step. (consider: if you removed one of the task_lock calls, there would still be a race, but it would not be a deadlock anymore, and we would need to look at shared memory accesses to find it.) what we can conclude from this test output, however, is the exploration rate: on average, it takes landslide 0.09 seconds per decision point in the tree. it's not clear how fast this is in terms of my eventual goal; this will become clear when i build a better set of decision points and try testing for more subtle race conditions.

i'm certainly proud to have enough infrastructure to be able to say "look! it works!", but it's not clear how much weight that should carry until i have some evidence from a more realistic test case.

2011-11-09

PDL retreat

i presented landslide at the PDL retreat, as a 10-minute work-in-progress talk and also during the poster sessions. (here are my slides.)

there's not much to say that wouldn't sound trite - in short, how great it was to synchronise with people from industry about what my project and i ought to be expecting out of each other, and also how great it was to get so many different advices from people about how to chart a path for my future.

one thing that's really worth mentioning is that during the industry feedback session at the end of the retreat, christos from vmware called out my project as "the Holy Grail for anybody who does kernel development; I'd like to see that work moving forward."