2011-07-15

The Importance of Being Assertive, A Trivial Style Guideline for Serious Programmers

one of the key ways that landslide "talks" to the guest kernel (i.e., manipulating thread interleaving patterns) is by triggering timer interrupts at key points in the kernel's execution. in terms of very general race detection, this is about the equivalent of looking at the code and thinking "what happens if we get preempted here?" (which is the traditional way of writing "race-free" concurrent code); of course, in this project, it will have some extra automated cleverness attached so it can be effective.

because landslide's code gets invoked at every instruction of the guest's execution, we must have a notion of a scheduling operation being "still in progress" - that is, after triggering a timer interrupt, there will be several instructions before a target thread (one we decided to switch to) actually starts running. if we take note of when the target thread starts up, we can provide a few guarantees about the instructions that will be run until then - namely, that the kernel must be executing in an interrupt handler and/or the context switcher (and NOT running the "meaty bits" of any other thread's code-path). this may seem obvious, but it is still an invariant that i would like to rely on, so i expressed it as an assert in the code:

    assert(ACTION(scheduler, context_switch) || HANDLING_INTERRUPT(scheduler));

imagine my surprise, testing the implementation, when this assert tripped!

after a bit of debugging, i discovered that the invariant violation was happening at the instruction immediately following the instruction at which i tried to trigger a timer interrupt. it turns out that, in some cases, simics may decide to delay interrupt processing by a few instructions (seemingly by a non-deterministic amount, too) after i set the CPU's pending interrupt flags.

the fix (ensuring that when i decide to trigger a timer interrupt it is actually received immediately) is nothing special; what is important is to realise that my programming environment had some peculiarity that broke an assumption that i didn't even realise i was making when i established the invariants of my own code. so upon finding this new assumption (and writing code to make sure it worked), i added another assert:

    if (scheduler->just_triggered_timer_interrupt) {
        assert(get_current_eip() == get_timer_wrap_begin() &&
               "simics attempted to delay our interrupt! :<");
        scheduler->just_triggered_timer_interrupt = false;
    }

now if something else of this nature goes wrong, i will know immediately, with a useful error message to boot. but imagine if i'd never written that first assert to begin with? simics could have merrily delayed all my interrupts for however long it wanted, i would never have known, and wherever i would decide to trigger interrupts (i.e., notionally good places for exposing race conditions) would have no bearing on when they actually happened! i could spend months on this project and it would never work right and i might never know why.

use asserts, fellow hackers - not just comments or thoughts in your heads. you'll be happy for it later.

2011-07-04

agence

in order to schedule threads in any interesting way, landslide will need to have pretty tight control over the internal scheduler of the kernel under test. this is not easy, especially since every kernel will have its own slightly different way of tracking scheduler state, so landslide will need a flexible and accurate way of monitoring what's going on in the guest kernel.

i just finished a bit of framework that lets landslide see a "reflection" of the guest kernel's scheduler. soon, we'll be able to bend the guest's scheduling patterns to our whims, but for now, here's what we see when passively viewing the runqueue during the boot-up sequence (comments are my own, post-hoc):

simics> c
switched threads 1 -> -268370093 at 0x1009cf  // garbage from init i haven't cleaned up yet
switched threads -268370093 -> 1 at 0x105385  // it's research-quality; give me a break
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2, 1]         // shell fork()ed from init
agent 2 gone at eip 0x1056d5 -- [1]           // shell blocks on readline()
agent 1 gone at eip 0x1056d4 -- []            // take init off RQ to schedule
switched threads 2 -> 1 at 0x102f7b
new agent 1 at eip 0x105613 -- [1]            // init back on RQ after context-switch
agent 1 gone at eip 0x1056d5 -- []            // init blocks on wait()

one particularity of the exhibited guest kernel (my student kernel) is that every context-switch involves pulling the target thread off of the runqueue and putting it back on later - which we see clearly here. also, keep in mind that this is all from within simics; the kernel itself is totally unmodified. (obviously, the same output could be achieved by putting print statements into the kernel at key points, which is not the point here.)

i also use the term "agent" to refer to a thread that is currently on the runqueue (i.e. can be context-switched to at any moment); it is recycled terminology from dBug.

anyway, so if i type a command at the prompt, it continues:

new agent 2 at eip 0x105613 -- [2]            // kbd handler sees "\n" and puts shell on RQ
agent 2 gone at eip 0x1056d4 -- []            // take shell off RQ to schedule
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2]            // shell back on RQ after context-switch
switched threads 2 -> 3 at 0x102f7b
new agent 3 at eip 0x105613 -- [3, 2]         // shell forks "juggle" process
switched threads 3 -> 4 at 0x102f7b
new agent 4 at eip 0x105613 -- [4, 3, 2]      // "juggle" forks its own child threads...


...and so on.