Why does mdadm drop a drive when creating a RAID 5 array?

I've had the opportunity to play with software RAID over the last few days, and one thing that confused the heck out of me was why, when I created a fresh RAID 5 array with a few disks, one of the disks came up as missing right from the start.

Eventually I found this in the mdadm man page:

Normally mdadm will not allow creation of an array with only one device, and will try to create a raid5 array with one missing drive (as this makes the initial resync work faster). With --force, mdadm will not try to be so clever.

The mdadm code enlightened me a bit more:

/* If this is  raid5, we want to configure the last active slot
 * as missing, so that a reconstruct happens (faster than re-parity)

But I'm still wondering why. I think I understand from following the code (if you know the code, please correct any mis-assumptions). Firstly, md.c schedules a resync when it notices things are out of sync, like when an array is created and the drives are not marked in sync or when there is a missing drive and a spare.

md goes through the sectors in the disk one by one (doing a bit of rate limiting) and calls sync_request for the underlying RAID layer for each sector. For RAID5 this firstly converts the sector into a stripe and sets some flags on that stripe (STRIPE_SYNCING). Each stripe structure has a buffer for each disk in the array, amongst other things (struct stripe_head in include/linux/raid/raid5.h). It then calls handle_stripe to, well, handle the stripe.

/* maybe we need to check and possibly fix the parity for this stripe
 * Any reads will already have been scheduled, so we just see if enough data
 * is available
 */
if (syncing && locked == 0 &&
    !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
    set_bit(STRIPE_HANDLE, &sh->state);
    if (failed == 0) {
        char *pagea;
        if (uptodate != disks)
            BUG();
        compute_parity(sh, CHECK_PARITY);
        uptodate--;
        pagea = page_address(sh->dev[sh->pd_idx].page);
        if ((*(u32*)pagea) == 0 &&
            !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
            /* parity is correct (on disc, not in buffer any more) */
            set_bit(STRIPE_INSYNC, &sh->state);
        }
    }
    if (!test_bit(STRIPE_INSYNC, &sh->state)) {
        if (failed==0)
            failed_num = sh->pd_idx;
        /* should be able to compute the missing block and write it to spare */
        if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
            if (uptodate+1 != disks)
                BUG();
            compute_block(sh, failed_num);
            uptodate++;
        }
        if (uptodate != disks)
            BUG();
        dev = &sh->dev[failed_num];
        set_bit(R5_LOCKED, &dev->flags);
        set_bit(R5_Wantwrite, &dev->flags);
        locked++;
        set_bit(STRIPE_INSYNC, &sh->state);
        set_bit(R5_Syncio, &dev->flags);
    }
}

We can inspect one particular part of that code, which can be triggered when we're running a sync. If we don't have a failed drive (i.e. failed == 0), we call into compute_parity to check the parity of the current stripe sh (for stripe head). If the parity is correct, we flag the stripe as in sync and carry on.
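To make the parity check concrete, here is a minimal userspace sketch of what I understand that test to be doing: XOR every block of the stripe (data and parity) together and see whether the result is all zeroes. The names xor_all_blocks, buffer_is_zero and stripe_parity_ok are my own, and BLOCK_SIZE is an arbitrary small stand-in for STRIPE_SIZE; none of this is kernel code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16 /* illustrative stand-in for the kernel's STRIPE_SIZE */

/* XOR every block of the stripe (data and parity) into out.
 * For a consistent RAID5 stripe the result is all zeroes. */
static void xor_all_blocks(uint8_t blocks[][BLOCK_SIZE], size_t ndisks,
                           uint8_t out[BLOCK_SIZE])
{
    memset(out, 0, BLOCK_SIZE);
    for (size_t d = 0; d < ndisks; d++)
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            out[i] ^= blocks[d][i];
}

/* The same trick as the snippet above: if the first 4 bytes are zero and
 * every byte equals the byte 4 positions later, the whole buffer is zero.
 * Assumes len >= 4. */
static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    uint32_t first;

    memcpy(&first, buf, sizeof(first));
    return first == 0 && memcmp(buf, buf + 4, len - 4) == 0;
}

/* Returns non-zero when the stripe's parity matches its data. */
static int stripe_parity_ok(uint8_t blocks[][BLOCK_SIZE], size_t ndisks)
{
    uint8_t acc[BLOCK_SIZE];

    xor_all_blocks(blocks, ndisks, acc);
    return buffer_is_zero(acc, BLOCK_SIZE);
}
```

With three blocks filled with 0xAA, 0x0F and 0xA5 the check passes, because 0xAA ^ 0x0F is 0xA5; flip a single bit anywhere and it fails.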

However, if the parity is incorrect (or there was a failed drive, so it was never checked), we fall through to the next if statement. Its first check is whether we have a failed drive; if we don't, we treat the disk holding the parity for this stripe (sh->pd_idx) as the failed one. This is where the optimisation comes in. If we do have a failed drive, its index is passed directly to compute_block, meaning that we will always be bringing the "failed" disk into sync with the other drives. We then set some flags so that the lower layers know to write out that drive's block.
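The choice of which block to rewrite can be sketched in a few lines of plain C. This is only my illustration of the idea, not the kernel's compute_block: rebuild_block and choose_sync_target are hypothetical names, and BLOCK_SIZE is again an arbitrary small value. The point it tries to show is that XOR is symmetric, so recomputing any one block from all the others leaves the stripe consistent, whether that block holds parity or data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16 /* illustrative stand-in for the kernel's STRIPE_SIZE */

/* Rebuild blocks[target] as the XOR of every other block in the stripe.
 * It makes no difference whether target holds data or parity. */
static void rebuild_block(uint8_t blocks[][BLOCK_SIZE], size_t ndisks,
                          size_t target)
{
    memset(blocks[target], 0, BLOCK_SIZE);
    for (size_t d = 0; d < ndisks; d++) {
        if (d == target)
            continue;
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            blocks[target][i] ^= blocks[d][i];
    }
}

/* Mirror of the decision in the snippet above: with no failed drive the
 * parity disk is rewritten, otherwise the failed (missing) drive is. */
static size_t choose_sync_target(int failed, size_t failed_num, size_t pd_idx)
{
    return failed == 0 ? pd_idx : failed_num;
}
```

On a freshly created array the existing contents are arbitrary anyway, so rebuilding a "data" disk from the others is as good a way of making the stripe consistent as rewriting the parity.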

The alternative is to update the parity, which in RAID5 is spread across the disks. This means you're constantly stuffing up nice sequential reads by making the disks seek backwards to write parity for previous stripes. Under the missing-drive scheme, for a situation where parity is mostly wrong (e.g. creating a new array), you just read from the other drives and bring the "failed" drive into sync with them. Thus the only disk doing any writing is the spare that is being rebuilt, and it is writing in a nice, sequential manner. And the other disks are reading in a nice, sequential manner too. Ingenious, huh!
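A toy model of where the writes land under each scheme may help. This is a sketch under my own assumptions: parity_disk uses a simple "left" style rotation that I picked for illustration (your array's layout may differ), every stripe is assumed to need fixing, and count_sync_writes is a hypothetical helper that only counts writes and models no real I/O or seek behaviour.

```c
#include <stddef.h>

/* Parity disk for a stripe, assuming parity starts on the last disk and
 * moves back one disk per stripe (an assumed layout for illustration). */
static size_t parity_disk(size_t stripe, size_t ndisks)
{
    return (ndisks - 1) - (stripe % ndisks);
}

/* Count writes per disk when every stripe needs fixing.
 * missing < 0:  fix parity in place, so writes follow the rotating parity.
 * missing >= 0: rebuild onto that one disk, so it takes every write.
 * writes[] must have room for ndisks counters. */
static void count_sync_writes(size_t nstripes, size_t ndisks, int missing,
                              size_t writes[])
{
    for (size_t d = 0; d < ndisks; d++)
        writes[d] = 0;

    for (size_t s = 0; s < nstripes; s++) {
        size_t target = missing < 0 ? parity_disk(s, ndisks)
                                    : (size_t)missing;
        writes[target]++;
    }
}
```

For 12 stripes on 4 disks, fixing parity in place gives each disk 3 writes mixed in with its reads, while rebuilding onto disk 3 sends all 12 writes to that disk and leaves the other three purely reading.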

I did run this by Neil Brown to make sure I wasn't completely wrong, and he pointed me to a recent email where he explains things. If anyone knows, he does!