RSS | technovelty home | page of ian | ian@wienand.org
I've had the opportunity to play with software RAID over the last few days, and one thing that confused the heck out of me was why, when I created a fresh RAID 5 array with a few disks, one of the disks came up missing from the start.
Eventually I found in the mdadm man page
The mdadm code enlightens me a bit more
/* If this is raid5, we want to configure the last active slot
 * as missing, so that a reconstruct happens (faster than re-parity)
 */
But I'm still wondering why. I think I understand from following
the code (if you know the code, please correct any mis-assumptions).
Firstly, md.c schedules a resync when it notices things
are out of sync, like when an array is created and the drives are not
marked in sync or when there is a missing drive and a spare.
md goes through the sectors in the disk one by one
(doing a bit of rate limiting) and calls sync_request for
the underlying RAID layer for each sector. For RAID5 this firstly
converts the sector into a stripe and sets some flags on that stripe
(STRIPE_SYNCING). Each stripe structure has a buffer for
each disk in the array, amongst other things (struct
stripe_head in include/linux/raid/raid5.h). It
then calls handle_stripe to, well, handle the stripe.
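The walk described above can be sketched roughly as follows. This is a toy model, not the kernel's code: every `sketch_` name is made up, the stripe size of 8 sectors is an assumption, and the loop assumes the RAID layer reports how many sectors it covered so the caller can skip ahead by that amount.

```c
#include <stddef.h>

/* Hypothetical, simplified model; none of this is the kernel's own code. */
#define SKETCH_SECTORS_PER_STRIPE 8 /* assumed: 4 KiB stripe, 512-byte sectors */

struct sketch_stripe {
    unsigned long long sector; /* first sector covered by this stripe */
    int syncing;               /* stands in for the STRIPE_SYNCING flag */
};

/* Round a sector down to the start of the stripe that contains it. */
static unsigned long long sketch_stripe_start(unsigned long long sector)
{
    return sector - (sector % SKETCH_SECTORS_PER_STRIPE);
}

/* Stand-in for the RAID level's sync_request: find the stripe for this
 * sector, flag it as syncing, hand it to the stripe handler (if any), and
 * report how many sectors from `sector` to the end of the stripe were
 * covered. */
static int sketch_sync_request(unsigned long long sector,
                               void (*handle)(struct sketch_stripe *))
{
    struct sketch_stripe sh;

    sh.sector = sketch_stripe_start(sector);
    sh.syncing = 1;
    if (handle)
        handle(&sh);
    return SKETCH_SECTORS_PER_STRIPE - (int)(sector - sh.sector);
}

/* Stand-in for md's resync loop (rate limiting left out). Returns the
 * number of stripes that were handed to the RAID layer. */
static unsigned long sketch_resync(unsigned long long total_sectors,
                                   void (*handle)(struct sketch_stripe *))
{
    unsigned long long sector = 0;
    unsigned long stripes = 0;

    while (sector < total_sectors) {
        sector += sketch_sync_request(sector, handle);
        stripes++;
    }
    return stripes;
}
```

Passing a callback as `handle` is just a way to keep the sketch self-contained; in the real code the RAID5 layer calls handle_stripe itself.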
    /* maybe we need to check and possibly fix the parity for this stripe
     * Any reads will already have been scheduled, so we just see if enough data
     * is available
     */
    if (syncing && locked == 0 &&
        !test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
        set_bit(STRIPE_HANDLE, &sh->state);
        if (failed == 0) {
            char *pagea;
            if (uptodate != disks)
                BUG();
            compute_parity(sh, CHECK_PARITY);
            uptodate--;
            pagea = page_address(sh->dev[sh->pd_idx].page);
            if ((*(u32*)pagea) == 0 &&
                !memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
                /* parity is correct (on disc, not in buffer any more) */
                set_bit(STRIPE_INSYNC, &sh->state);
            }
        }
        if (!test_bit(STRIPE_INSYNC, &sh->state)) {
            if (failed==0)
                failed_num = sh->pd_idx;
            /* should be able to compute the missing block and write it to spare */
            if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
                if (uptodate+1 != disks)
                    BUG();
                compute_block(sh, failed_num);
                uptodate++;
            }
            if (uptodate != disks)
                BUG();
            dev = &sh->dev[failed_num];
            set_bit(R5_LOCKED, &dev->flags);
            set_bit(R5_Wantwrite, &dev->flags);
            locked++;
            set_bit(STRIPE_INSYNC, &sh->state);
            set_bit(R5_Syncio, &dev->flags);
        }
    }
We can inspect one particular part of handle_stripe that gets
triggered when we're running a sync. If we don't have a failed
drive (i.e. failed == 0) then we call into
compute_parity to check the parity of the current stripe
sh (for stripe head). If the parity is correct, then
we flag the stripe as in sync, and carry on.
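To make the parity check concrete, here is a minimal user-space sketch of how I read it: XOR every data block into the parity buffer and see whether the result is all zeroes. The `sketch_` names and the byte-at-a-time XOR are my own; the kernel uses its own XOR routines. The zero test uses the same trick as the excerpt: if the first four bytes are zero and the buffer is equal to itself shifted along by four bytes, then every byte must be zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR src into dst, one byte at a time. */
static void sketch_xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Return 1 if buf is entirely zero. The first 4 bytes are tested directly;
 * memcmp then checks that each byte equals the byte 4 positions before it,
 * which chains the zero test through the rest of the buffer.
 * len must be at least 4. */
static int sketch_all_zero(const uint8_t *buf, size_t len)
{
    uint32_t first;

    memcpy(&first, buf, sizeof(first));
    return first == 0 && memcmp(buf, buf + 4, len - 4) == 0;
}

/* Check parity for one stripe: data XOR parity should be all zero.
 * Note this destroys the contents of the parity buffer, which matches the
 * "on disc, not in buffer any more" comment in the excerpt. */
static int sketch_check_parity(uint8_t *parity, const uint8_t *const *data,
                               int ndata, size_t len)
{
    for (int d = 0; d < ndata; d++)
        sketch_xor_into(parity, data[d], len);
    return sketch_all_zero(parity, len);
}
```

The overlapping memcmp is fine because memcmp only reads its arguments.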
However, if the parity is incorrect, we will fall out to the next
if statement. The first if checks if we have a failed
drive; if we don't, then we treat the disk holding the parity for
this stripe (sh->pd_idx) as the "failed" one. This is where the
optimisation comes in -- if we do have a failed drive then we pass it
directly to compute_block, meaning that we will always be
bringing the "failed" disk into sync with the other drives. We then
set some flags so that the lower layers know to write out that
drive.
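The reconstruction step relies on the usual RAID5 property that any one block in a stripe is the XOR of all the others, whether the missing block holds data or parity. A small sketch of that idea follows; the `sketch_` names are hypothetical and this is not the kernel's compute_block, just the same arithmetic in user space.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rebuild blocks[missing] as the XOR of every other block in the stripe.
 * The blocks array holds data and parity alike; XOR does not care which
 * one is being rebuilt. */
static void sketch_compute_block(uint8_t **blocks, int nblocks, int missing,
                                 size_t len)
{
    memset(blocks[missing], 0, len);
    for (int d = 0; d < nblocks; d++) {
        if (d == missing)
            continue;
        for (size_t i = 0; i < len; i++)
            blocks[missing][i] ^= blocks[d][i];
    }
}

/* Pick which block to rewrite for an out-of-sync stripe: the failed device
 * if there is one, otherwise the parity device for this stripe. This mirrors
 * the "if (failed==0) failed_num = sh->pd_idx;" line in the excerpt. */
static int sketch_pick_target(int failed, int failed_num, int pd_idx)
{
    return failed == 0 ? pd_idx : failed_num;
}
```

With a disk deliberately marked missing, the target is the same disk for every stripe, which is the whole point of the trick.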
The alternative is to update the parity, which in RAID5 is spread across the disks. This means you're constantly stuffing up nice sequential reads by making the disks seek backwards to write parity for previous stripes. Under the missing-disk scheme, in a situation where parity is mostly wrong (e.g. creating a new array), you just read from the other drives and bring the "failed" drive into sync with them. Thus the only disk doing any writing is the spare that is being rebuilt, and it is writing in a nice, sequential manner. The other disks are reading in a nice, sequential manner too. Ingenious, huh!
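A toy count of where the writes land under each approach shows the difference. It assumes every stripe needs one write (the fresh-array case) and uses a simple round-robin parity rotation, which is an assumption for illustration rather than md's actual layout; all the `sketch_` names are made up.

```c
/* Assumed parity placement: rotate through the disks one stripe at a time.
 * The real RAID5 layouts differ in detail, but all of them rotate parity. */
static int sketch_parity_disk(unsigned long stripe, int ndisks)
{
    return (int)(stripe % (unsigned long)ndisks);
}

/* Re-parity: each stripe's parity block is rewritten in place, so the
 * writes follow the parity rotation across every disk. */
static void sketch_writes_reparity(unsigned long nstripes, int ndisks,
                                   unsigned long *writes)
{
    for (int d = 0; d < ndisks; d++)
        writes[d] = 0;
    for (unsigned long s = 0; s < nstripes; s++)
        writes[sketch_parity_disk(s, ndisks)]++;
}

/* Rebuild: every stripe's write goes to the one disk being reconstructed. */
static void sketch_writes_rebuild(unsigned long nstripes, int ndisks,
                                  int target, unsigned long *writes)
{
    for (int d = 0; d < ndisks; d++)
        writes[d] = 0;
    for (unsigned long s = 0; s < nstripes; s++)
        writes[target]++;
}

/* Count how many disks received at least one write. */
static int sketch_disks_written(const unsigned long *writes, int ndisks)
{
    int n = 0;

    for (int d = 0; d < ndisks; d++)
        if (writes[d] > 0)
            n++;
    return n;
}
```

For 4 disks and 12 stripes, re-parity touches all four disks with three writes each, interleaved with their reads, while the rebuild puts all twelve writes on a single disk and leaves the rest read-only.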
I did run this by Neil Brown to make sure I wasn't completely wrong, and he pointed me to a recent email where he explains things. If anyone knows, he does!
posted at: Thu, 07 Jul 2005 16:09 | in /code/linux

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.