From: Yu Kuai <yukuai3@xxxxxxxxxx>

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It is
important to maintain the consistency of redundant data.

A bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
a power failure or after re-adding a disk. Without a bitmap, a full disk
synchronization is required.

Key Features:

 - The IO fastpath is lockless. If the user issues many write IOs to the
   same bitmap bit in a short time, only the first write has the additional
   overhead of updating the bitmap bit; the following writes have no
   additional overhead;
 - Only written data is resynced or recovered, which means that when
   creating a new array or replacing a disk with a new one, there is no
   need for a full disk resync/recovery;

Key Concept:

 - State Machine:

Each bit is one byte and holds one of 6 different states, see llbitmap_state.
There are 8 different actions in total, see llbitmap_action, that can change
the state:

llbitmap state machine: transitions between states

|           | Startwrite | Startsync | Endsync | Abortsync |
| --------- | ---------- | --------- | ------- | --------- |
| Unwritten | Dirty      | x         | x       | x         |
| Clean     | Dirty      | x         | x       | x         |
| Dirty     | x          | x         | x       | x         |
| NeedSync  | x          | Syncing   | x       | x         |
| Syncing   | x          | Syncing   | Dirty   | NeedSync  |

|           | Reload   | Daemon | Discard   | Stale     |
| --------- | -------- | ------ | --------- | --------- |
| Unwritten | x        | x      | x         | x         |
| Clean     | x        | x      | Unwritten | NeedSync  |
| Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
| NeedSync  | x        | x      | Unwritten | x         |
| Syncing   | NeedSync | x      | Unwritten | NeedSync  |

Typical scenarios:

1) Create new array
All bits will be set to Unwritten by default; if --assume-clean is set,
all bits will be set to Clean instead.

2) write data; raid1/raid10 have a full copy of the data, while raid456
doesn't and relies on xor data

2.1) write new data to raid1/raid10:
Unwritten --StartWrite--> Dirty

2.2) write new data to raid456:
Unwritten --StartWrite--> NeedSync

Because the initial recovery for raid456 is skipped, the xor data is not
built yet. The bit must be set to NeedSync first, and after the lazy initial
recovery is finished, the bit will finally be set to Dirty (see 5.1 and 5.4);

2.3) cover write
Clean --StartWrite--> Dirty

3) daemon, if the array is not degraded:
Dirty --Daemon--> Clean

For a degraded array, the Dirty bit will never be cleared, which avoids a
full disk recovery when re-adding a removed disk.

4) discard
{Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten

5) resync and recover

5.1) common process
NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean

5.2) resync after power failure
Dirty --Reload--> NeedSync

5.3) recover while replacing with a new disk
By default, the old bitmap framework will recover all data, and llbitmap
implements this by a new helper, see llbitmap_skip_sync_blocks:

skip recover for bits other than dirty or clean;

5.4) lazy initial recover for raid5:
By default, the old bitmap framework will only allow a new recovery when
there are spares (new disks); a new recovery flag MD_RECOVERY_LAZY_RECOVER
is added to perform raid456 lazy recovery for set bits (from 2.2).

Bitmap IO:

 - Chunksize

The default bitmap size is 128k, including a 1k bitmap super block, and the
default size of the data segment covered by each bit (chunksize) is 64k.
The chunksize is doubled each time as long as the total number of bits is
not less than 127k (see llbitmap_init).

 - READ

While creating the bitmap, all pages are allocated and read for llbitmap;
there are no reads afterwards.

 - WRITE

WRITE IO is divided into units of the array's logical_block_size, and the
dirty state of each block is tracked independently. For example:

each page is 4k and contains 8 blocks; each block is 512 bytes and contains
512 bits;

| page0 | page1 | ... | page 31 |
|       |
|        \-----------------------\
|                                |
| block0 | block1 | ... | block7 |
|        |
|         \-----------------\
|                           |
| bit0 | bit1 | ... | bit511 |
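
For illustration only, the state machine table from the Key Concept section
can be written down as a small lookup table in userspace C. This is a minimal
sketch, not the code from this patch: all llb_* names are hypothetical
stand-ins for llbitmap_state and llbitmap_action, only the five states listed
in the table are modelled, and an 'x' entry is assumed to mean "the action
leaves the state unchanged". The raid456 special case from 2.2 (Unwritten
going to NeedSync on Startwrite) is not modelled; the sketch follows the
table only.

```c
#include <stddef.h>

/* Hypothetical names; the states and actions mirror the table rows/columns. */
enum llb_state {
	LLB_UNWRITTEN, LLB_CLEAN, LLB_DIRTY, LLB_NEEDSYNC, LLB_SYNCING,
	LLB_NR_STATES,
	LLB_NONE = 0xff,	/* stands for 'x' in the table */
};

enum llb_action {
	LLB_STARTWRITE, LLB_STARTSYNC, LLB_ENDSYNC, LLB_ABORTSYNC,
	LLB_RELOAD, LLB_DAEMON, LLB_DISCARD, LLB_STALE,
	LLB_NR_ACTIONS,
};

#define X LLB_NONE

/* Column order: Startwrite, Startsync, Endsync, Abortsync,
 *               Reload, Daemon, Discard, Stale */
static const unsigned char llb_table[LLB_NR_STATES][LLB_NR_ACTIONS] = {
	[LLB_UNWRITTEN] = { LLB_DIRTY, X, X, X,
			    X, X, X, X },
	[LLB_CLEAN]     = { LLB_DIRTY, X, X, X,
			    X, X, LLB_UNWRITTEN, LLB_NEEDSYNC },
	[LLB_DIRTY]     = { X, X, X, X,
			    LLB_NEEDSYNC, LLB_CLEAN, LLB_UNWRITTEN, LLB_NEEDSYNC },
	[LLB_NEEDSYNC]  = { X, LLB_SYNCING, X, X,
			    X, X, LLB_UNWRITTEN, X },
	[LLB_SYNCING]   = { X, LLB_SYNCING, LLB_DIRTY, LLB_NEEDSYNC,
			    LLB_NEEDSYNC, X, LLB_UNWRITTEN, LLB_NEEDSYNC },
};

#undef X

/* Look up the next state; assumption: 'x' keeps the current state. */
enum llb_state llb_transition(enum llb_state s, enum llb_action a)
{
	unsigned char next;

	if ((unsigned int)s >= LLB_NR_STATES || (unsigned int)a >= LLB_NR_ACTIONS)
		return s;

	next = llb_table[s][a];
	return next == LLB_NONE ? s : (enum llb_state)next;
}

/* Apply a sequence of actions to one bit and return its final state. */
enum llb_state llb_run(enum llb_state s, const enum llb_action *acts, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		s = llb_transition(s, acts[i]);
	return s;
}
```

With this sketch, feeding { Startsync, Endsync, Daemon } to llb_run() for a
bit in NeedSync walks the common process from 5.1 and ends in Clean, and a
single Reload on a Dirty bit gives NeedSync as in 5.2.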