Re: [18.2.4 Reef] Using RAID1 as Metadata Device in Ceph – Risks and Recommendations

[Disclaimer: AI-assisted answer]

There is an interesting quirk in the Linux RAID1 implementation related
to data races, documented in the md(4) manual page,
https://man7.org/linux/man-pages/man4/md.4.html, under the heading
"Scrubbing and mismatches".

When the md driver initiates writeback of a page, it asks both RAID
members independently and simultaneously to write that page to their
persistent storage. If the page changes during the writeback (e.g.,
disk 0 services the write before the data changes and disk 1 does so
afterwards), the two drives may end up with different data, creating
an internal inconsistency. This commonly happens when RAID1 devices
are used for swap, but in theory any use of mmap(2) can trigger it as
well, especially when writeback is caused by memory pressure rather
than an explicit msync().

This is relevant, as RocksDB uses mmap().
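To make the race concrete, here is a minimal C sketch of the
problematic pattern (purely illustrative; the file path and the plain
counter are my own placeholders, and this is not a claim about how
RocksDB configures its mmap usage): a thread keeps dirtying an
mmap(2)-ed page while the kernel flushes it in the background, with no
msync() anywhere.

  /*
   * Illustrative only: a writer thread keeps modifying an mmap(2)-ed
   * page without msync(); writeback happens whenever the kernel
   * decides, racing with the writer.  On md RAID1, the two members may
   * then receive different snapshots of that page.  The file path is
   * an arbitrary placeholder.
   */
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static volatile uint64_t *counter;

  static void *writer(void *arg)
  {
      (void)arg;
      for (;;)
          (*counter)++;              /* keeps the page permanently dirty */
      return NULL;
  }

  int main(void)
  {
      int fd = open("/mnt/md0/testfile", O_RDWR | O_CREAT, 0644);
      if (fd < 0 || ftruncate(fd, 4096) != 0) {
          perror("open/ftruncate");
          return 1;
      }

      counter = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (counter == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      pthread_t t;
      pthread_create(&t, NULL, writer, NULL);

      /* No msync(): background writeback (dirty-page timers, memory
       * pressure) flushes the page while it is still being changed. */
      sleep(60);
      return 0;
  }

Build with "cc -O2 -pthread race.c". If the mapped file lives on an md
RAID1 array, every background flush of that constantly-changing page is
a chance for the two members to receive different contents.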

Usually, such a mismatch means (at least as far as the md driver is
concerned) that the application does not care which of the copies is
correct, since it was changing the data mid-flight anyway. Still, it
can cause non-repeatable reads afterwards, because some read operations
may be serviced from disk 0 and others from disk 1.

If the RAID array does not have a write-intent bitmap (upstream's
default is merely an interactive recommendation to create one, with
"N" as the default answer), such inconsistencies can survive
indefinitely after a power failure.

I don't know whether RocksDB has an implicit assumption that rereads
of data that was being written during a power failure yield the same
results.

If you want to avoid this effect, use RAID5 instead of RAID1. You can
still create a RAID5 array with just two disks. The difference is
that, unlike RAID1, the kernel makes a copy of the to-be-written
memory area and then writes that copy (which is guaranteed to stay
stable during the write operation) to the disks.


On Thu, Sep 11, 2025 at 5:19 PM Alex from North <service.plant@xxxxx> wrote:
>
> >Of course, but remember that Ceph stores data across multiple OSDs and hosts for just that
> >reason.
>
> Yes, but when an NVMe holding the db+wal for 5 OSDs (just for instance) goes down, all 5 OSDs go down with it. Keeping in mind that each of those OSDs is a 16TB HDD, this in turn triggers a massive flow of recovery traffic. This is what I'd like to avoid. If one member of an md RAID dies, no worries: we have another one which (in theory) should substitute for the dead one on the fly.
>
> >You're still burning the SSD endurance twice as quickly.
> Indeed, you are right! But in my opinion it is better to bear the additional cost of a couple of new NVMe disks than to make clients suffer.
>
> > The system would have cost less and been more reliable were it all-NVMe with no HBA.
>
> In my case I need thick and cheap cold storage, and an all-flash cluster cannot beat the price :(
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



-- 
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx