[Disclaimer: AI-assisted answer] There is an interesting quirk in the Linux RAID1 implementation in connection with data races, documented in the manual page, https://man7.org/linux/man-pages/man4/md.4.html, under the heading "Scrubbing and mismatches".

When the md driver initiates the writeback of a page, it asks both RAID members independently and simultaneously to write that page to their persistent storage. If the page changes during the writeback (e.g., if disk 0 serviced the write operation before the data changed, and disk 1 serviced it after the change), the two drives may receive different data, thus creating an internal inconsistency. This is common when RAID1 devices are used for swap, but in theory, any use of mmap(2) can also trigger it, especially when the writeback is initiated by memory pressure rather than by an explicit msync(). This is relevant here because RocksDB uses mmap().

Such a mismatch usually means (at least for the md driver) that the software does not care which of the copies is correct, as it changed the data while the write was in flight. Yet it can cause non-repeatable reads afterward, as some read operations get serviced from disk 0 and some from disk 1. If the RAID array does not have a write-intent bitmap (the upstream default is an interactive recommendation to create one, with "N" being the default answer), such inconsistencies can survive indefinitely after a power failure. I don't know whether RocksDB implicitly assumes that rereads of data that was being written during a power failure yield the same result.

If you want to avoid this effect, please use RAID5 instead of RAID1. You can still create a RAID5 array with just two disks. The difference is that, unlike RAID1, the kernel makes a copy of the to-be-written memory area and then writes that copy (which is guaranteed to stay stable during the write operation) to the disks.

On Thu, Sep 11, 2025 at 5:19 PM Alex from North <service.plant@xxxxx> wrote:
>
> > Of course, but remember that Ceph stores data across multiple OSDs
> > and hosts for just that reason.
>
> Yes, but when an NVMe with, for instance, 5 OSDs' db+wal on it goes
> down, all 5 OSDs go down as well. Keeping in mind that each of those
> OSDs is a 16TB HDD, that in turn brings a massive recovery byteflow
> to life. This is what I'd like to avoid. If one member of an md RAID
> dies - no worries, we have another one which (in theory) should
> substitute for the dead one in flight.
>
> > You're still burning the SSD endurance twice as quickly.
>
> Indeed, you are right! But in my opinion it is better to bear the
> additional cost of a couple of new NVMe disks than to make clients
> suffer.
>
> > The system would have cost less and been more reliable were it
> > all-NVMe with no HBA.
>
> In my case I need thick and cheap cold storage, and an all-flash
> cluster cannot beat the price :(

--
Alexander Patrakov
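
P.S. If you want to check whether an existing RAID1 array has already
accumulated such mismatches, the md driver exposes a manual scrub
through sysfs, as described under the same "Scrubbing and mismatches"
heading. A minimal sketch, assuming the array is /dev/md0 (adjust the
name to your setup):

    # Ask the md driver to compare the members without repairing anything
    echo check > /sys/block/md0/md/sync_action
    # Once the check finishes (watch /proc/mdstat), read the count of
    # mismatched sectors found during the last scrub
    cat /sys/block/md0/md/mismatch_cnt

On a RAID1 array that hosts swap or mmap-heavy workloads, a non-zero
mismatch_cnt is often just the benign in-flight-modification effect
described above, but there is essentially no way to tell it apart from
real corruption.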
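
Adding a write-intent bitmap to an existing array is also a one-liner.
With the bitmap in place, the regions that were dirty at the moment of
a power failure are resynced on the next assembly (one member's copy is
propagated to the other), so the inconsistency cannot survive
indefinitely. Again assuming /dev/md0:

    mdadm --grow --bitmap=internal /dev/md0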
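
And, for completeness, creating the suggested two-disk RAID5; the
device names here are placeholders:

    mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

A two-disk RAID5 stores, per stripe, one data chunk plus its parity,
and the XOR parity of a single chunk is a byte-for-byte copy of that
chunk, so the usable capacity matches RAID1 while the writes go through
the RAID5 code path that stabilizes the page before writing it out.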