> >> IMHO this isn't the right layer for this. An admin wishing to mirror the offload
> >> device should do so (via MD or (sigh) an HBA) and present that device in the OSD spec.
> >> ymmv.
>
> Hello Anthony! The idea behind this is to keep the OSDs (with data on HDD) running in case the metadata device goes down.

Of course, but remember that Ceph stores data across multiple OSDs and hosts for just that reason. This isn't IMHO the most effective place to spend money. You're effectively doing RAID on RAID.

> For sure, my test lab with an md RAID1 across two namespaces of the same NVMe device is only there to illustrate the idea; it makes no sense in the real world. In prod I intend to use, for instance, nvme0n1 and nvme1n1 joined into a RAID1.
> And the problem is that ceph-volume tries to get the blkid, which is obviously empty for /dev/md127. That is what I intend to modify in the code.
>
> >> Conventional wisdom has favored instead offloading fewer OSDs to each SSD to reduce write
> >> amp and the blast radius.
>
> Indeed, but I use fast SSD devices for WAL and DB.

You're still burning the SSD endurance twice as quickly.

> BTW, don't you like HBAs? Because you sighed when you mentioned them. :) Why? Because of the price?

I've seen a tri-mode RAID HBA in use that had a list price of USD 2000 when it was purchased. All it was doing was mirroring the boot drives, and then it failed. The system would have cost less and been more reliable were it all-NVMe with no HBA. RAID HBAs are an anachronism: they're fussy, almost nobody monitors them, and the money is better spent elsewhere.
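
To make the ceph-volume point above concrete, here is a minimal, standalone sketch of the kind of fallback being proposed. It is not ceph-volume's actual code path; the helper name get_device_uuid and the choice of mdadm --detail --export as the fallback are my own assumptions, used only to show the shape of the change.

    import os
    import subprocess


    def get_device_uuid(dev: str) -> str | None:
        """Return a stable UUID for *dev*, falling back to the md array UUID
        when blkid reports nothing (e.g. a bare /dev/md127 with no signature)."""
        # First try blkid, which is roughly what ceph-volume relies on today.
        try:
            out = subprocess.run(
                ["blkid", "-o", "value", "-s", "UUID", dev],
                capture_output=True, text=True, check=False,
            ).stdout.strip()
            if out:
                return out
        except FileNotFoundError:
            pass  # blkid not installed; fall through to the md-specific path

        # Fallback for md devices: ask mdadm for the array UUID instead.
        if os.path.basename(dev).startswith("md"):
            try:
                detail = subprocess.run(
                    ["mdadm", "--detail", "--export", dev],
                    capture_output=True, text=True, check=False,
                ).stdout
                for line in detail.splitlines():
                    if line.startswith("MD_UUID="):
                        return line.split("=", 1)[1].strip()
            except FileNotFoundError:
                pass

        return None


    if __name__ == "__main__":
        print(get_device_uuid("/dev/md127"))

Whether upstream ceph-volume would accept an md array as a DB/WAL device at all, and where such a fallback would belong in its device-probing code, is of course for the actual patch to sort out.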