Re: How important is the "default" data pool being replicated for CephFS

I held off replying here hoping that someone more authoritative would step in, but I have a few thoughts that might help or stimulate conversation.

> 
> The recommendation for CephFS is to make the default data pool replicated,
> and to add any EC data pools via file layouts:
> https://docs.ceph.com/en/latest/cephfs/createfs/

That's my understanding too.  AIUI one reason is that CephFS stores a backtrace object in the default data pool for every file in the filesystem, even files whose data lands in other pools via layouts, so that pool sits on the metadata path and benefits noticeably from being fast and replicated.
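You can see the mechanism from the command line; a rough sketch, assuming a kernel mount at /mnt/cephfs and a default data pool called cephfs_data (adjust names to your setup):

    # Get a file's inode number and turn it into the RADOS object name of
    # its head object (hex inode + ".00000000").
    ino=$(stat -c %i /mnt/cephfs/some/file)
    obj=$(printf '%x.00000000' "$ino")

    # Even when the file's data lives in another pool via a layout, the
    # default data pool should still hold a (typically zero-length) head
    # object for it...
    rados -p cephfs_data stat "$obj"

    # ...carrying the backtrace in its "parent" xattr.
    rados -p cephfs_data listxattr "$obj"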

> I have a CephFS that unfortunately wasn't set up like this: they just made
> an EC pool on the slow HDDs as the default, which sounds like the worst
> case scenario to me.

It could be worse - those slow HDDs could be attached via USB 1 ;)  I've seen it done.

> I would like to add an NVMe data pool to this CephFS, but the
> recommendation gives me pause about whether I should instead go through the
> hassle of creating a new CephFS and migrating all users.

That wouldn't be a horrible idea.  My understanding, which may be incomplete, is that the default/root data pool can't be removed or swapped out once the filesystem exists.
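A quick way to see which pool is the default, and (IIRC) that the mons will refuse to drop it; the fs and pool names here are placeholders:

    # The first pool listed in data_pools is the default/root data pool.
    ceph fs get cephfs | grep data_pools

    # IIRC this is refused for the default data pool; only additional data
    # pools added later can be removed this way.
    ceph fs rm_data_pool cephfs cephfs_data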

Something you could do fairly easily is point that pool at a CRUSH rule that restricts placement to the nvme (or ssd) device class; the data will then migrate on its own (rough sketch below).  upmap-remapped.py could be used to moderate the thundering herd.  EC still wouldn't be ideal, but this would limit client disruption.
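Here's roughly what that could look like, assuming the pool keeps its existing 10+2 geometry; the profile, rule, and pool names are made up and every step deserves a dry run on a test cluster:

    # New EC profile that matches the pool's k+m but pins placement to nvme.
    ceph osd erasure-code-profile set ec-10-2-nvme \
        k=10 m=2 crush-failure-domain=host crush-device-class=nvme

    # Build a CRUSH rule from it and point the existing pool at that rule.
    ceph osd crush rule create-erasure ec-10-2-nvme-rule ec-10-2-nvme
    ceph osd pool set cephfs_data_ec crush_rule ec-10-2-nvme-rule

    # Optionally set norebalance first, use upmap-remapped.py to pin PGs
    # where they currently sit, then let the balancer trickle them across.
    ceph osd set norebalance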

> I've tried to run some mdtest with small 1k files to see if i could measure
> this difference, but speed is about the same in my relatively small tests
> so far. I'm also not sure what impact I should realistically expect here. I
> don't even know if creating files counts as "updating backtraces", so my
> testing might just be pointless.

Are you running with a large number of files for an extended period of time? From multiple clients?  Gotta eliminate any cache effects.
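For what it's worth, something along these lines is roughly what I'd run (an illustrative mdtest invocation, not a tuned recipe; client count, file count, and paths are placeholders, and the mpirun syntax assumes Open MPI):

    # From several client nodes at once: files only, unique dir per rank,
    # 1 KiB written per file, enough files to get past the MDS and client
    # caches, and a few iterations for stable numbers.
    mpirun -np 16 --hostfile clients.txt \
        mdtest -F -u -n 100000 -w 1024 -i 3 -d /mnt/cephfs/mdtest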

> 
> I guess my core question is; just how important is this suggestion to keep
> the default data pool on replicated NVME?
> 
> Setup:
> 14 hosts x 42 HDD + 3 NVMEs for db/wal  2*2x25 GbitE bonds
> 12 hosts x 10 NVME. 2*2x100 GbitE bonds
> 
> Old CephFS setup:
> - metadata: replicated NVME
> - data-pools: EC 10+2 on HDD  (I plan to add an EC NVMe pool here via
> layouts)
> 
> New CephFS setup as recommended:
> - metadata: replicated NVME
> - data-pools: replicated NVME (default), EC 8+2 on HDD via layout, EC 8+2
> on NVME via layout.

Glad to see that you aren't making k+m equal to the number of hosts; with a host failure domain that would leave no spare host to recover onto, so PGs would sit degraded whenever a single host is down.

> 
> Ceph 18.2.7
> 
> 
> Best regards, Mikael