Re: Increase amount of OSDs on nodes

> 

> I’m currently facing the following challenge where I’d like to hear what the community thinks on how to solve it.
> 
> We have a Ceph cluster (only using RGW) which gave us some performance issues when we initially set it up, so instead of using all 60 HDDs installed in each server

How many of those ultra-dense nodes are in the cluster?  There are multiple concerns with nodes like that, including HBA and backplane saturation as well as network throughput.

> we only use half (30) of the HDDs - they exist but are set to “out”. Using only half of the OSDs per server solved our performance issues

Sounds like a network or HBA/backplane saturation issue.  Do you have network stats?  How fast is your networking?  Do you have a separate replication network?
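If you don't already have historical stats, a quick way to sanity-check the network side (interface and host names below are placeholders):

    ethtool eth0 | grep -i speed      # negotiated link speed
    sar -n DEV 1                      # per-interface throughput while under client load
    iperf3 -s                         # on one OSD node
    iperf3 -c osd-node-2              # on another: raw node-to-node bandwidth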

> - that’s not really part of my question, just a bit of backstory for this weird setup.
> We have now optimised things and are looking into getting those OSDs back “in”. However, we have quite some client load on the system and we want to keep any necessary downtime (for users) low.

Upmap-remapped is a fine solution.  Run it to freeze the current mappings, mark the OSDs in/up, and let the balancer move data slowly.
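Roughly, and only as a sketch (the OSD IDs are placeholders, and upmap-remapped.py is the CERN ceph-scripts tool):

    ceph osd set norebalance            # keep data still while flipping OSDs in
    ceph osd in 30 31 32 ...            # the currently-"out" OSDs
    ./upmap-remapped.py | sh            # pin PGs back onto their current OSDs via upmap
    ceph osd unset norebalance
    ceph balancer on                    # let the balancer remove the upmaps gradually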

> The HDDs are only used for the buckets.data pool (using erasure coding) as the other pools necessary for RGW are on separate NVMe drives.
> 
> To achieve this I was looking into using pgremapper and upmap-remapped. My expectation was to be able to remap all PGs to stay on the “old” OSDs (the ones that are currently “in”) and have zero PGs on the “new” OSDs (the ones that are currently “out”). Once that part is done, I thought I could somehow double the number of PGs to cover the now doubled capacity of the pool.

That may or may not be necessary.  How many nodes? How many OSDs?  

Send the output of:

    ceph osd dump | grep pool
    ceph status
    ceph osd df | head -20
    ceph osd pool autoscale-status

> 
> What I tried so far (with some variations):
> - Set norecover and nobackfill flags
> - Set “new” OSDs to “in”
> - Use pgremapper and/or upmap-remapped to create upmap entries so the acting OSDs of each PG are unchanged
> - Increase the amount of PGs
> - Map new PGs to “new” OSDs
> - (Maybe remove the upmap entries for the PGs one at a time over an extended period to allow for automatic balancing again)

Upmap-remapped can be a great solution, but do one thing at a time.  Freeze the mappings where they are, mark the OSDs in/up, and let the balancer move the data slowly.  Then when the cluster settles consider pg_num for the pools.
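If you want that movement to be especially gentle, the amount of misplaced data the balancer is allowed to create at once can be turned down (the value below is just an example; the default is 0.05):

    ceph balancer mode upmap
    ceph config set mgr target_max_misplaced_ratio 0.01
    ceph balancer on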

If you are using the autoscaler, consider setting mon_target_pg_per_osd to 250.
If not, we can look at where your pg_num values are now and solve for perhaps 150-200 PG replicas per HDD OSD and double that for the NVMe OSDs.
Remember that these days we set `pg_num` for a pool and Ceph gradually scales up or down over time.
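As a purely hypothetical sizing example (substitute your real OSD count, EC profile, and pool name): with an EC 4+2 data pool on 360 HDD OSDs, aiming for ~200 PG replicas per OSD works out to roughly 200 * 360 / 6 = 12000, which you'd round to a power of two such as 8192. Then something like:

    ceph config set global mon_target_pg_per_osd 250        # if relying on the autoscaler
    ceph osd pool set default.rgw.buckets.data pg_num 8192  # pool name is a guess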


> 
> I’m trying this on a smaller test setup with 6 servers and 8 OSDs per server (4 in / 4 out), so 48 OSDs in total. For testing I filled the “in” OSDs to around 50% so the overall used capacity is also 50% - when bringing the other 24 OSDs "in" this of course means only 25% of the overall capacity is used.
> 
> The following information is based on this test setup.
> 
> The pgremapper/upmap-remapped part is where I’m currently stuck as I always end up with around 2% of objects being misplaced.

I often find that upmap-remapped needs a second, sometimes a third pass to clear up everything.  Note that right in the code it says:

# 4. Run this script a few times. (Remember to | sh)
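In practice that just means repeating until the misplaced count stops shrinking, e.g.:

    ./upmap-remapped.py | sh
    ceph status | grep -i misplaced     # rerun the pair until this is (near) zero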


> What I was also able to manage: if, after setting all OSDs to “in”, I remap the PGs, increase the PG count and remap again, I end up with around 3% misplaced objects - however, the additional/new OSDs then have zero PGs again.

Do you have the balancer enabled?  You want to NOT have norecover/norebalance set after freezing the PGs.
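For example, to check:

    ceph balancer status            # mode should be upmap and it should be active
    ceph osd dump | grep flags      # confirm norecover/norebalance/nobackfill are cleared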

> Is it somehow possible to add the OSDs and double the PG count without causing any (or only minimal) backfill?

Adding PGs necessarily means splitting PGs and thus backfill.  With recent Ceph releases this is usually manageable.  But I would bring all the OSDs up/in first and then worry about pg_num.
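When you do raise pg_num, the resulting backfill can be throttled; for example (note that with the mClock scheduler in Quincy and later, osd_max_backfills is only honoured if the override flag is set):

    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 1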

> If possible I also want to bring in the “new” OSDs in batches so we can check the performance after each batch to avoid having issues in the final state with all OSDs set to “in”.
> 
> 
> Thanks in advance for any input.
> 
> 
> Cheers,
> Florian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



