Re: Doubled numbers of PGs from 8192 to 16384 - backfill bottlenecked


 



> 
>> I suggest playing with https://docs.ceph.com/en/squid/rados/operations/pgcalc/
>> … setting the target PGs per OSD to 250
> 
> There was a thread[1] last year about many PGs per OSD without any firm conclusions, so we are going to bump our number of PGs for the largest HDDs a lot higher than 250 while keeping an eye on the impact. Currently sitting at something like 550 PGs for a 20TB drive.

As OSDs grow ever larger, I think we need to weigh the costs of more PGs (more peering, more memory) against the costs of extremely large PGs (a coarser, less uniform data distribution, and any backfill/remap becomes a huge operation).
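As a rough back-of-the-envelope illustration (assuming a mostly full drive and a replicated pool, so each PG shard holds roughly OSD capacity divided by PGs per OSD):

  20 TB / 250 PGs per OSD  ≈ 80 GB per PG shard
  20 TB / 550 PGs per OSD  ≈ 36 GB per PG shard

so at the lower PG count, every single backfill or remap has to move more than twice as much data in one go.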

In the past there was the idea that SSDs can “handle” more PGs than HDDs from a parallelism perspective, but over time I’ve come to suspect that may not be strictly the case, as the cluster’s ops are still spread across the same set of drives regardless of pg_num, and the driver/firmware still reorders requests (elevator scheduling or the like).  Maybe that was an artifact of Filestore that is no longer relevant?

Regarding backfill parallelism, I’ve run into similar situations; titrating osd_max_backfills seemed to help, e.g. something along these lines.
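A minimal sketch of how I’d adjust it (the value 2 is just a placeholder to tune up or down while watching client latency; note that on recent releases using the mClock scheduler you may also need osd_mclock_override_recovery_settings=true for the setting to take effect):

  ceph config set osd osd_max_backfills 2                          # cluster-wide default for OSDs
  ceph config show osd.0 osd_max_backfills                         # verify what a given OSD is actually running
  ceph config set osd osd_mclock_override_recovery_settings true   # only if mClock is otherwise ignoring the override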




