Re: Squid 19.2.2 - mon_target_pg_per_osd change not applied

> Hi Anthony,
> 
> Thanks for this very useful and comprehensive summary.

You bet.  We’re a community!


> In your message you say "Remember too that this figure aka the pg ratio is per-pool, not per-pool-per-osd." Isn't it that the ratio is per OSD rather than per pool?

This can be subtle.

By “this figure” I meant 

pg_num = (#OSDs * ratio) / replication
ratio = (pg_num * replication) / #OSDs

If you have a cluster with only one significant pool — notably if you only provide RBD service — the calculation can be straightforward.  When you have multiple pools it gets more complicated.  Your cluster’s total number of PGs almost certainly won’t be a power of 2 because it’s a sum, but each pool’s pg_num should be.

In these calculations, for EC pools use a replication value of k+m.
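
To make that concrete, here’s a quick back-of-the-envelope in Python with made-up numbers (a hypothetical 100-OSD cluster with one dominant pool, not yours):

# Hypothetical numbers for illustration only
osds = 100
target_ratio = 200                    # desired PG replicas per OSD

# One dominant replicated pool, size 3:
pg_num = osds * target_ratio / 3      # ~6667 -> round to a power of 2: 4096 or 8192

# Same budget for an EC 6+2 pool, where "replication" is k+m = 8:
pg_num_ec = osds * target_ratio / 8   # 2500 -> 2048

# Going the other way, the ratio one pool contributes:
ratio = 4096 * 3 / osds               # ~123 PG replicas per OSD from that pool alone

With multiple pools sharing the same OSDs, they all draw from that per-OSD budget.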

RGW index and CephFS metadata pools need more PGs than their ostensible amount of stored data alone would indicate; this is what the autoscaler’s “BIAS” value accounts for.

Today the autoscaler works well for many people, especially if you bump the target from the default 100 to at least 200; higher if you have especially large OSDs or aren’t starved for CPU and RAM.

I’ve been calculating these for years, sometimes literally in my sleep, so it’s second nature, but it’s a learning curve for sure.  The pg autoscaler is intended to improve usability by doing the calculations for you, with some hints.


When solving for the pg ratio, the number of PG replicas on each OSD, we don’t just divide the total number of PGs by the total number of OSDs.  We have to account for replication and possibly for different media.
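
As a sketch of that bookkeeping (pool names, pg_num values, and the OSD count below are hypothetical, just to show the shape of the calculation; pools on a different device class would be summed against that class’s OSDs instead):

# Hypothetical pools and OSD count, for illustration only.
# "size" is the replication factor, or k+m for EC pools.
pools_on_hdd = {
    "rbd":              (4096, 3),    # replicated, size 3
    "rgw.buckets.data": (2048, 8),    # EC 6+2
}
hdd_osds = 200

pg_replicas = sum(pg_num * size for pg_num, size in pools_on_hdd.values())
print(pg_replicas / hdd_osds)         # 28672 / 200 = ~143 PG replicas per HDD OSD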

This calculator is a bit dated but still useful:  https://docs.ceph.com/en/latest/rados/operations/pgcalc/

pg_num is set *per pool*, so each pool should have a power-of-2 number of PGs.  A couple of examples:

pool 76 'default.rgw.buckets.data' erasure profile EC6-2 size 8 min_size 7 crush_rule 4 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
pool 71 'testbench' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off 

# ceph osd df | head
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP      META     AVAIL    %USE   VAR   PGS  STATUS
217    hdd  18.53969   1.00000   19 TiB   12 TiB   12 TiB     8 KiB   80 GiB  6.1 TiB  66.88  1.00  125      up
219    hdd  18.53969   1.00000   19 TiB   12 TiB   12 TiB     3 KiB   86 GiB  6.1 TiB  66.87  1.00  124      up
221    hdd  18.53969   1.00000   19 TiB   12 TiB   12 TiB     4 KiB   80 GiB  6.1 TiB  66.87  1.00  124      up


Here the PGS column shows 124-125 PG replicas per OSD.  These OSDs support multiple pools, some replicated and some EC, including the two above.
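
For a rough sense of how a PGS figure adds up, here’s the same arithmetic applied to just the two pools above with a made-up HDD OSD count; the real cluster has other pools (and a different OSD count), so this won’t reproduce the 124-125 exactly:

# The two pools are the real ones shown above; the OSD count is hypothetical.
hdd_osds = 500                        # made-up number for illustration
buckets_data = 2048 * 8               # EC 6+2: 16384 PG shards cluster-wide
testbench    = 32 * 3                 # replicated size 3: 96 PG replicas
print((buckets_data + testbench) / hdd_osds)   # ~33 PG replicas per OSD from these two pools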

# ceph pg stat
6979 pgs

The .mgr pool is always there with one PG, so the cluster’s total number of PGs is normally odd, not even.  This stat shows that the cluster’s total PG count isn’t going to be a power of two, but each pool’s value should be.

Please feel free to send `ceph osd pool ls detail` for specific recommendations.  


> 
> Best regards,
> 
> Michel
> Sent from my mobile
> 
> Le 25 juillet 2025 02:21:20 "Anthony D'Atri" <aad@xxxxxxxxxxxxxx> a écrit :
> 
>> These options can be confusing.  I’ll wax didactic here for future Cephers searching the archives.
>> 
>>> I wanted to increase the number of PG per OSD
>> 
>> Many clusters should ;)
>> 
>>> ....and did so by using
>>> 
>>> ceph config set global mon_target_pg_per_osd 800
>>
>> I kinda wish this had a different name, like pg_autoscaler_target_pg_per_osd.
>> 
>> This value affects the autoscaler’s target, so it only comes into play for pools where the autoscaler is not disabled.  It acts IIRC as an upper bound, and the per-pool pg_num values are calculated by the autoscaler accordingly.  Each pool’s pg_num should be a power of 2, and the autoscaler applies a bit of additional conservatism to, I suspect, avoid flapping between two values.  Like my father’s 1976 Ford Elite would gear-hunt climbing hills.
>> 
>> That said, 800 is higher than I would suggest for most any OSD.  I would suggest a value there of, say, 300-400, and you’ll end up with something like 200-300.  If you aren’t using the autoscaler this doesn’t matter, but I would suggest raising it accordingly Just In Case one day you do, or you create new pools that do — you’d want the behavior to approximate that of your manual efforts.
>> 
>> I usually recommend either entirely disabling the autoscaler or going all-in for all pools.
>> 
>>> Although the OSD config has the new value , I am still unable to create pools 
>>> that end up creating more than 250PG per OSD 
>> 
>> That guardrail is mon_max_pg_per_osd.  Read the message more closely, I suspect you saw mon_max_pg_per_osd in there and thought mon_target_pg_per_osd.  If so, you’re not the first to make that mistake, especially before I revamped the wording of that message a couple years back.
>> 
>> Remember too that this figure aka the pg ratio is per-pool, not per-pool-per-osd.  So when you have multiple pools that use the same OSDs, they all contribute to the sum.  This is the PGS value at right of “ceph osd df” output.
>> 
>> For both of these options, check that you don’t have an existing value at a different scope, which can lead to unexpected results.
>> 
>> ceph config dump | grep pg_per_osd
>> 
>> In many cases it’s safe and convenient to set options at global scope instead of mon, osd, etc. (the “who” field), as some need to be set at scopes that are not intuitive.  If you need to set an option to different values for different hosts, daemon types, specific OSDs, etc., then the more-granular scopes are useful.  RGW options, for example, live under the “client” scope.
>> 
>> So think of mon_max_pg_per_osd as a guardrail or failsafe that prevents you from accidentally typing too many zeros, which can have various plusungood results.  I personally set it to 1000; you might choose a more conservative value of, say, 600.  Subtly, this can come into play when you lose a host and the cluster recovers PGs onto other OSDs, which can push them over the threshold and keep PGs from activating.  If your cluster has OSDs of significantly varying weight (size), this effect is especially likely, and you’ll scratch your head wondering why on earth the PGs won’t activate.
>> 
>> — aad
>> 
>> 
>>> 
>>> I have restarted the OSDs ...and the monitors ...
>>> 
>>> Any ideas or suggestions for properly applying the change would be appreciated 
>>> 
>>> Steven
>>> 
>> 
> 
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



