> No, we use the same mons (which are also backed by UPS/Diesel). The idea behind doing this was to allow sharing the OSDs between the pools using this "critical area" (thus OSDs located only in this row) and the other, normal pools, to avoid dedicating a potentially large storage volume to this critical area, which doesn't require much.

Groovy. Since you were clearly targeting higher availability for this subset of data, I wanted to be sure that your efforts weren't confounded by the potential for the mons to not reach quorum, which would make the CRUSH hoops moot.

> Thus also the choice to reweight the OSDs in this row so that they are used less than other OSDs by the normal pools, to avoid exploding the number of PGs on these OSDs.
>
> I am not sure that we can use a custom device class to achieve what we had in mind, as this will not allow sharing an OSD between critical and non-critical pools.

The above two statements seem a bit at odds with each other. In the first you're discouraging sharing, and those OSDs may fill up as your critical dataset grows; in the second you want to share.

> But it may in fact be a better way: dedicating only a fraction of the OSDs on each server in the "critical row" to these pools and using the other OSDs on these servers for normal pools, without any reweighting. Thanks for the idea.

You bet. It seems like a cleaner approach. You might consider a reclassify operation https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes to update the CRUSH map and rules at the same time; a minimal sketch of the device-class commands is included at the bottom of this message.

> We may also try to bump the number of retries to see if it has an effect.
>
> Best regards,
>
> Michel
>
> On 16/04/2025 at 13:16, Anthony D'Atri wrote:
>> First time I recall anyone trying this. Thoughts:
>>
>> * Manually edit the crush map and bump retries from 50 to 100
>> * Better yet, give those OSDs a custom device class and change the CRUSH rule to use that and the default root.
>>
>> Do you also constrain mons to those systems ?
>>
>>> On Apr 16, 2025, at 6:41 AM, Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We have a use case where we would like to restrict some pools to a subset of the OSDs located in a particular section of the CRUSH map hierarchy (OSDs backed by UPS/Diesel). For these (replica 3) pools we tried to define a specific CRUSH rule with the root parameter set to a specific row (which contains 3 OSD servers with ~10 OSDs each). At the beginning it worked, but after some time (probably after doing a reweight on the OSDs in this row to reduce the number of PGs from other pools), a few PGs are active+clean+remapped and 1 is undersized.
>>>
>>> 'ceph osd pg dump|grep remapped' gives output similar to the following for each remapped PG:
>>>
>>> 20.1ae 648 0 0 648 0 374416846 0 0 438 1404 438 active+clean+remapped 2025-04-16T07:19:40.507778+0000 43117'1018433 48443:1131738 [70,58] 70 [70,58,45] 70 43117'1018433 2025-04-16T07:19:40.507443+0000 43117'1018433 2025-04-16T07:19:40.507443+0000 0 15 periodic scrub scheduled @ 2025-04-17T19:18:23.470846+0000 648 0
>>>
>>> We can see that we currently have 3 replicas but that Ceph would like to move to 2... (the undersized PG currently has only 2 replicas, for an unknown reason, probably the same one).
>>>
>>> Is it wrong to do what we did, i.e. use a row for the CRUSH rule root parameter? If not, where could we find more information about the cause?
>>>
>>> Thanks in advance for any help.
>>>
>>> Best regards,
>>>
>>> Michel
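
Regarding the retry bump mentioned above: that means manually editing the CRUSH map. Reweights below 1.0 make CRUSH probabilistically reject candidate OSDs, so with only three hosts in the row it can run out of attempts before finding a third host, which would fit the up set [70,58] versus acting set [70,58,45] in the pg dump above. A minimal sketch of the edit workflow, assuming the 100 from my earlier suggestion as the example value; the filenames are arbitrary:

    # Export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # In crushmap.txt, raise the retry tunable, e.g.:
    #   tunable choose_total_tries 100

    # Recompile and inject the modified map
    crushtool -c crushmap.txt -o crushmap.new.bin
    ceph osd setcrushmap -i crushmap.new.bin
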
>>>
>>> --------------------- Crush rule used -----------------
>>>
>>> {
>>>     "rule_id": 2,
>>>     "rule_name": "ha-replicated_ruleset",
>>>     "type": 1,
>>>     "steps": [
>>>         {
>>>             "op": "take",
>>>             "item": -22,
>>>             "item_name": "row-01~hdd"
>>>         },
>>>         {
>>>             "op": "chooseleaf_firstn",
>>>             "num": 0,
>>>             "type": "host"
>>>         },
>>>         {
>>>             "op": "emit"
>>>         }
>>>     ]
>>> }
>>>
>>> ------------------- Beginning of the CRUSH tree -------------------
>>>
>>> ID   CLASS  WEIGHT     TYPE NAME                      STATUS  REWEIGHT  PRI-AFF
>>>  -1         843.57141  root default
>>> -19         843.57141      datacenter bat.206
>>> -21         283.81818          row row-01
>>> -15          87.32867              host cephdevel-76079
>>>   1    hdd    7.27739                  osd.1              up   0.50000  1.00000
>>>   2    hdd    7.27739                  osd.2              up   0.50000  1.00000
>>>  14    hdd    7.27739                  osd.14             up   0.50000  1.00000
>>>  39    hdd    7.27739                  osd.39             up   0.50000  1.00000
>>>  40    hdd    7.27739                  osd.40             up   0.50000  1.00000
>>>  41    hdd    7.27739                  osd.41             up   0.50000  1.00000
>>>  42    hdd    7.27739                  osd.42             up   0.50000  1.00000
>>>  43    hdd    7.27739                  osd.43             up   0.50000  1.00000
>>>  44    hdd    7.27739                  osd.44             up   0.50000  1.00000
>>>  45    hdd    7.27739                  osd.45             up   0.50000  1.00000
>>>  46    hdd    7.27739                  osd.46             up   0.50000  1.00000
>>>  47    hdd    7.27739                  osd.47             up   0.50000  1.00000
>>>  -3          94.60606              host cephdevel-76154
>>>  49    hdd    7.27739                  osd.49             up   0.50000  1.00000
>>>  50    hdd    7.27739                  osd.50             up   0.50000  1.00000
>>>  51    hdd    7.27739                  osd.51             up   0.50000  1.00000
>>>  66    hdd    7.27739                  osd.66             up   0.50000  1.00000
>>>  67    hdd    7.27739                  osd.67             up   0.50000  1.00000
>>>  68    hdd    7.27739                  osd.68             up   0.50000  1.00000
>>>  69    hdd    7.27739                  osd.69             up   0.50000  1.00000
>>>  70    hdd    7.27739                  osd.70             up   0.50000  1.00000
>>>  71    hdd    7.27739                  osd.71             up   0.50000  1.00000
>>>  72    hdd    7.27739                  osd.72             up   0.50000  1.00000
>>>  73    hdd    7.27739                  osd.73             up   0.50000  1.00000
>>>  74    hdd    7.27739                  osd.74             up   0.50000  1.00000
>>>  75    hdd    7.27739                  osd.75             up   0.50000  1.00000
>>>  -4         101.88345              host cephdevel-76204
>>>  48    hdd    7.27739                  osd.48             up   0.50000  1.00000
>>>  52    hdd    7.27739                  osd.52             up   0.50000  1.00000
>>>  53    hdd    7.27739                  osd.53             up   0.50000  1.00000
>>>  54    hdd    7.27739                  osd.54             up   0.50000  1.00000
>>>  56    hdd    7.27739                  osd.56             up   0.50000  1.00000
>>>  57    hdd    7.27739                  osd.57             up   0.50000  1.00000
>>>  58    hdd    7.27739                  osd.58             up   0.50000  1.00000
>>>  59    hdd    7.27739                  osd.59             up   0.50000  1.00000
>>>  60    hdd    7.27739                  osd.60             up   0.50000  1.00000
>>>  61    hdd    7.27739                  osd.61             up   0.50000  1.00000
>>>  62    hdd    7.27739                  osd.62             up   0.50000  1.00000
>>>  63    hdd    7.27739                  osd.63             up   0.50000  1.00000
>>>  64    hdd    7.27739                  osd.64             up   0.50000  1.00000
>>>  65    hdd    7.27739                  osd.65             up   0.50000  1.00000
>>> -23         203.16110          row row-02
>>> -13          87.32867              host cephdevel-76213
>>>  27    hdd    7.27739                  osd.27             up   1.00000  1.00000
>>>  28    hdd    7.27739                  osd.28             up   1.00000  1.00000
>>>  29    hdd    7.27739                  osd.29             up   1.00000  1.00000
>>>  30    hdd    7.27739                  osd.30             up   1.00000  1.00000
>>>  31    hdd    7.27739                  osd.31             up   1.00000  1.00000
>>>  32    hdd    7.27739                  osd.32             up   1.00000  1.00000
>>>  33    hdd    7.27739                  osd.33             up   1.00000  1.00000
>>>  34    hdd    7.27739                  osd.34             up   1.00000  1.00000
>>>  35    hdd    7.27739                  osd.35             up   1.00000  1.00000
>>>  36    hdd    7.27739                  osd.36             up   1.00000  1.00000
>>>  37    hdd    7.27739                  osd.37             up   1.00000  1.00000
>>>  38    hdd    7.27739                  osd.38             up   1.00000  1.00000
>>> ......
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
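
As for the custom device class: a minimal sketch of what it could look like, assuming a hypothetical class name "ups" for the UPS/Diesel-backed OSDs, a hypothetical rule name "ha-ups-rule" and pool name "critical-pool"; the OSD ids are picked from the tree above purely as examples:

    # Move the chosen OSDs from class "hdd" to the custom class "ups";
    # an OSD's existing class must be removed before a new one can be set
    ceph osd crush rm-device-class osd.1 osd.49 osd.48
    ceph osd crush set-device-class ups osd.1 osd.49 osd.48

    # Replicated rule that picks one host per replica among "ups"-class OSDs,
    # starting from the default root rather than row-01
    ceph osd crush rule create-replicated ha-ups-rule default host ups

    # Point the critical pool(s) at the new rule
    ceph osd pool set critical-pool crush_rule ha-ups-rule

The OSDs left in the hdd class on those hosts would keep serving the normal pools at full weight, so the 0.5 reweights in row-01 would no longer be needed.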