On 13 Sept 2025 at 14:17, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
(..)
I can't say that we upgraded a lot of clusters from N to P, but those that we did upgrade didn't show any of the symptoms you describe. However, we always did the FileStore to BlueStore conversion before the actual upgrade. In SUSE Enterprise Storage (which we also supported at that time) this was pointed out as a requirement. I just checked the Ceph docs; I can't find such a statement (yet).
I *think* Filestore OSDs still work, but I've been peppering the docs with admonitions to convert for several releases now. I would expect them to have worked with Pacific. Did you update from the last Nautilus to the last Pacific?
Yes. And yes, it was supposed to work. And in fact it does work on our test cluster, but that cluster doesn't have the same history as the prod one; it's only there to test some CRUSH rules, … before applying them to the prod cluster.
In terms of hardware, we have three monitors (cephmon) and 30 storage servers (cephstore) spread across three datacenters. These servers are connected to the network via an aggregate (LACP) of two 10 Gbps fibre connections, through which two VLANs pass, one for the CEPH frontend network and one for the CEPH backend network. In doing so, we have always given ourselves the option of separating the frontend and backend into dedicated aggregates if the bandwidth becomes insufficient.
Nice planning. For scratch clusters today I usually suggest a single 25-100 GE bonded public network. The dynamics when you started were different.
Yes. But from a port availability point of view, I think the next step will probably be splitting the front and back networks.
100 hdd 10.90999 TB
 48 hdd 11.00000 TB
 48 hdd 14.54999 TB
 24 hdd 15.00000 TB
  9 hdd  5.45999 TB
108 hdd  9.09999 TB
Have you tried disabling their volatile write cache?
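In case it helps, something along these lines is what I mean (generic commands, not specific to your hardware; /dev/sdX is a placeholder, and whether disabling the cache actually helps depends on the drive model):

hdparm -W /dev/sdX                     # show the current write-cache state (SATA)
hdparm -W 0 /dev/sdX                   # disable the volatile write cache (SATA)
sdparm --set=WCE=0 --save /dev/sdX     # equivalent for SAS drives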
Yes they are.
 84 ssd 0.89400 TB
198 ssd 0.89424 TB
 18 ssd 0.93599 TB
 32 ssd 1.45999 TB
 16 ssd 1.50000 TB
 48 ssd 1.75000 TB
 24 ssd 1.79999 TB
Just for others reading, very small SSDs can end up using a surprising fraction of their capacity for DB/WAL/other overhead, resulting in less usable capacity than one expects.
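A quick, generic way to see how much each OSD spends on that overhead (nothing specific to this cluster; osd.0 is just an example ID) is the OMAP and META columns of ceph osd df, plus the BlueFS perf counters on the OSD's host:

ceph osd df tree                       # OMAP and META columns show per-OSD metadata overhead
ceph daemon osd.0 perf dump bluefs     # BlueFS/DB space details; run on the host of that OSD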
DB is always offloaded to RAID1 NVMe storage.
(..)
* First problem: we were forced to switch from FileStore to BlueStore in an emergency, unscheduled manner, because after upgrading the CEPH packages on the first storage server the FileStore OSDs would no longer start.
Did you capture logs from your init system and representative OSDs? They would help understand what happened.
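For next time, roughly this is what I'd grab (generic systemd/package-install paths; the OSD ID 123 and the date are placeholders):

journalctl -u ceph-osd@123 --since "2025-09-10"   # init-system view of a failing OSD
less /var/log/ceph/ceph-osd.123.log               # the OSD's own log, if file logging is enabled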
Unfortunately not. We were not expecting the problems when we started the upgrade… :-(
(..)
Here we see that our SSD-type OSDs fill up at a rate of ~ 2% every 3 hours (the phenomenon is also observed on HDD-type OSDs, but as we have a large capacity, it is less critical). Manual (re)weight changes only provided a temporary solution and, despite all our attempts (OSD restart, etc.), we reached the critical full_ratio threshold, which is 0.97 for us.
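(For reference, the standard commands to watch this sort of thing, nothing specific to our setup:

ceph osd df                    # per-OSD utilisation, variance and PG counts
ceph osd dump | grep ratio     # current nearfull / backfillfull / full ratios
ceph health detail             # lists which OSDs are nearfull / backfillfull / full
)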
Does your CRUSH map set optimal tunables? Or an older profile?
# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 1,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
Seems the same as yours:
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
Older tunables can result in unequal data distribution. Similarly, are all of your CRUSH buckets straw2?
# ceph osd crush dump | fgrep alg\" | sort | uniq -c
     42 "alg": "straw2",
Yup
306 "alg": "straw2",
If not
ceph osd crush set-all-straw-buckets-to-straw2
That should help with uniformity, though note that it will cause data to move, and if you're using legacy OSD reweights those values would need to be readjusted.
What does
ceph balancer status
show? If you have legacy reweights set to < 1.00 and pg-upmap balancing at the same time, you'll end up with outliers. When using pg-upmap balancing, one really has to reset all the legacy reweights. If the cluster is fairly full, that may need to be done incrementally to avoid making the outliers worse.
{ "active": true, "last_optimize_duration": "0:00:01.032868", "last_optimize_started": "Sun Sep 14 11:58:51 2025", "mode": "upmap", "no_optimization_needed": true, "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", "plans": [] }
I'll leave you to imagine the effect on the virtual machines and the services provided to our users. We also had very strong growth in the size of the MONitor databases (~3 GB -> 100 GB) (compaction did not really help).
Compaction can't happen until backfill/recovery is complete. At one point there was a bug when it also required that the numbers of total, up, and in OSDs were equal, i.e. all OSDs were up and in.
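Once things are back to HEALTH_OK you can also poke it by hand; a minimal sketch (the mon name and path are placeholders for your environment):

ceph tell mon.cephmon1 compact          # trigger an on-demand compaction of that mon's store
du -sh /var/lib/ceph/mon/*/store.db     # compare the on-disk size before and after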
Good to know
After this second total recovery of the CEPH cluster and the restart of the virtualisation environment, we still had the third DC (10 cephstore) to update from CEPH 14 to 16, and our 'SSD' OSDs were filling up again until scrubs/deep-scrubs were automatically activated at 7 p.m. Since then, the growth has stopped; the utilisation of the various OSDs is stable and more or less evenly distributed (via the active upmap balancer).
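(The 7 p.m. activation presumably corresponds to our scrub time window; for reference it can be checked with the standard config options, e.g.:

ceph config get osd osd_scrub_begin_hour
ceph config get osd osd_scrub_end_hour
)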
Check your legacy reweights:
# ceph osd tree | head
ID   CLASS  WEIGHT      TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-37              0      root staging
 -1     5577.10254      root default
-34      465.31158      host cephab92
217  hdd     18.53969       osd.217      up      1.00000   1.00000
If you have any reweights that aren't 1.0000, that could be a factor. When using the upmap balancer, they all really need to be 1.0000.
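A rough sketch of how to find and reset them, in case it's useful (osd 217 and the step values are only placeholders; on a fairly full cluster do it one OSD at a time and let backfill settle in between):

ceph osd tree -f json | jq -r '.nodes[] | select(.type=="osd" and .reweight < 1) | "\(.id) \(.reweight)"'
ceph osd reweight 217 0.95    # step the legacy reweight back up gradually
ceph osd reweight 217 1.0     # ... until it is 1.0 again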
Make sense
*** Questions / Assumptions / Opinions ***
Have you ever encountered a similar phenomenon? We agree that having OSDs of different versions coexisting is not a good situation.
FileStore vs BlueStore is below RADOS, so it's not so bad. BlueStore OSDs are much less prone to memory ballooning, and there's no special risk in running both that I've ever seen.
OK
Our current hypothesis, following the return to stability and the fact that we never had this problem with FileStore OSDs, is that there is some kind of 'housekeeping' of BlueStore OSDs via scrubs. Does that make sense? Any clues? Ideas?
Did you see any messages about legacy / per-pool stats? At a certain point, I don't recall when, a nifty new feature was added that required that BlueStore OSDs get a one-time repair, which could be done at startup, but which could take a while especially on spinners.
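If it's useful, the knobs I have in mind (as I understand them, so double-check the Pacific release notes before relying on this) are the quick-fix-on-mount option and the offline repair; the OSD path/ID below are placeholders:

ceph config set osd bluestore_fsck_quick_fix_on_mount true      # do the per-pool stats repair at OSD start
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123    # offline repair, with the OSD stopped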
No messages, but the first start can take a really long time… I guess that's what you describe, the one-time repair/check?
Thx for your answers.
Olivier