Re: Upgrade CEPH 14.x -> 16.X + switch from Filestore to Bluestore = strange behavior

> From Hammer to the Pacific version is already a long journey.

I've seen some odd mon behavior from clusters initially installed on Hammer, but that surfaced in Jewel and I doubt it is in play here.

>> 
>> 
>> I can't say that we upgraded a lot of clusters from N to P, but those
>> that we upgraded didn't show any of these symptoms you describe. But
>> we always did the Filestore to Bluestore conversion before the actual
>> upgrade. In SUSE Enterprise Storage (which we also supported at that
>> time) this was pointed out as a requirement. I just checked the ceph
>> docs, I can't find such a statement (yet).

I *think* Filestore OSDs still work, but I've been peppering the docs with admonitions to convert for several releases now.  I would expect them to have worked with Pacific.  Did you update from the last Nautilus to the last Pacific?


>>> All our pools are of the following type: replicated size 3 min_size
>>> 1 crush_rule 0 (or 1).
>> 
>> I would recommend increasing min_size to 2; otherwise you let Ceph
>> lose two of three replicas before pausing IO, which can make recovery
>> difficult. Reducing min_size to 1 should only be a temporary solution
>> to avoid stalling client IO during recovery.

Absolutely.  
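
E.g., per pool, with the pool name as a placeholder:

    # <poolname> is a placeholder for each replicated pool
    ceph osd pool set <poolname> min_size 2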

>>> . First, I need to explain the context of our CEPH

Ceph, please.  Not CEPH.  ;)

>>> In terms of hardware, we have three monitors (cephmon) and 30
>>> storage servers (cephstore) spread across three datacenters. These
>>> servers are connected to the network via an aggregate (LACP) of two
>>> 10 Gbps fibre connections, through which two VLANs pass, one for the
>>> CEPH frontend network and one for the CEPH backend network. In doing
>>> so, we have always given ourselves the option of separating the
>>> frontend and backend into dedicated aggregates if the bandwidth
>>> becomes insufficient.

Nice planning.  For scratch clusters today I usually suggest a single 25-100 GE bonded public network.  The dynamics when you started were different.

>>> 
>>> Each of the storage servers comes with HDDs whose size varies
>>> depending on the server generation

Good job keeping each DC's aggregate CRUSH weight nearly identical.  

>>>  as well as SSDs whose size is more consistent but still varies (depending on price).

Enterprise SAS/SATA SSDs are starting to disappear from the market.  The next time you buy servers, consider NVMe-only chassis.  With careful procurement and by not having to pay for an HBA, they can be more affordable than you might think, and conserve precious RUs.  And larger SSDs get you more capacity per chassis, so you can save on chassis, switch ports, etc.

>>> 
>>>    100 hdd 10.90999 TB
>>>     48 hdd 11.00000 TB
>>>     48 hdd 14.54999 TB
>>>     24 hdd 15.00000 TB
>>>      9 hdd 5.45999 TB
>>>    108 hdd 9.09999 TB

Have you tried disabling their volatile write cache?
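
In case it helps, a rough sketch of how one might do that; device names are placeholders, and making it persist across reboots (udev rule, etc.) is left as an exercise:

    # SATA drives (placeholder device)
    hdparm -W 0 /dev/sdX
    # SAS drives (placeholder device)
    sdparm --clear WCE --save /dev/sdX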

>>> 
>>>     84 ssd 0.89400 TB
>>>    198 ssd 0.89424 TB
>>>     18 ssd 0.93599 TB
>>>     32 ssd 1.45999 TB
>>>     16 ssd 1.50000 TB
>>>     48 ssd 1.75000 TB
>>>     24 ssd 1.79999 TB

Just for others reading, very small SSDs can end up using a surprising fraction of their capacity for DB/WAL/other overhead, resulting in less usable capacity than one expects.
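
A quick way to eyeball that, assuming a Nautilus-or-later ceph osd df column layout:

    ceph osd df
    # the OMAP and META columns show per-OSD metadata consumption,
    # which on very small OSDs can be a noticeable slice of SIZE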


>>> 
>>> Regarding the CRUSHmap, and since at the time the CEPH cluster was
>>> launched, classes (ssd/hdd) did not exist and we wanted to be able
>>> to create pools on disk storage or flash storage, we created two
>>> trees:

Indeed, that was a common strategy.  Device classes are way more convenient and work better with the upmap balancer.  At some point you might consider converting with the crushtool reclassify feature; see: https://docs.ceph.com/en/pacific/rados/operations/crush-map-edits/
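
Very roughly, with the root/bucket names below as placeholders since I don't know your exact tree layout (the docs above have the canonical example for parallel hierarchies):

    ceph osd getcrushmap -o original.bin
    # <hdd-root> and <ssd-bucket-pattern> are placeholders
    crushtool -i original.bin --reclassify \
        --reclassify-root <hdd-root> hdd \
        --reclassify-bucket <ssd-bucket-pattern> ssd <hdd-root> \
        -o adjusted.bin
    # reclassify is meant to avoid data movement; verify before injecting
    crushtool -i original.bin --compare adjusted.bin
    ceph osd setcrushmap -i adjusted.bin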

>>> 
>>> Everything was working fine until last August, when we scheduled an
>>> update from CEPH 14.x (Nautilus) to 16.X (Pacific) (and an update
>>> from Debian 10 to Debian 11, which was not a problem).

I'm not experienced with Debian as such, but Filestore OSDs, being XFS filesystems, are susceptible to XFS flern, which involves both the kernel and the filesystem utilities.  I've seen [non-Ceph] issues with a large kernel version jump.  In those situations a one-time xfs_repair addressed the problem.  I don't recall the details, but ISTR that at a certain point XFS started paying closer attention to a certain filesystem structure than it used to, so older filesystems that previously ran just fine suddenly didn't.  The one-time repair aligned them with the newer expectations and the issues did not recur.  I don't know for sure that this is what you experienced, of course.
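
If you still have Filestore OSDs in the third DC and want to rule that out, the one-time repair was along these lines (sketch; OSD id and device are placeholders, and the OSD has to be stopped and its filesystem unmounted first):

    systemctl stop ceph-osd@<id>
    umount /var/lib/ceph/osd/ceph-<id>
    xfs_repair /dev/<filestore-data-partition>
    # remount / reactivate the OSD however your hosts normally do it
    # (udev, ceph-volume simple activate, ...), then:
    systemctl start ceph-osd@<id>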

>>> 
>>> * First problem:
>>> We were forced to switch from FileStore to BlueStore in an emergency
>>> and unscheduled manner because after upgrading the CEPH packages on
>>> the first storage server, the FileStore OSDs would no longer start.

Did you capture logs from your init system and representative OSDs?  They would help understand what happened.
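
E.g., assuming systemd and default log locations:

    # systemd journal for a given OSD (placeholder id)
    journalctl -u ceph-osd@<id>
    # and the OSD's own log file, if file logging is enabled
    less /var/log/ceph/ceph-osd.<id>.log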

>>> We did not have this problem on our small test cluster, which
>>> obviously did not have the ‘same upgrade life’ as the production
>>> cluster. We therefore took the opportunity, DC by DC (since this is
>>> our ‘failure domain’), not only to update CEPH but also to recreate
>>> the OSDs in BlueStore.

It's good to have switched, though of course doing so in a planned fashion is always less stressful.
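
For the remaining Filestore OSDs, the per-OSD conversion cycle from the BlueStore migration docs looks roughly like this, reusing the OSD id so the CRUSH topology doesn't churn (sketch; id and device are placeholders):

    ID=<osd-id>
    DEV=<device>
    ceph osd out $ID
    while ! ceph osd safe-to-destroy osd.$ID ; do sleep 60 ; done
    systemctl stop ceph-osd@$ID
    umount /var/lib/ceph/osd/ceph-$ID
    ceph-volume lvm zap $DEV
    ceph osd destroy $ID --yes-i-really-mean-it
    ceph-volume lvm create --bluestore --data $DEV --osd-id $ID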



>>> Here we see that our SSD-type OSDs fill up at a rate of ~ 2% every 3
>>> hours (the phenomenon is also observed on HDD-type OSDs, but as we
>>> have a large capacity, it is less critical).
>>> Manual (re)weight changes only provided a temporary solution and,
>>> despite all our attempts (OSD restart, etc.), we reached the
>>> critical full_ratio threshold, which is 0.97 for us.

Does your CRUSH map set optimal tunables? Or an older profile? 

# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 1,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
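
If the profile shown is older than the above, it can be updated (keeping in mind that this will move data):

    ceph osd crush tunables optimal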

Older tunables can result in unequal data distribution.  Similarly, are all of your CRUSH buckets straw2?

# ceph osd crush dump | fgrep alg\" | sort | uniq -c
     42             "alg": "straw2",

If not:

ceph osd crush set-all-straw-buckets-to-straw2 

That should help with uniformity, though note that it will cause data to move, and if you're using legacy OSD reweights, those values would need to be readjusted.

What does

	ceph balancer status

show?  If you have legacy reweights set to < 1.00 and pg-upmap balancing at the same time, you'll end up with outliers.  When using pg-upmap balancing, one really has to reset all the legacy reweights.  If the cluster is fairly full that may need to be done incrementally to minimize making outliers worse.
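
A rough way to walk them back, one OSD at a time (placeholder id; on a full cluster, step gradually and let backfill settle between changes):

    # step a legacy override back toward 1.0 gradually, e.g.:
    ceph osd reweight <osd-id> 0.95
    # ...wait for backfill to finish, repeat with higher values, finally:
    ceph osd reweight <osd-id> 1.0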

Similarly, when using the upmap balancer, the CERN upmap-remapped tool can help avoid surprise full OSDs:

https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph




>>> I'll leave you to imagine the effect on the virtual machines and the
>>> services provided to our users.
>>> We also had very strong growth in the size of the MONitor databases
>>> (~3 GB -> 100 GB) (compaction did not really help).

Compaction can't happen until backfill/recovery is complete.  At one point there was a bug where it also required that the numbers of total, up, and in OSDs be equal, i.e. all OSDs up and in.
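
Once the cluster is healthy again, the mon stores can be compacted on demand or at startup, e.g.:

    # on-demand compaction of one mon's store (placeholder id)
    ceph tell mon.<id> compact
    # or compact whenever a mon starts
    ceph config set mon mon_compact_on_start true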

>>> 
>>> 
>>> After this second total recovery of the CEPH cluster and the restart
>>> of the virtualisation environment, we still have the third DC (10
>>> cephstore) to update from CEPH 14 to 16, and our ‘SSD’ OSDs are
>>> filling up again until the automatic activation of
>>> scrubs/deep-scrubs at 7 p.m.
>>> Since then, progress has stopped, the use of the various OSDs is
>>> stable and more or less evenly distributed (via active upmap
>>> balancer).

Check your legacy reweights:

# ceph osd tree | head
ID   CLASS  WEIGHT      TYPE NAME                   STATUS  REWEIGHT  PRI-AFF
-37                  0  root staging
 -1         5577.10254  root default
-34          465.31158      host cephab92
217    hdd    18.53969          osd.217                 up   1.00000  1.00000

If you have any reweights that aren't 1.00000, that could be a factor.  When using the upmap balancer, they all really need to be 1.00000.
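
A quick way to spot stragglers, assuming the usual ceph osd df column order (REWEIGHT is the fourth column):

    # print OSD rows whose REWEIGHT isn't 1.00000
    ceph osd df | awk '$1 ~ /^[0-9]+$/ && $4 != 1.00000'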


>>> 
>>> 
>>> *** Questions / Assumptions / Opinions ***
>>> 
>>> Have you ever encountered a similar phenomenon? We agree that having
>>> different versions of OSDs coexisting is not a good solution

Filestore vs BlueStore is below RADOS, so it's not so bad.  BlueStore OSDs are much less prone to memory ballooning but there's no special risk in running both that I've ever seen.


>>> Our current hypothesis, following the restoration of stability and
>>> the fact that we have never had this problem with OSDs in FileStore,
>>> is that there is some kind of ‘housekeeping’ of BlueStore OSDs via
>>> scrubs. Does that make sense? Any clues ? ideas ?

Did you see any messages about legacy / per-pool stats? At a certain point, I don't recall when, a nifty new feature was added that required that BlueStore OSDs get a one-time repair, which could be done at startup, but which could take a while especially on spinners.
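
IIRC the relevant health warnings were BLUESTORE_LEGACY_STATFS / BLUESTORE_NO_PER_POOL_OMAP, and the one-time repair could be done either offline or at mount time; roughly (placeholder id):

    # with the OSD stopped, an offline repair:
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>
    # or let it happen automatically the next time OSDs start
    # (can take a while, especially on spinners):
    ceph config set osd bluestore_fsck_quick_fix_on_mount true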


>>> 
>>> I also read on the Internet (somewhere...) that in any case, when
>>> the cluster is not ‘healthy’, scrubs are suspended by default.
>>> Indeed, in our case:
>>> 
>>> root@cephstore16:~# ceph daemon osd.11636 config show | grep
>>> ‘osd_scrub_during_recovery’
>>>    ‘osd_scrub_during_recovery’: ‘false’,
>>> 
>>> This could explain why, during the three days of recovery, no
>>> cleaning is performed and if bluestore does not perform maintenance,
>>> it fills up?

I don't *think* scrubs are related to such cleanup, though when addressing large omaps, a scrub can be required for them to stop being *reported*.

>>> 
>>> (It would be possible to temporarily change this behaviour via: ceph
>>> tell “osd.*” injectargs --osd-scrub-during-recovery=1 (to be tested).)

Central config mostly means we don't have to inject any more.  Much more convenient.
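
I.e., instead of injectargs, something like:

    # persistent, applies to all OSDs
    ceph config set osd osd_scrub_during_recovery true
    # and check what a given daemon actually sees (placeholder id)
    ceph config show osd.<id> | grep osd_scrub_during_recovery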

>>> 
>>> Do you have any suggestions for things to check? Although we have
>>> experience with FileStore, we have not yet had time to gain
>>> experience with BlueStore.
>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



