On 13 Sept 2025 at 14:17, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
(..)
I can't say that we upgraded a lot of clusters from N to P, but those that we did upgrade didn't show any of the symptoms you describe. However, we always did the FileStore to BlueStore conversion before the actual upgrade. In SUSE Enterprise Storage (which we also supported at that time) this was pointed out as a requirement. I just checked the Ceph docs; I can't find such a statement (yet).
I *think* Filestore OSDs still work, but I've been peppering the docs with admonitions to convert for several releases now. I would expect them to have worked with Pacific. Did you update from the last Nautilus to the last Pacific?
Yes. And yes, it was supposed to work. And in fact it does work on our test cluster, but that cluster doesn't have the same history as the prod one; it's only there to test some CRUSH rules, … before applying them to the prod cluster.
In terms of hardware, we have three monitors (cephmon) and 30 storage servers (cephstore) spread across three datacenters. These servers are connected to the network via an aggregate (LACP) of two 10 Gbps fibre connections, through which two VLANs pass, one for the CEPH frontend network and one for the CEPH backend network. In doing so, we have always given ourselves the option of separating the frontend and backend into dedicated aggregates if the bandwidth becomes insufficient.
Nice planning. For scratch clusters today I usually suggest a single 25-100 GE bonded public network. The dynamics when you started were different.
Yes. But from a port availability point of view, I think the next step will probably be splitting the front and back networks.
100 hdd 10.90999 TB
 48 hdd 11.00000 TB
 48 hdd 14.54999 TB
 24 hdd 15.00000 TB
  9 hdd  5.45999 TB
108 hdd  9.09999 TB
Have you tried disabling their volatile write cache?
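In case it helps, something along these lines is what I mean (generic commands, not specific to your hardware; /dev/sdX is a placeholder, and whether disabling the cache actually helps depends on the drive model):

hdparm -W /dev/sdX                     # show the current write-cache state (SATA)
hdparm -W 0 /dev/sdX                   # disable the volatile write cache (SATA)
sdparm --set=WCE=0 --save /dev/sdX     # equivalent for SAS drives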
Yes they are.
 84 ssd 0.89400 TB
198 ssd 0.89424 TB
 18 ssd 0.93599 TB
 32 ssd 1.45999 TB
 16 ssd 1.50000 TB
 48 ssd 1.75000 TB
 24 ssd 1.79999 TB
Just for others reading, very small SSDs can end up using a surprising fraction of their capacity for DB/WAL/other overhead, resulting in less usable capacity than one expects.
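A quick, generic way to see how much each OSD spends on that overhead (nothing specific to this cluster; osd.0 is just an example ID) is the OMAP and META columns of ceph osd df, plus the BlueFS perf counters on the OSD's host:

ceph osd df tree                       # OMAP and META columns show per-OSD metadata overhead
ceph daemon osd.0 perf dump bluefs     # BlueFS/DB space details; run on the host of that OSD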
DB is always offloaded to RAID1 NVMe storage.
(..)
* First problem: we were forced to switch from FileStore to BlueStore in an emergency, unscheduled manner, because after upgrading the CEPH packages on the first storage server the FileStore OSDs would no longer start.
Did you capture logs from your init system and representative OSDs? They would help understand what happened.
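For next time, roughly this is what I'd grab (generic systemd/package-install paths; the OSD ID 123 and the date are placeholders):

journalctl -u ceph-osd@123 --since "2025-09-10"   # init-system view of a failing OSD
less /var/log/ceph/ceph-osd.123.log               # the OSD's own log, if file logging is enabled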
Unfortunately not. We were not expecting the problems when we started the upgrade… :-(
(..)
Here we see that our SSD-type OSDs fill up at a rate of ~ 2% every 3 hours (the phenomenon is also observed on HDD-type OSDs, but as we have a large capacity, it is less critical). Manual (re)weight changes only provided a temporary solution and, despite all our attempts (OSD restart, etc.), we reached the critical full_ratio threshold, which is 0.97 for us.
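(For reference, the standard commands to watch this sort of thing, nothing specific to our setup:

ceph osd df                    # per-OSD utilisation, variance and PG counts
ceph osd dump | grep ratio     # current nearfull / backfillfull / full ratios
ceph health detail             # lists which OSDs are nearfull / backfillfull / full
)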
Does your CRUSH map set optimal tunables? Or an older profile?
# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 1,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
Seems the same as yours:
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
Older tunables can result in unequal data distribution. Similarly, are all of your CRUSH buckets straw2?
# ceph osd crush dump | fgrep alg\" | sort | uniq -c
     42 "alg": "straw2",
Yup
306 "alg": "straw2",
If not
ceph osd crush set-all-straw-buckets-to-straw2
That should help with uniformity, though note that it will cause data to move, and if you're using legacy OSD reweights those values would need to be readjusted.
What does
ceph balancer status
show? If you have legacy reweights set to < 1.00 and pg-upmap balancing at the same time, you'll end up with outliers. When using pg-upmap balancing, one really has to reset all the legacy reweights. If the cluster is fairly full, that may need to be done incrementally to avoid making the outliers worse.
{ "active": true, "last_optimize_duration": "0:00:01.032868", "last_optimize_started": "Sun Sep 14 11:58:51 2025", "mode": "upmap", "no_optimization_needed": true, "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", "plans": [] }
I'll leave you to imagine the effect on the virtual machines and the services provided to our users. We also had very strong growth in the size of the MONitor databases (~3 GB -> 100 GB) (compaction did not really help).
Compaction can't happen until backfill/recovery is complete. At one point there was a bug when it also required that the numbers of total, up, and in OSDs were equal, i.e. all OSDs were up and in.
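Once things are back to HEALTH_OK you can also poke it by hand; a minimal sketch (the mon name and path are placeholders for your environment):

ceph tell mon.cephmon1 compact          # trigger an on-demand compaction of that mon's store
du -sh /var/lib/ceph/mon/*/store.db     # compare the on-disk size before and after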
Good to know
After this second total recovery of the CEPH cluster and the restart of the virtualisation environment, we still had the third DC (10 cephstore) to update from CEPH 14 to 16, and our 'SSD' OSDs were filling up again until scrubs/deep-scrubs were automatically activated at 7 p.m. Since then, the growth has stopped; the utilisation of the various OSDs is stable and more or less evenly distributed (via the active upmap balancer).
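(The 7 p.m. activation presumably corresponds to our scrub time window; for reference it can be checked with the standard config options, e.g.:

ceph config get osd osd_scrub_begin_hour
ceph config get osd osd_scrub_end_hour
)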
Check your legacy reweights:
# ceph osd tree | head
ID   CLASS  WEIGHT      TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-37              0      root staging
 -1     5577.10254      root default
-34      465.31158      host cephab92
217  hdd     18.53969       osd.217      up      1.00000   1.00000
If you have any reweights that aren't 1.0000, that could be a factor. When using the upmap balancer, they all really need to be 1.0000.
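A rough sketch of how to find and reset them, in case it's useful (osd 217 and the step values are only placeholders; on a fairly full cluster do it one OSD at a time and let backfill settle in between):

ceph osd tree -f json | jq -r '.nodes[] | select(.type=="osd" and .reweight < 1) | "\(.id) \(.reweight)"'
ceph osd reweight 217 0.95    # step the legacy reweight back up gradually
ceph osd reweight 217 1.0     # ... until it is 1.0 again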
Make sense
*** Questions / Assumptions / Opinions ***
Have you ever encountered a similar phenomenon? We agree that having OSDs of different versions coexisting is not a good situation.
FileStore vs BlueStore is below RADOS, so it's not so bad. BlueStore OSDs are much less prone to memory ballooning, and there's no special risk in running both that I've ever seen.
OK
Our current hypothesis, following the return to stability and the fact that we never had this problem with FileStore OSDs, is that there is some kind of 'housekeeping' of BlueStore OSDs via scrubs. Does that make sense? Any clues? Ideas?
Did you see any messages about legacy / per-pool stats? At a certain point, I don't recall when, a nifty new feature was added that required that BlueStore OSDs get a one-time repair, which could be done at startup, but which could take a while especially on spinners.
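If it's useful, the knobs I have in mind (as I understand them, so double-check the Pacific release notes before relying on this) are the quick-fix-on-mount option and the offline repair; the OSD path/ID below are placeholders:

ceph config set osd bluestore_fsck_quick_fix_on_mount true      # do the per-pool stats repair at OSD start
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123    # offline repair, with the OSD stopped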
No messages, but the first start can take a really long time… I guess that's what you describe, the one-time repair/check?
Thx for your answers.
Olivier