On Sunday, Apr 13, 2025 at 10:08 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
I’m now seeing this:
cluster:
  id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
  health: HEALTH_WARN
          Failed to apply 1 service(s): osd.cost_capacity
I’m assuming this is due to the fact that I’ve only upgraded the mgr daemons, but I wanted to double check before proceeding with the rest of the components.
Thanks
-jeremy
On Sunday, Apr 13, 2025 at 12:59 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
Updating the mgrs to 18.2.5 seemed to work just fine. I will go for the remaining services after the weekend. Thanks.
-jeremy
On Thursday, Apr 10, 2025 at 6:37 AM, Eugen Block <eblock@xxxxxx> wrote:
Glad I could help! I'm also waiting for 18.2.5 to upgrade our own
cluster from Pacific after getting rid of our cache tier. :-D
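For the "Failed to apply 1 service(s): osd.cost_capacity" warning mentioned at the top of this thread, a quick way to see the underlying error might be something like the following (a sketch; the service name osd.cost_capacity is taken from the warning above, adjust to your cluster):
# Show the full text behind the HEALTH_WARN
$ ceph health detail
# Dump the OSD service specs, including osd.cost_capacity
$ ceph orch ls osd --export
# Recent cephadm log entries usually contain the reason a spec failed to apply
$ ceph log last cephadm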
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
This seems to have worked to get the orch back up and put me back to
16.2.15. Thank you. Debating on waiting for 18.2.5 to move forward.
-jeremy
On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <eblock@xxxxxx> wrote:
Still no, just edit the unit.run file for the MGRs to use a different
image. See Frédéric's instructions (now that I'm re-reading it,
there's a little mistake with dots and hyphens):
# Backup the unit.run file
$ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}
# Change the container image's signature. You can get the signature of the
# version you want to reach from https://quay.io/repository/ceph/ceph?tab=tags.
# It's in the URL of a version.
$ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run
# Restart the container (systemctl daemon-reload not needed)
$ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service
# Run this command a few times and it should show the new version
$ ceph orch ps --refresh --hostname ceph01 | grep mgr
To get the image signature, you can also look into the other unit.run
files; a version tag would also work.
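For example, a quick way to list the image digests already in use on a host could look like this (a sketch, assuming the standard cephadm paths from the commands above):
# Print the unique image digests referenced by the daemons on this host
$ grep -ho 'ceph@sha256:[0-9a-f]*' /var/lib/ceph/$(ceph fsid)/*/unit.run | sort -u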
It depends on how often you need the orchestrator to maintain the
cluster. If you have the time, you could wait a bit longer for other
responses. If you need the orchestrator in the meantime, you can roll
back the MGRs.
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Thank you. The only thing I’m unclear on is the rollback to Pacific. Are you referring to
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon ?
Thank you. I appreciate all the help. Should I wait for Adam to
comment? At the moment, the cluster is functioning well enough to
maintain running VMs, so if it’s wise to wait, I can do that.
-jeremy
On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block <eblock@xxxxxx> wrote:
I haven't tried it this way yet, and I had hoped that Adam would chime
in, but my approach would be to remove this key (it's not present when
no upgrade is in progress):
ceph config-key rm mgr/cephadm/upgrade_state
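If you want to keep a copy of the key before deleting it, something like this should work (a sketch; the backup file name is just an example):
# Save the current upgrade state to a file, then remove the key
$ ceph config-key get mgr/cephadm/upgrade_state > upgrade_state.json.bak
$ ceph config-key rm mgr/cephadm/upgrade_state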
Then roll back the two newer MGRs to Pacific as described before. If
they come up healthy, first test whether the orchestrator works properly.
For example, remove a node-exporter or crash daemon or anything else
non-critical and let it redeploy.
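A possible way to exercise that (a sketch; the daemon name node-exporter.ceph01 is only an example, check 'ceph orch ps' for the real names in your cluster):
# Remove a non-critical daemon; cephadm should redeploy it on its own
$ ceph orch daemon rm node-exporter.ceph01 --force
# Watch for it to come back
$ ceph orch ps --daemon-type node-exporter --refresh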
If that works, try a staggered upgrade, starting with the MGRs only:
ceph orch upgrade start --image <image-name> --daemon-types mgr
Since there's no need to go to Quincy, I suggest upgrading to Reef
18.2.4 (or waiting until 18.2.5 is released, which should be very
soon), so set the respective <image-name> in the above command.
If all three MGRs successfully upgrade, you can continue with the
MONs, or with the entire rest.
In production clusters, I usually do staggered upgrades, e.g. I limit
the number of OSD daemons first just to see if they come up healthy,
then I let it upgrade all other OSDs automatically.
https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
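As a concrete sketch of such a staggered sequence (assuming Reef 18.2.4 as the target image; adjust the image, limits and order to your cluster):
# MGRs only
$ ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types mgr
$ ceph orch upgrade status
# Then the MONs
$ ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types mon
# A limited number of OSDs first, to see if they come up healthy
$ ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 3
# Finally, everything that is still on the old version
$ ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4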
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Snipped some of the irrelevant logs to keep message size down.
ceph config-key get mgr/cephadm/upgrade_state
{"target_name": "quay.io/ceph/ceph:v17.2.0", "progress_id":
"e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":
"e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
"target_digests":
null,https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/"remaining_count":null}the MGRs to
What should I do next?
Thank you!
-jeremy
On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <eblock@xxxxxx> wrote:
Can you check if you have this config-key?
ceph config-key get mgr/cephadm/upgrade_state
If you reset the MGRs, it might be necessary to clear this key;
otherwise you might end up in some inconsistency. Just to be sure.
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Thanks. I’m trying to be extra careful since this cluster is
actually in use. I’ll wait for your feedback.
-jeremy
On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <eblock@xxxxxx> wrote:
No, that's not necessary, just edit the unit.run file for the MGRs to use a different image. See Frédéric's instructions:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
But I'm not entirely sure if you need to clear some config-keys first in order to reset the upgrade state. If I have time, I'll try to check tomorrow, or on Monday.
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Would I follow this process to downgrade?
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
Thank you
On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
ceph -s claims things are healthy:
ceph -s
cluster:
id: 95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
health: HEALTH_OK
services:
mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph,
cn02.ceph.xyz.corp.ggixgj
mds: 1/1 daemons up, 2 standby
osd: 15 osds: 15 up (since 19h), 15 in (since 14M)
data:
volumes: 1/1 healthy
pools: 6 pools, 610 pgs
objects: 284.59k objects, 1.1 TiB
usage: 3.3 TiB used, 106 TiB / 109 TiB avail
pgs: 610 active+clean
io:
client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
—
How do I downgrade if the orch is down?
Thank you
-jeremy
On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <eblock@xxxxxx> wrote:
It would help if you only pasted the relevant parts. Anyway, these two sections stand out:
---snip---
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade = CephadmUpgrade(self)
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: return cls(**c)
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:48 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.client = SnapSchedClient(self)
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: with self.get_schedule_db(fs_name) as conn_mgr:
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: db.executescript(dump)
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:49 cn03.ceph.xyz.corp
ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
---snip---
Your cluster seems to be in an error state (ceph -s) because of an unknown PG. It's recommended to have a healthy cluster before attempting an upgrade. It's possible that these errors come from the not-yet-upgraded MGR, I'm not sure.
Since the upgrade was only successful for two MGRs, I am thinking about downgrading both MGRs back to 16.2.15, then retry an upgrade to a newer version, either 17.2.8 or 18.2.4. I haven't checked the snap_schedule error yet, though. Maybe someone else knows that already.