Snipped some of the irrelevant logs to keep message size down.
ceph config-key get mgr/cephadm/upgrade_state
{"target_name": "quay.io/ceph/ceph:v17.2.0", "progress_id": "e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id": "e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9", "target_digests": ["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services": null, "total_count": null, "remaining_count": null}
What should I do next?
Thank you! -jeremy
On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <eblock@xxxxxx> wrote:

Can you check if you have this config-key?

ceph config-key get mgr/cephadm/upgrade_state

If you reset the MGRs, it might be necessary to clear this key, otherwise you might end up in some inconsistency. Just to be sure.

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

Thanks. I'm trying to be extra careful since this cluster is actually in use. I'll wait for your feedback.

-jeremy
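A minimal sketch of what clearing that key could look like, assuming the stale upgrade_state entry (with the newer 'daemon_types'/'hosts'/'services' fields) is indeed what keeps the 16.2.15 cephadm module from loading; the backup file path below is only an example:

# keep a copy of the current value before touching it (backup path is an example)
ceph config-key get mgr/cephadm/upgrade_state > /root/upgrade_state.json.bak

# remove the stale key so the older cephadm module can be constructed again
ceph config-key rm mgr/cephadm/upgrade_state

# fail over the active MGR so the module gets reloaded
ceph mgr fail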
On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <eblock@xxxxxx> wrote:

No, that's not necessary, just edit the unit.run file for the MGRs to use a different image. See Frédéric's instructions:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

But I'm not entirely sure if you need to clear some config-keys first in order to reset the upgrade state. If I have time, I'll try to check tomorrow, or on Monday.

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:
Would I follow this process to downgrade?
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
Thank you
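Regarding the unit.run edit mentioned above, a rough sketch of what that might look like on one of the already-upgraded MGR hosts, assuming the usual cephadm layout under /var/lib/ceph/<fsid>/. The daemon directory and unit names are taken from this cluster's output but should be verified on the host (e.g. with cephadm ls) before changing anything:

# on the host of an already-upgraded MGR, e.g. cn03:
cd /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/mgr.cn03.negzvb
cp unit.run unit.run.bak

# edit unit.run and change the container image reference (the v17.2.0 tag or its
# @sha256 digest) back to the previous image, e.g. quay.io/ceph/ceph:v16.2.15,
# then restart that daemon's systemd unit:
systemctl restart ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1@mgr.cn03.negzvb.service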
On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:

ceph -s claims things are healthy:

ceph -s
  cluster:
    id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
    mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
    mds: 1/1 daemons up, 2 standby
    osd: 15 osds: 15 up (since 19h), 15 in (since 14M)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 610 pgs
    objects: 284.59k objects, 1.1 TiB
    usage:   3.3 TiB used, 106 TiB / 109 TiB avail
    pgs:     610 active+clean

  io:
    client:   255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr

How do I downgrade if the orch is down?

Thank you
-jeremy
On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <eblock@xxxxxx> wrote:
It would help if you only pasted the relevant parts. Anyway, these two sections stand out:

---snip---
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade = CephadmUpgrade(self)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: return cls(**c)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.client = SnapSchedClient(self)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: with self.get_schedule_db(fs_name) as conn_mgr:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: db.executescript(dump)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
---snip---

Your cluster seems to be in an error state (ceph -s) because of an unknown PG. It's recommended to have a healthy cluster before attempting an upgrade. It's possible that these errors come from the MGR that hasn't been upgraded yet, I'm not sure.

Since the upgrade was only successful for two MGRs, I am thinking about downgrading both MGRs back to 16.2.15, then retrying an upgrade to a newer version, either 17.2.8 or 18.2.4.

I haven't checked the snap_schedule error yet, though. Maybe someone else knows that already.
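Once all MGRs run the same working version and the stale upgrade_state key has been removed, something along these lines should confirm the orchestrator is back and allow the upgrade to be retried against a newer point release (17.2.8 is only an example target here):

# check that the cephadm module loads and the orchestrator answers again
ceph mgr module ls | grep cephadm
ceph orch status

# then retry the upgrade to a newer release
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.8
ceph orch upgrade status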