Thank you. The only thing I’m unclear on is the rollback to Pacific.
Are you referring to this?
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
Thank you. I appreciate all the help. Should I wait for Adam to comment? At the moment, the cluster is functioning enough to maintain running vms, so if it’s wise to wait, I can do that.
-jeremy
On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block <eblock@xxxxxx> wrote:
I haven't tried it this way yet, and I had hoped that Adam would chime
in, but my approach would be to remove this key (it's not present when
no upgrade is in progress):
ceph config-key rm mgr/cephadm/upgrade_state
Then roll back the two newer MGRs to Pacific as described before. If
they come up healthy, first test whether the orchestrator works
properly. For example, remove a node-exporter, a crash daemon, or
anything else non-critical and let it redeploy.
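Something like this should do it (just a sketch, the daemon name is a
placeholder, pick a real one from your ceph orch ps output):

ceph orch ps --daemon-type node-exporter
# --force is needed because the daemon belongs to a managed service;
# the service spec should redeploy it automatically
ceph orch daemon rm node-exporter.cn01 --force
ceph orch ps --daemon-type node-exporter --refresh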
If that works, try a staggered upgrade, starting with the MGRs only:
ceph orch upgrade start --image <image-name> --daemon-types mgr
Since there's no need to go to Quincy, I suggest upgrading to Reef
18.2.4 (or waiting until 18.2.5 is released, which should be very
soon), so set the respective <image-name> in the above command.
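For example, assuming you go with 18.2.4 (adjust the tag if you wait
for 18.2.5):

ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types mgr
ceph orch upgrade status   # monitor progress
ceph -W cephadm            # follow the cephadm log channel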
If all three MGRs successfully upgrade, you can continue with the
MONs, or with the entire rest.
In production clusters, I usually do staggered upgrades, e.g. I limit
the number of OSD daemons first just to see if they come up healthy,
then I let it upgrade all the other OSDs automatically.
https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
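A rough sketch of what that looks like (the --limit value is just an
example):

# upgrade only two OSDs first
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 2
# if they come up healthy, let the rest follow
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4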
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Snipped some of the irrelevant logs to keep message size down.
ceph config-key get mgr/cephadm/upgrade_state
{"target_name": "quay.io/ceph/ceph:v17.2.0", "progress_id":
"e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":
"e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
"target_digests":
["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services": null, "total_count": null, "remaining_count":
null}
What should I do next?
Thank you!
-jeremy
On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <eblock@xxxxxx> wrote:
Can you check if you have this config-key?
ceph config-key get mgr/cephadm/upgrade_state
If you reset the MGRs, it might be necessary to clear this key first,
otherwise you might end up in an inconsistent state. Just to be sure.
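If you do clear it, I would save the key first, just as a precaution:

ceph config-key get mgr/cephadm/upgrade_state > upgrade_state.json   # backup
ceph config-key rm mgr/cephadm/upgrade_state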
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Thanks. I’m trying to be extra careful since this cluster is
actually in use. I’ll wait for your feedback.
-jeremy
On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <eblock@xxxxxx> wrote:
No, that's not necessary; just edit the unit.run file for the MGRs to
use a different image. See Frédéric's instructions:
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
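Roughly, for each of the two MGRs (a sketch, using your FSID and one
MGR name from your ceph -s output as an example):

systemctl stop ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1@mgr.cn03.negzvb.service
# in unit.run, replace the v17 image with the old one,
# e.g. quay.io/ceph/ceph:v16.2.15
vi /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/mgr.cn03.negzvb/unit.run
systemctl start ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1@mgr.cn03.negzvb.service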
But I'm not entirely sure if you need to clear some config-keys first
in order to reset the upgrade state. If I have time, I'll try to check
tomorrow, or on Monday.
Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:
Would I follow this process to downgrade?
Thank you
On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
ceph -s claims things are healthy:

ceph -s
  cluster:
    id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
    mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
    mds: 1/1 daemons up, 2 standby
    osd: 15 osds: 15 up (since 19h), 15 in (since 14M)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 610 pgs
    objects: 284.59k objects, 1.1 TiB
    usage:   3.3 TiB used, 106 TiB / 109 TiB avail
    pgs:     610 active+clean

  io:
    client:   255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
—
How do I downgrade if the orch is down?
Thank you
-jeremy
On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <eblock@xxxxxx> wrote:
It would help if you only pasted the relevant parts. Apart from the
[balancer INFO root] messages, these sections stand out:
---snip---
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700  0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
    self.upgrade = CephadmUpgrade(self)
  File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
    self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
  File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
    return cls(**c)
TypeError: __init__() got an unexpected keyword argument 'daemon_types'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
    self.client = SnapSchedClient(self)
  File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
    with self.get_schedule_db(fs_name) as conn_mgr:
  File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
    db.executescript(dump)
sqlite3.OperationalError: table schedules already exists
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
---snip---
Your cluster seems to be in an error state (ceph -s) because of an
unknown PG. It's recommended to have a healthy cluster before
attempting an upgrade. It's possible that these errors come from the
MGR that wasn't upgraded yet, I'm not sure.
Since the upgrade was only successful for two MGRs, I am thinking
about downgrading both MGRs back to 16.2.15, then retrying an upgrade
to a newer version, either 17.2.8 or 18.2.4. I haven't checked the
snap_schedule error yet, though. Maybe someone else knows that already.
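To see which daemons are still on which version, this should work even
with the orchestrator down, since it only queries the MONs:

ceph versions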