Re: Cephadm upgrade from 16.2.15 -> 17.2.0


 



Just to follow this through: 18.2.6 fixed my issues and I was able to complete the upgrade. Is it advisable to go to 19, or should I stay on Reef?

-jeremy




On Monday, Apr 14, 2025 at 12:14 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
Thanks.  I’ll wait.  I need this to go smoothly on another cluster that has to go through the same process.

-jeremy



On Monday, Apr 14, 2025 at 12:10 AM, Eugen Block <eblock@xxxxxx> wrote:
Ah, this looks like the encryption issue which seems new in 18.2.5,
brought up here:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/UJ4DREAWNBBVVUJXYVZO25AYVQ5RLT42/

In that case it's questionable whether you really want to upgrade to
18.2.5. Maybe 18.2.4 would be more suitable, although it's missing bug
fixes from .5 (like the RGW memory leak). If you really need to
upgrade, I would go with .4; otherwise stay on Pacific until this
issue has been addressed. It's not an easy decision. ;-)

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

I haven’t attempted the remaining upgrade just yet. I wanted to
check on this before proceeding. Things seem “stable” in the sense
that I’m running VMs and all volumes and images are still
functioning. I’m using whatever would have been the default from
16.2.14. It seems to happen from time to time: I receive Nagios
alerts, which eventually clear and then reappear.

HEALTH_WARN Failed to apply 1 service(s): osd.cost_capacity
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.cost_capacity
osd.cost_capacity: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/mon.cn02/config
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c -e NODE_NAME=cn02.ceph.xyz.corp -e CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt --yes --no-systemd
/usr/bin/podman: stderr Traceback (most recent call last):
/usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 33, in <module>
/usr/bin/podman: stderr sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')())
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 54, in __init__
/usr/bin/podman: stderr self.main(self.argv)
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line 59, in newfunc
/usr/bin/podman: stderr return f(*a, **kw)
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 166, in main
/usr/bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/podman: stderr instance.main()
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
/usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv)
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 192, in dispatch
/usr/bin/podman: stderr instance = mapper.get(arg)(argv[count:])
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/batch.py", line 325, in __init__
/usr/bin/podman: stderr self.args = parser.parse_args(argv)
/usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1825, in parse_args
/usr/bin/podman: stderr args, argv = self.parse_known_args(args, namespace)
/usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1858, in parse_known_args
/usr/bin/podman: stderr namespace, args = self._parse_known_args(args, namespace)
/usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 2067, in _parse_known_args
/usr/bin/podman: stderr start_index = consume_optional(start_index)
/usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 2007, in consume_optional
/usr/bin/podman: stderr take_action(action, args, option_string)
/usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1935, in take_action
/usr/bin/podman: stderr action(self, namespace, argument_values, option_string)
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/util/arg_validators.py", line 17, in __call__
/usr/bin/podman: stderr set_dmcrypt_no_workqueue()
/usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/util/encryption.py", line 54, in set_dmcrypt_no_workqueue
/usr/bin/podman: stderr raise RuntimeError('Error while checking cryptsetup version.\n',
/usr/bin/podman: stderr RuntimeError: ('Error while checking cryptsetup version.\n', '`cryptsetup --version` output:\n', 'cryptsetup 2.7.2 flags: UDEV BLKID KEYRING FIPS KERNEL_CAPI PWQUALITY ')
Traceback (most recent call last):
File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 11009, in <module>
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 10997, in main
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2593, in _infer_config
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2509, in _infer_fsid
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2621, in _infer_image
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2496, in _validate_fsid
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 7226, in command_ceph_volume
File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2284, in call_throws
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c -e NODE_NAME=cn02.ceph.xyz.corp -e CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt --yes --no-systemd
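
The last frame is set_dmcrypt_no_workqueue() in ceph_volume/util/encryption.py failing on the `cryptsetup --version` output shown above. A rough shell illustration of the kind of strict version match that trips over the trailing build flags (an assumed pattern for illustration, not the actual ceph-volume code):

# a whole-line match like this finds nothing in the 2.7.2 output, hence the RuntimeError
echo 'cryptsetup 2.7.2 flags: UDEV BLKID KEYRING FIPS KERNEL_CAPI PWQUALITY ' \
  | grep -Ex 'cryptsetup [0-9]+\.[0-9]+\.[0-9]+' \
  || echo 'version not parsed'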



ceph orch ls osd --export
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  encrypted: true
  filter_logic: AND
  objectstore: bluestore
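
One possible stop-gap, not discussed further in the thread: mark the failing spec as unmanaged so cephadm stops retrying the apply until the ceph-volume issue is fixed. A sketch (the file name is just an example):

# take the osd.cost_capacity section from the export above,
# add "unmanaged: true" at the top level and save it as cost_capacity.yaml, then:
ceph orch apply -i cost_capacity.yaml
# existing OSDs keep running; cephadm only stops creating new ones from this spec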

Thank you
-jeremy

On Sunday, Apr 13, 2025 at 11:48 PM, Eugen Block <eblock@xxxxxx> wrote:
Are you using Rook? Usually, I see this warning when a host is not
reachable, for example during a reboot. But it also clears when the
host comes back. Do you see this permanently or from time to time? It
might have to do with the different Ceph versions, I'm not sure. But
it shouldn't be a show stopper for the remaining upgrade. Or are you
trying to deploy OSDs but it fails? You can paste

ceph health detail
ceph orch ls osd --export

You can also scan the cephadm.log for any hints.
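
For example, something like this on the affected host (the path is the usual cephadm default; adjust if yours differs):

grep -iE 'error|traceback' /var/log/ceph/cephadm.log | tail -n 50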


Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

This looks relevant.

https://github.com/rook/rook/issues/13600#issuecomment-1905860331

On Sunday, Apr 13, 2025 at 10:08 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
I’m now seeing this:

cluster:
id: 95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
health: HEALTH_WARN
Failed to apply 1 service(s): osd.cost_capacity


I’m assuming this is due to the fact that I’ve only upgraded mgr
but I wanted to double check before proceeding with the rest of the
components.

Thanks
-jeremy




On Sunday, Apr 13, 2025 at 12:59 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
Updating the mgrs to 18.2.5 seemed to work just fine. I will go for the remaining services after the weekend. Thanks.

-jeremy



On Thursday, Apr 10, 2025 at 6:37 AM, Eugen Block <eblock@xxxxxx> wrote:
Glad I could help! I'm also waiting for 18.2.5 to upgrade our own
cluster from Pacific after getting rid of our cache tier. :-D

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

This seems to have worked to get the orch back up and put me back to 16.2.15. Thank you. Debating whether to wait for 18.2.5 to move forward.

-jeremy

On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <eblock@xxxxxx> wrote:
Still no, just edit the unit.run file for the MGRs to use a different image. See Frédéric's instructions (now that I'm re-reading it, there's a little mistake with dots and hyphens):

# Backup the unit.run file
$ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}

# Change container image's signature. You can get the signature of the version you want to reach from https://quay.io/repository/ceph/ceph?tab=tags. It's in the URL of a version.
$ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run

# Restart the container (systemctl daemon-reload not needed)
$ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service

# Run this command a few times and it should show the new version
ceph orch ps --refresh --hostname ceph01 | grep mgr

To get the image signature, you can also look into the other unit.run files; a version tag would also work.

It depends on how often you need the orchestrator to maintain the cluster. If you have the time, you could wait a bit longer for other responses. If you need the orchestrator in the meantime, you can roll back the MGRs.



https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

Thank you. The only thing I’m unclear on is the rollback to Pacific.

Are you referring to





https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

Thank you. I appreciate all the help. Should I wait for Adam to comment? At the moment, the cluster is functioning well enough to maintain running VMs, so if it’s wise to wait, I can do that.

-jeremy

On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block <eblock@xxxxxx> wrote:
I haven't tried it this way yet, and I had hoped that Adam would chime in, but my approach would be to remove this key (it's not present when no upgrade is in progress):

ceph config-key rm mgr/cephadm/upgrade_state
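
Before removing it, it probably can't hurt to keep a copy, something like:

ceph config-key get mgr/cephadm/upgrade_state > upgrade_state.json.bak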

Then roll back the two newer MGRs to Pacific as described before. If they come up healthy, test whether the orchestrator works properly first. For example, remove a node-exporter or crash or anything else uncritical and let it redeploy.
If that works, try a staggered upgrade, starting with the MGRs only:

ceph orch upgrade start --image <image-name> --daemon-types mgr

Since there's no need to go to Quincy, I suggest upgrading to Reef 18.2.4 (or you wait until 18.2.5 is released, which should be very soon), so set the respective <image-name> in the above command.

If all three MGRs successfully upgrade, you can continue with the MONs, or with the entire rest.

In production clusters, I usually do staggered upgrades, e.g. I limit the number of OSD daemons first just to see if they come up healthy, then I let it upgrade all other OSDs automatically.


https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
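
For instance, something along these lines (a sketch only; flags as per the staggered-upgrade docs above):

# upgrade just a handful of OSD daemons first
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 3
# if those come back healthy, start again without the limit for the rest
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4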

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

Snipped some of the irrelevant logs to keep message size down.

ceph config-key get mgr/cephadm/upgrade_state

{"target_name": "quay.io/ceph/ceph:v17.2.0",
"progress_id":
"e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":

"e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
"target_digests":




["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services":
null,
"total_count":
null,
"remaining_count":
null}

What should I do next?

Thank you!
-jeremy

On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <eblock@xxxxxx> wrote:
Can you check if you have this config-key?

ceph config-key get mgr/cephadm/upgrade_state

If you reset the MGRs, it might be necessary to clear this key, otherwise you might end up in some inconsistency. Just to be sure.

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

Thanks. I’m trying to be extra careful since this cluster is actually in use. I’ll wait for your feedback.

-jeremy

On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <eblock@xxxxxx> wrote:
No, that's not necessary, just edit the unit.run file for the MGRs to use a different image. See Frédéric's instructions:






https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

But I'm not entirely sure if you need to clear some config-keys first in order to reset the upgrade state. If I have time, I'll try to check tomorrow, or on Monday.

Quoting Jeremy Hansen <jeremy@xxxxxxxxxx>:

Would I follow this process to downgrade?







https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

Thank you

On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
ceph -s claims things are healthy:

ceph -s
  cluster:
    id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
    mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
    mds: 1/1 daemons up, 2 standby
    osd: 15 osds: 15 up (since 19h), 15 in (since 14M)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 610 pgs
    objects: 284.59k objects, 1.1 TiB
    usage:   3.3 TiB used, 106 TiB / 109 TiB avail
    pgs:     610 active+clean

  io:
    client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr




How do I downgrade if the orch is down?

Thank you
-jeremy



On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <eblock@xxxxxx> wrote:
It would help if you only pasted the relevant parts. Anyway, these two sections stand out:

---snip---
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade = CephadmUpgrade(self)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: return cls(**c)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')

Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.client = SnapSchedClient(self)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: with self.get_schedule_db(fs_name) as conn_mgr:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: db.executescript(dump)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
---snip---

Your cluster seems to be in an error state (ceph -s) because of an unknown PG. It's recommended to have a healthy cluster before attempting an upgrade. It's possible that these errors come from the not yet upgraded MGR, I'm not sure.

Since the upgrade was only successful for two MGRs, I am thinking about downgrading both MGRs back to 16.2.15, then retrying an upgrade to a newer version, either 17.2.8 or 18.2.4. I haven't checked the snap_schedule error yet, though. Maybe someone else knows that already.


















