Re: 18.2.2: Upgrade not starting (ceph orch upgrade)

"Anthony D'Atri" <anthony.datri@xxxxxxxxx> · Wed, 30 Apr 2025 07:38:27 -0400

I’ve had a similar experience with Reef, trying to destroy an improperly deployed OSD on a viable drive.   I had to ceph-volume lvm zap to get past the purge and at no point was the OSD marked as destroyed.  

> On Apr 30, 2025, at 6:15 AM, Eugen Block <eblock@xxxxxx> wrote:
> 
> Right, 'ceph osd destroy' most likely won't help you here. The --replace flag is only to mark an OSD as destroyed (so it will reuse its ID after replacing the drive).
> You wrote that stopping osd rm for 253 unblocked the upgrade, so the cluster is currently upgrading?
> 
> To clear the pending state, I would stop rm for the other OSD as well since it's already out and down anyway. You can always zap a drive, either directly on the host with:
> 
> cephadm ceph-volume lvm zap --destroy /dev/sdX
> 
> Or using the orchestrator:
> 
> orch device zap <hostname> <path> [--force]
> 
> But just to clarify, OSD.381 is already the replacement disk for a previously failed drive? If you zap it, the orchestrator would try to apply any matching spec and create a new OSD, probably with ID 381 again.
> 
> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> 
>> Frédéric,
>> 
>> My situation is a bit different I think. I had two malfunctionning OSDs that I removed with `ceph orch osd rm --replace --zap`: 1 was really dead and no longer seen by the OS (osd.253) and the other one with a lot of HW errors but still there (osd.381). Both have been successfully marked as destroyed in the CRUSH map. Just I didn't realize that cephadm was retrying every 10s to zap osd.253, getting an error as the disk could not be found. Looking at the removal status this morning(the removal was done ~2 weeks ago) with 'ceph orch osd rm status', I got:
>> 
>> OSD  HOST      STATE                    PGS  REPLACE  FORCE ZAP    DRAIN STARTED AT
>> 253  dig-osd4  done, waiting for purge    0  True     False  True
>> 381  dig-osd6  done, waiting for purge    0  True     False False  2025-04-23 11:56:09.864724+00:00
>> 
>> I don't know what the status "waiting to purge" means... but we can see that cephadm considers that the drain never started for osd.253, as the device was unavailable I guess... What happens with dig-osd6 is less clear to me but may be the consequence that at some point the disk freed by the initial rm was picked up by cephadm and readded as the replacement OSD because we forgot to set the osd.all-available-devices service to unmanaged. The drain started on Apr 23 is the second rm I did after fixing osd.all-available-devices service. For this second attempt, I didn't specify --zap, not sure why ( a mistake!).
>> 
>> I have the feeling but I may be wrong, that 'ceph osd destroy' will not help as they are already marked destroyed in the CRUSH map...
>> 
>> I'm wondering wether I should do 'ceph orch osd rm stop 381'  as I did for 253 or whether it will impact the replacement later. Or said in a different way, is the replace flag something managed by cephadm and requiring the OSD to stay in the "rm queue" until the replacement is done?
>> 
>> Best regards,
>> 
>> Michel
>> 
>>> Le 30/04/2025 à 10:50, Frédéric Nass a écrit :
>>> Hi Michel,
>>> 
>>> I've seen this recently on Reef (OSD stuck in the rm queue with the orchestrator tryng to zap a device that had already been zapped).
>>> 
>>> I could reproduce this a few times by deleting a batch of OSDs running on the same node. The whole 'ceph orch osd rm' process would stop progressing when trying to remove the ~8th OSD. I suspect that ceph-volume or the orchestrator is misinformed at some point that the device has already been zapped, looping over and over trying to remove this device that doesn't exist anymore.
>>> 
>>> I think you should now run 'ceph osd destroy <OSD_ID> --yes-i-really-mean-it'.
>>> 
>>> Regards,
>>> Frédéric.
>>> 
>>> ----- Le 30 Avr 25, à 10:28, Michel Jouvin michel.jouvin@xxxxxxxxxxxxxxx a écrit :
>>> 
>>>> Eugen,
>>>> 
>>>> Thanks, I forgot that operation started with the orchestrator can be
>>>> stopped. You were right: stopping the 'osd rm' was enough to unblock the
>>>> upgrade. I am not completely sure what is the consequence on the replace
>>>> flag: I have the feeling it has been lost somehow as the OSD is no
>>>> longer listed by 'ceph orch osd rm status' and 'ceph -s' reports now one
>>>> OSD down and 1 stray daemon instead of 2 stray daemons.
>>>> 
>>>> Michel
>>>> 
>>>> Le 30/04/2025 à 09:24, Eugen Block a écrit :
>>>>> You can stop the osd removal:
>>>>> 
>>>>> ceph orch osd rm stop <OSD_ID>
>>>>> 
>>>>> I'm not entirely sure what the orchestrator will do except for
>>>>> clearing the pending state, and since the OSDs are already marked as
>>>>> destroyed in the crush tree, I wouldn't expect anything weird. But
>>>>> it's worth a try, I guess.
>>>>> 
>>>>> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I had no time to really investigate more our problem yesterday. But I
>>>>>> realized one issue that may explain the problem with osd.253: the
>>>>>> underlying disk is so dead that it is no longer visible by the OS.
>>>>>> Probably I added --zap when I did the 'ceph orch osd rm' and thus it
>>>>>> is trying to do the zapping, fails as it doesn't find the disk and
>>>>>> retries indefinitely... I remain a little bit surprise that this
>>>>>> zapping error is not reported (without the traceback) at the INFO
>>>>>> level and requires DEBUG to be seen but it is a detail. I'm surprised
>>>>>> that Ceph is not giving up on zapping if it cannot access the device
>>>>>> or did I miss something and there is a way to stop this process?
>>>>>> 
>>>>>> May be it is a corner case that has been fixed/improved since
>>>>>> 18.2.2... Anyway, the question remains: is there a way out of this
>>>>>> problem (that seems the only reason for the upgrade not really
>>>>>> starting) apart from getting the replacement device?
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Michel
>>>>>> 
>>>>>> Le 28/04/2025 à 18:19, Michel Jouvin a écrit :
>>>>>>> Hi Frédéric,
>>>>>>> 
>>>>>>> Thanks for the command. I'm always looking at the wrong page of the
>>>>>>> doc! I looked at
>>>>>>> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/
>>>>>>> that list the Ceph subsystem and their default log level but there
>>>>>>> is no mention of cephadm there... After enabling cephadm debug log
>>>>>>> level and restarting the upgrade, I got the messages below. The only
>>>>>>> thing strange points to the problem with osd.253 where it tries to
>>>>>>> zap the device that was probably already zapped and thus cannot find
>>>>>>> the LV volume associated with osd.253. There is not really any other
>>>>>>> messages saying the impact on the upgrade but I guess it is the
>>>>>>> reason. What do you think ? And is there any way to fix it, other
>>>>>>> than replacing the OSD?
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Michel
>>>>>>> 
>>>>>>> --------------------- cephadm debug level log -------------------------
>>>>>>> 
>>>>>>> 2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade:
>>>>>>> Started with target quay.io/ceph/ceph:v18.2.6
>>>>>>> 2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed
>>>>>>> host dig-osd4 devices (23)
>>>>>>> 2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding
>>>>>>> OSDSpecs for host: <dig-osd4>
>>>>>>> 2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating
>>>>>>> OSDSpec previews for []
>>>>>>> 2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading
>>>>>>> OSDSpec previews to HostCache for host <dig-osd4>
>>>>>>> 2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'config generate-minimal-conf' -> 0 in 0.005s
>>>>>>> 2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'auth get' -> 0 in 0.277s
>>>>>>> 2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine
>>>>>>> hosts with existing daemons [] + new hosts.... (very long line)
>>>>>>> 
>>>>>>> 2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> _update_paused_health
>>>>>>> 2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd df' -> 0 in 0.064s
>>>>>>> 2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd df' -> 0 in 0.320s
>>>>>>> 2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are
>>>>>>> scheduled for removal: [osd.381, osd.253]
>>>>>>> 2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd df' -> 0 in 0.011s
>>>>>>> 2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
>>>>>>> 2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>>>>> safe-to-destroy returns:
>>>>>>> 2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running
>>>>>>> cmd: osd down on ids [osd.381]
>>>>>>> 2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd down' -> 0 in 0.001s
>>>>>>> 2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381
>>>>>>> now down
>>>>>>> 2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>> osd.381 on dig-osd6 was already removed
>>>>>>> 2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s
>>>>>>> 2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>>>>> destroy-actual returns:
>>>>>>> 2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement
>>>>>>> 2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing
>>>>>>> osd.381 from the queue.
>>>>>>> 2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd df' -> 0 in 0.010s
>>>>>>> 2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
>>>>>>> 2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>>>>> safe-to-destroy returns:
>>>>>>> 2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running
>>>>>>> cmd: osd down on ids [osd.253]
>>>>>>> 2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd down' -> 0 in 0.000s
>>>>>>> 2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253
>>>>>>> now down
>>>>>>> 2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>> osd.253 on dig-osd4 was already removed
>>>>>>> 2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s
>>>>>>> 2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>>>>> destroy-actual returns:
>>>>>>> 2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement
>>>>>>> 2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping
>>>>>>> devices for osd.253 on dig-osd4
>>>>>>> 2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> _run_cephadm : command = ceph-volume
>>>>>>> 2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG]
>>>>>>> _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253',
>>>>>>> '--destroy']
>>>>>>> 2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd
>>>>>>> container image
>>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> 2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args:
>>>>>>> --image
>>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> --timeout 895 ceph-volume --fsid
>>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
>>>>>>> 2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running
>>>>>>> command: which python3
>>>>>>> 2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running
>>>>>>> command: /usr/bin/python3
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d
>>>>>>> --image
>>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> --timeout 895 ceph-volume --fsid
>>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
>>>>>>> 2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1
>>>>>>> 2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err:
>>>>>>> Inferring config
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf
>>>>>>> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
>>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
>>>>>>> --privileged --group-add=disk --init -e
>>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e
>>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
>>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z
>>>>>>> -v
>>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z
>>>>>>> -v
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z
>>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
>>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
>>>>>>> /run/lock/lvm:/run/lock/lvm -v
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro
>>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v
>>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z
>>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> lvm zap --osd-id 253 --destroy
>>>>>>> /usr/bin/podman: stderr Traceback (most recent call last):
>>>>>>> /usr/bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in
>>>>>>> <module>
>>>>>>> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0',
>>>>>>> 'console_scripts', 'ceph-volume')()
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in
>>>>>>> __init__
>>>>>>> /usr/bin/podman: stderr     self.main(self.argv)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line
>>>>>>> 59, in newfunc
>>>>>>> /usr/bin/podman: stderr     return f(*a, **kw)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in
>>>>>>> main
>>>>>>> /usr/bin/podman: stderr     terminal.dispatch(self.mapper,
>>>>>>> subcommand_args)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line
>>>>>>> 194, in dispatch
>>>>>>> /usr/bin/podman: stderr     instance.main()
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py",
>>>>>>> line 46, in main
>>>>>>> /usr/bin/podman: stderr     terminal.dispatch(self.mapper, self.argv)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line
>>>>>>> 194, in dispatch
>>>>>>> /usr/bin/podman: stderr     instance.main()
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>>>>> line 403, in main
>>>>>>> /usr/bin/podman: stderr     self.zap_osd()
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line
>>>>>>> 16, in is_root
>>>>>>> /usr/bin/podman: stderr     return func(*a, **kw)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>>>>> line 301, in zap_osd
>>>>>>> /usr/bin/podman: stderr     devices =
>>>>>>> find_associated_devices(self.args.osd_id, self.args.osd_fsid)
>>>>>>> /usr/bin/podman: stderr   File
>>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>>>>> line 88, in find_associated_devices
>>>>>>> /usr/bin/podman: stderr     '%s' % osd_id or osd_fsid)
>>>>>>> /usr/bin/podman: stderr RuntimeError: Unable to find any LV for
>>>>>>> zapping OSD: 253
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/lib64/python3.9/runpy.py", line 197, in
>>>>>>> _run_module_as_main
>>>>>>>     return _run_code(code, main_globals, None,
>>>>>>>   File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
>>>>>>>     exec(code, run_globals)
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 10700, in <module>
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 10688, in main
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 2445, in _infer_config
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 2361, in _infer_fsid
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 2473, in _infer_image
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 2348, in _validate_fsid
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 6970, in command_ceph_volume
>>>>>>>   File
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>>>>> line 2136, in call_throws
>>>>>>> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
>>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
>>>>>>> --privileged --group-add=disk --init -e
>>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e
>>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
>>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z
>>>>>>> -v
>>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z
>>>>>>> -v
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z
>>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
>>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
>>>>>>> /run/lock/lvm:/run/lock/lvm -v
>>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro
>>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v
>>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z
>>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>>>>> lvm zap --osd-id 253 --destroy
>>>>>>> 2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop
>>>>>>> sleep
>>>>>>> 
>>>>>>> -----------------------
>>>>>>> 
>>>>>>> 
>>>>>>> Le 28/04/2025 à 14:00, Frédéric Nass a écrit :
>>>>>>>> Hi Michel,
>>>>>>>> 
>>>>>>>> You need to turn on cephadm debugging as described here [1] in the
>>>>>>>> documentation
>>>>>>>> 
>>>>>>>> $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>>>>>>>> 
>>>>>>>> and then look for any hints with
>>>>>>>> 
>>>>>>>> $ ceph -W cephadm --watch-debug
>>>>>>>> 
>>>>>>>> or
>>>>>>>> 
>>>>>>>> $ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the
>>>>>>>> active MGR)
>>>>>>>> 
>>>>>>>> when you start/stop the upgrade.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Frédéric.
>>>>>>>> 
>>>>>>>> [1] https://docs.ceph.com/en/reef/cephadm/operations/
>>>>>>>> 
>>>>>>>> ----- Le 28 Avr 25, à 12:52, Michel Jouvin
>>>>>>>> michel.jouvin@xxxxxxxxxxxxxxx a écrit :
>>>>>>>> 
>>>>>>>>> Eugen,
>>>>>>>>> 
>>>>>>>>> Thanks for doing the test. I scanned all logs and cannot find
>>>>>>>>> anything
>>>>>>>>> except the message mentioned displayed every 10s about the removed
>>>>>>>>> OSDs
>>>>>>>>> that led me to think there is something not exactly as expected...
>>>>>>>>> No clue
>>>>>>>>> what...
>>>>>>>>> 
>>>>>>>>> Michel
>>>>>>>>> Sent from my mobile
>>>>>>>>> Le 28 avril 2025 12:43:23 Eugen Block <eblock@xxxxxx> a écrit :
>>>>>>>>> 
>>>>>>>>>> I just tried this on a single-node virtual test cluster, deployed it
>>>>>>>>>> with 18.2.2. Then I removed one OSD with --replace flag (no --zap,
>>>>>>>>>> otherwise it would redeploy the OSD on that VM). Then I also see the
>>>>>>>>>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
>>>>>>>>>> successfully. That's why I don't think the stray daemon is the root
>>>>>>>>>> cause here. I would suggest to scan monitor and cephadm logs as
>>>>>>>>>> well.
>>>>>>>>>> After the upgrade to 18.2.6 the stray warning cleared, btw.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>>>>>>>> 
>>>>>>>>>>> Eugen,
>>>>>>>>>>> 
>>>>>>>>>>> As said in a previous message, I found a tracker issue with a
>>>>>>>>>>> similar problem: https://tracker.ceph.com/issues/67018, even if the
>>>>>>>>>>> cause may be different as it is in older versions than me. For some
>>>>>>>>>>> reasons the sequence of messages every 10s is now back on the 2
>>>>>>>>>>> OSDs:
>>>>>>>>>>> 
>>>>>>>>>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>>>>> osd.253 now down
>>>>>>>>>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>>>>>> osd.253 on dig-osd4 was already removed
>>>>>>>>>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for
>>>>>>>>>>> replacement
>>>>>>>>>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
>>>>>>>>>>> devices for osd.253 on dig-osd4
>>>>>>>>>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>>>>> osd.381 now down
>>>>>>>>>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>>>>>> osd.381 on dig-osd6 was already removed
>>>>>>>>>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for
>>>>>>>>>>> replacement
>>>>>>>>>>> 
>>>>>>>>>>> except that the "Zapping.." message is not present for the
>>>>>>>>>>> second OSD...
>>>>>>>>>>> 
>>>>>>>>>>> I tried to increase the mgr log verbosity with 'ceph tell
>>>>>>>>>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and there
>>>>>>>>>>> stop/start
>>>>>>>>>>> the upgrade without any additonal message displayed.
>>>>>>>>>>> 
>>>>>>>>>>> Michel
>>>>>>>>>>> 
>>>>>>>>>>> Le 28/04/2025 à 09:20, Eugen Block a écrit :
>>>>>>>>>>>> Have you increased the debug level for the mgr? It would surprise
>>>>>>>>>>>> me if stray daemons would really block an upgrade. But debug logs
>>>>>>>>>>>> might reveal something. And if it can be confirmed that the strays
>>>>>>>>>>>> really block the upgrade, you could either remove the OSDs
>>>>>>>>>>>> entirely
>>>>>>>>>>>> (they are already drained) to continue upgrading, or create a
>>>>>>>>>>>> tracker issue to report this and wait for instructions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Eugen,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes I stopped and restarted the upgrade several times already, in
>>>>>>>>>>>>> particular after failing over the mgr. And the only messages
>>>>>>>>>>>>> related are the upgrade started and upgrade canceled ones.
>>>>>>>>>>>>> Nothing
>>>>>>>>>>>>> related to an error or a crash...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For me the question is why do I have stray daemons after removing
>>>>>>>>>>>>> OSD. IMO it is unexpected as these daemons are not there anymore.
>>>>>>>>>>>>> I can understand that stray daemons prevent the upgrade to start
>>>>>>>>>>>>> if they are really strayed... And it would be nice if cephadm was
>>>>>>>>>>>>> giving a message about why the upgrade does not really start
>>>>>>>>>>>>> despite its status is "in progress"...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Michel
>>>>>>>>>>>>> Sent from my mobile
>>>>>>>>>>>>> Le 28 avril 2025 07:27:44 Eugen Block <eblock@xxxxxx> a écrit :
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Do you see anything in the mgr log? To get fresh logs I would
>>>>>>>>>>>>>> cancel
>>>>>>>>>>>>>> the upgrade (ceph orch upgrade stop) and then try again.
>>>>>>>>>>>>>> A workaround could be to manually upgrade the mgr daemons by
>>>>>>>>>>>>>> changing
>>>>>>>>>>>>>> their unit.run file, but that would be my last resort. Btwm
>>>>>>>>>>>>>> did you
>>>>>>>>>>>>>> stop and start the upgrade after failing the mgr as well?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Eugen,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks for the hint. Here is the osd_remove_queue:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph config-key get
>>>>>>>>>>>>>>> mgr/cephadm/osd_remove_queue|jq
>>>>>>>>>>>>>>> [
>>>>>>>>>>>>>>>   {
>>>>>>>>>>>>>>>     "osd_id": 253,
>>>>>>>>>>>>>>>     "started": true,
>>>>>>>>>>>>>>>     "draining": false,
>>>>>>>>>>>>>>>     "stopped": false,
>>>>>>>>>>>>>>>     "replace": true,
>>>>>>>>>>>>>>>     "force": false,
>>>>>>>>>>>>>>>     "zap": true,
>>>>>>>>>>>>>>>     "hostname": "dig-osd4",
>>>>>>>>>>>>>>>     "drain_started_at": null,
>>>>>>>>>>>>>>>     "drain_stopped_at": null,
>>>>>>>>>>>>>>>     "drain_done_at": "2025-04-15T14:09:30.521534Z",
>>>>>>>>>>>>>>>     "process_started_at": "2025-04-15T14:09:14.091592Z"
>>>>>>>>>>>>>>>   },
>>>>>>>>>>>>>>>   {
>>>>>>>>>>>>>>>     "osd_id": 381,
>>>>>>>>>>>>>>>     "started": true,
>>>>>>>>>>>>>>>     "draining": false,
>>>>>>>>>>>>>>>     "stopped": false,
>>>>>>>>>>>>>>>     "replace": true,
>>>>>>>>>>>>>>>     "force": false,
>>>>>>>>>>>>>>>     "zap": false,
>>>>>>>>>>>>>>>     "hostname": "dig-osd6",
>>>>>>>>>>>>>>>     "drain_started_at": "2025-04-23T11:56:09.864724Z",
>>>>>>>>>>>>>>>     "drain_stopped_at": null,
>>>>>>>>>>>>>>>     "drain_done_at": "2025-04-25T06:53:03.678729Z",
>>>>>>>>>>>>>>>     "process_started_at": "2025-04-23T11:56:05.924923Z"
>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It is not empty the two stray daemons are listed. Not sure
>>>>>>>>>>>>>>> it these
>>>>>>>>>>>>>>> entries are expected as I specified --replace... A similar
>>>>>>>>>>>>>>> issue was
>>>>>>>>>>>>>>> reported in https://tracker.ceph.com/issues/67018 so before
>>>>>>>>>>>>>>> Reef but
>>>>>>>>>>>>>>> the cause may be different. Still not clear for me how to
>>>>>>>>>>>>>>> get out of
>>>>>>>>>>>>>>> this, except may be replacing the OSDs but this will take
>>>>>>>>>>>>>>> some time...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Michel
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Le 27/04/2025 à 10:21, Eugen Block a écrit :
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> what's the current ceph status? Wasn't there a bug in early
>>>>>>>>>>>>>>>> Reef
>>>>>>>>>>>>>>>> versions preventing upgrades if there were removed OSDs in the
>>>>>>>>>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> And then I would check the mgr log, maybe set it to a
>>>>>>>>>>>>>>>> higher debug
>>>>>>>>>>>>>>>> level to see what's blocking it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I tried to restart all the mgrs (we have 3, 1 active, 2
>>>>>>>>>>>>>>>>> standby)
>>>>>>>>>>>>>>>>> by executing 3 times the `ceph mgr fail`, no impact. I don't
>>>>>>>>>>>>>>>>> really understand why I get these stray daemons after doing a
>>>>>>>>>>>>>>>>> 'ceph orch osd rm --replace` but I think I have always
>>>>>>>>>>>>>>>>> seen this.
>>>>>>>>>>>>>>>>> I tried to mute rather than disable the stray daemon check
>>>>>>>>>>>>>>>>> but it
>>>>>>>>>>>>>>>>> doesn't help either. And I find strange this message every
>>>>>>>>>>>>>>>>> 10s
>>>>>>>>>>>>>>>>> about one of the destroyed OSD and only one, reporting it
>>>>>>>>>>>>>>>>> is down
>>>>>>>>>>>>>>>>> and already destroyed and saying it'll zap it (I think I
>>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>> add --zap when I removed it as the underlying disk is dead).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'm completely stuck with this upgrade and I don't
>>>>>>>>>>>>>>>>> remember having
>>>>>>>>>>>>>>>>> this kind of problems in previous upgrades with cephadm...
>>>>>>>>>>>>>>>>> Any
>>>>>>>>>>>>>>>>> idea where to look for the cause and/or how to fix it?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Michel
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Le 24/04/2025 à 23:34, Michel Jouvin a écrit :
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to
>>>>>>>>>>>>>>>>>> 18.2.6,
>>>>>>>>>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph
>>>>>>>>>>>>>>>>>> orch
>>>>>>>>>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message
>>>>>>>>>>>>>>>>>> saying
>>>>>>>>>>>>>>>>>> that the upgrade has been initiated, with a similar
>>>>>>>>>>>>>>>>>> message in
>>>>>>>>>>>>>>>>>> the logs but nothing happens after this. 'ceph orch upgrade
>>>>>>>>>>>>>>>>>> status' says:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -------
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status
>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>     "target_image": "quay.io/ceph/ceph:v18.2.6",
>>>>>>>>>>>>>>>>>>     "in_progress": true,
>>>>>>>>>>>>>>>>>>     "which": "Upgrading all daemon types on all hosts",
>>>>>>>>>>>>>>>>>>     "services_complete": [],
>>>>>>>>>>>>>>>>>>     "progress": "",
>>>>>>>>>>>>>>>>>>     "message": "",
>>>>>>>>>>>>>>>>>>     "is_paused": false
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> -------
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The first time I entered the command, the cluster status was
>>>>>>>>>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by
>>>>>>>>>>>>>>>>>> destroyed OSDs,
>>>>>>>>>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to
>>>>>>>>>>>>>>>>>> false
>>>>>>>>>>>>>>>>>> to ignore these 2 daemons, the cluster is now HEALTH_OK
>>>>>>>>>>>>>>>>>> but it
>>>>>>>>>>>>>>>>>> doesn't help. Following a Red Hat KB entry, I tried to
>>>>>>>>>>>>>>>>>> failover
>>>>>>>>>>>>>>>>>> the mgr, stopped an restarted the upgrade but without any
>>>>>>>>>>>>>>>>>> improvement. I have not seen anything in the logs, except
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD
>>>>>>>>>>>>>>>>>> saying:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
>>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on
>>>>>>>>>>>>>>>>>> dig-osd4
>>>>>>>>>>>>>>>>>> was already removed
>>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully
>>>>>>>>>>>>>>>>>> destroyed old
>>>>>>>>>>>>>>>>>> osd.253 on dig-osd4; ready for replacement
>>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for
>>>>>>>>>>>>>>>>>> osd.253
>>>>>>>>>>>>>>>>>> on dig-osd4
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The message seems to be only for one of the 2 destroyed OSDs
>>>>>>>>>>>>>>>>>> since I restarted the mgr. May this be the cause for the
>>>>>>>>>>>>>>>>>> stucked
>>>>>>>>>>>>>>>>>> upgrade? What can I do for fixing this?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks in advance for any hint. Best regards,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Michel
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>> 
> 
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx