Re: Squid: successfully drained host can't be removed

The daemons cephadm "knows" about are really just based on the contents
of the /var/lib/ceph/<fsid>/ directory on each host cephadm is managing.
If osd.6 was present, was removed by the host drain process, and yet its
daemon directory was still on the host (or a container for osd.6 was
still running), that sounds like a bug in the drain process and is worth
a ticket; based on what you described, I think that's what happened. If
I'm misreading and osd.6 was removed manually instead, then it's
possible that either cephadm simply hadn't re-checked the host for
daemons since that removal (you can verify this via the REFRESHED column
of `ceph orch ps`; osd.6 should still have been listed there if you got
this error), or the manual removal didn't clean up the daemon directory,
in which case I wouldn't consider it a bug. Assuming it's the former
case and you can show what was actually left on the host for osd.6, or
you have a consistent way to reproduce the failed removal, I can take a
look.
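
For reference, this is roughly what I'd look at (a minimal sketch; I'm
assuming the fsid from your output, host8 as the affected host, and
podman as the container runtime). From a node with the admin keyring,
refresh and check what cephadm currently sees on that host:

# ceph orch ps host8 --refresh

Then on host8 itself, check the daemon directories cephadm bases its
inventory on, and whether a container for osd.6 is still around:

host8:~ # ls /var/lib/ceph/543967bc-e586-32b8-bd2c-2d8b8b168f02/
host8:~ # podman ps | grep osd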

On Fri, Jul 25, 2025 at 8:01 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi *,
>
> an unexpected issue occurred today, at least twice, so it seems kind
> of reproducible. I've been preparing a demo in a (virtual) lab cluster
> (19.2.2) and wanted to drain multiple hosts. The first time I didn't
> pay much attention, but the draining seemed stuck (kind of a common
> issue these days), so I intervened and cleaned up until I got into a
> healthy state, all good. Then I did my thing, changed the crush tree,
> added the removed hosts again, cephadm created the OSDs, backfill
> finished successfully.
>
> Now I wanted to reset the cluster again to my starting point, so I
> issued the drain command again for multiple hosts (each host has 2
> OSDs):
>
> # for i in {5..8}; do ceph orch host drain host$i; done
>
> This time all OSDs were drained successfully (I watched 'ceph orch osd
> rm status'), so I wanted to remove the hosts, but it failed:
>
> # for i in {5..8}; do ceph orch host rm host$i --rm-crush-entry; done
> Removed  host 'host5'
> Removed  host 'host6'
> Removed  host 'host7'
> Error EINVAL: Not allowed to remove host8 from cluster. The following
> daemons are running in the host:
> type                 id
> -------------------- ---------------
> osd                  6
>
> Please run 'ceph orch host drain host8' to remove daemons from host
>
>
> But there was nothing left to drain anymore, and osd.6 had already
> been removed from the crush tree successfully. On host8, however,
> there was still a daemon I had to clean up manually:
>
> host8:~ # cephadm rm-daemon --name osd.6 --fsid
> 543967bc-e586-32b8-bd2c-2d8b8b168f02 --force
>
> I compared the cephadm.log files (3 out of 4 to-be-drained hosts were
> successfully drained) and on host8 the command rm-daemon was never
> executed (until I did it manually). Is this a known issue? It doesn't
> seem to happen with only one host, at least I didn't notice it in the
> past. Should I create a tracker for this?
>
> Thanks,
> Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



