Re: Cephadm upgrade from 16.2.15 -> 17.2.0


 



I’m now seeing this:

  cluster:
    id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.cost_capacity

I’m assuming this is because I’ve only upgraded the mgr daemons so far, but I wanted to double-check before proceeding with the rest of the components.
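
For reference, commands along these lines should show more detail about why the spec failed to apply (just a sketch, not output from this cluster):

# Details on the warning and the failing service
ceph health detail

# Dump the current OSD service specs
ceph orch ls osd --export

# The cephadm log usually contains the actual apply error
ceph log last cephadm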

Thanks
-jeremy




On Sunday, Apr 13, 2025 at 12:59 AM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
Updating the mgrs to 18.2.5 seemed to work just fine.  I will go for the remaining services after the weekend.  Thanks.

-jeremy



On Thursday, Apr 10, 2025 at 6:37 AM, Eugen Block <eblock@xxxxxx> wrote:
Glad I could help! I'm also waiting for 18.2.5 to upgrade our own
cluster from Pacific after getting rid of our cache tier. :-D

Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:

This seems to have worked to get the orch back up and put me back to
16.2.15. Thank you. Debating whether to wait for 18.2.5 before moving forward.
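
For what it's worth, a quick way to confirm everything really is back on 16.2.15 (just a sketch):

# All daemons should report 16.2.15 again
ceph versions

# And the MGRs specifically
ceph orch ps | grep mgr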

-jeremy

On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <eblock@xxxxxx> wrote:
Still no; just edit the unit.run file for the MGRs to use a different
image. See Frédéric's instructions (now that I'm re-reading them, I
notice a little mistake with dots and hyphens):

# Backup the unit.run file
$ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}

# Change container image's signature. You can get the signature of the
# version you want to reach from https://quay.io/repository/ceph/ceph?tab=tags.
# It's in the URL of a version.
$ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run

# Restart the container (systemctl daemon-reload not needed)
$ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service

# Run this command a few times and it should show the new version
$ ceph orch ps --refresh --hostname ceph01 | grep mgr

To get the image signature, you can also look into the other unit.run
files; a version tag would also work.
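
For instance, something along these lines should list the image digests the existing daemons already reference (a rough sketch; adjust the pattern if your unit.run files use tags instead of digests):

grep -ho 'quay.io/ceph/ceph@sha256:[a-f0-9]*' /var/lib/ceph/$(ceph fsid)/*/unit.run | sort -u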

It depends on how often you need the orchestrator to maintain the
cluster. If you have the time, you could wait a bit longer for other
responses. If you need the orchestrator in the meantime, you can roll
back the MGRs.

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:

Thank you. The only thing I’m unclear on is the rollback to Pacific.

Are you referring to



https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

Thank you. I appreciate all the help. Should I wait for Adam to
comment? At the moment, the cluster is functioning well enough to
keep the VMs running, so if it’s wise to wait, I can do that.

-jeremy

On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block <eblock@xxxxxx> wrote:
I haven't tried it this way yet, and I had hoped that Adam would chime
in, but my approach would be to remove this key (it's not present when
no upgrade is in progress):

ceph config-key rm mgr/cephadm/upgrade_state

Then roll back the two newer MGRs to Pacific as described before. If
they come up healthy, first test whether the orchestrator works properly.
For example, remove a node-exporter or crash daemon or anything else
non-critical and let it redeploy.
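
For example (the daemon name is hypothetical; pick one from your own 'ceph orch ps' output):

# Pick an uncritical daemon and have cephadm redeploy it
ceph orch ps | grep node-exporter
ceph orch daemon redeploy node-exporter.cn01

# It should cycle briefly and come back as 'running'
ceph orch ps --refresh | grep node-exporter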
If that works, try a staggered upgrade, starting with the MGRs only:

ceph orch upgrade start --image <image-name> --daemon-types mgr

Since there's no need to go to Quincy, I suggest upgrading to Reef
18.2.4 (or waiting until 18.2.5 is released, which should be very
soon), so set the respective <image-name> in the above command.
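
For example (the Reef tag below is just an illustration; substitute whichever release you settle on):

# Staggered upgrade of the MGR daemons only
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types mgr

# Watch progress
ceph orch upgrade status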

If all three MGRs upgrade successfully, you can continue with the
MONs, or with everything else.

In production clusters, I usually do staggered upgrades, e.g. I limit
the number of OSD daemons first just to see if they come up healthy,
then let it upgrade all the other OSDs automatically (see the sketch
after the link below).

https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
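
A sketch of that workflow (the image tag and limit are arbitrary examples):

# Upgrade only a few OSD daemons first
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 3

# If they come up healthy, run the upgrade again without filters for the rest
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4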

Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:

Snipped some of the irrelevant logs to keep message size down.

ceph config-key get mgr/cephadm/upgrade_state

{"target_name": "quay.io/ceph/ceph:v17.2.0", "progress_id":
"e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":
"e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
"target_digests":


["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services": null, "total_count":
null,
"remaining_count":
null}

What should I do next?

Thank you!
-jeremy

On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <eblock@xxxxxx> wrote:
Can you check if you have this config-key?

ceph config-key get mgr/cephadm/upgrade_state

If you reset the MGRs, it might be necessary to clear this key;
otherwise you might end up in some inconsistency. Just to be sure.
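
If you do clear it, it's probably worth stashing the value first so it could be restored if needed (the output path is just an example):

# Keep a copy of the key before removing it
ceph config-key get mgr/cephadm/upgrade_state > /root/upgrade_state.json.bak
ceph config-key rm mgr/cephadm/upgrade_state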

Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:

Thanks. I’m trying to be extra careful since this cluster is
actually in use. I’ll wait for your feedback.

-jeremy

On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <eblock@xxxxxx> wrote:
No, that's not necessary; just edit the unit.run file for the MGRs to
use a different image. See Frédéric's instructions:




https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

But I'm not entirely sure if you need to clear some config-keys first
in order to reset the upgrade state. If I have time, I'll try to check
tomorrow, or on Monday.

Zitat von Jeremy Hansen <jeremy@xxxxxxxxxx>:

Would I follow this process to downgrade?





https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

Thank you

On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
ceph -s claims things are healthy:

ceph -s
  cluster:
    id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
    mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
    mds: 1/1 daemons up, 2 standby
    osd: 15 osds: 15 up (since 19h), 15 in (since 14M)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 610 pgs
    objects: 284.59k objects, 1.1 TiB
    usage:   3.3 TiB used, 106 TiB / 109 TiB avail
    pgs:     610 active+clean

  io:
    client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
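
(Cluster health and orchestrator availability are separate things; something like the following should show whether the orchestrator backend is actually up and where it fails. Just a sketch; the mgr unit name is taken from the output above.)

# Is the orchestrator backend available?
ceph orch status

# Module load errors show up in the active MGR's journal
journalctl -u ceph-$(ceph fsid)@mgr.cn03.negzvb.service --since "1 hour ago" | grep -i 'Failed to'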




How do I downgrade if the orch is down?

Thank you
-jeremy



On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <eblock@xxxxxx> wrote:
It would help if you only pasted the relevant parts. Anyway, these
two sections stand out:

---snip---
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade = CephadmUpgrade(self)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: return cls(**c)
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')

Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.client = SnapSchedClient(self)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: with self.get_schedule_db(fs_name) as conn_mgr:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: db.executescript(dump)
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
---snip---

Your cluster seems to be in an error state (ceph -s) because of an
unknown PG. It's recommended to have a healthy cluster before
attempting an upgrade. It's possible that these errors come from the
MGR that hasn't been upgraded yet; I'm not sure.

Since the upgrade was only successful for two MGRs, I am thinking
about downgrading both MGRs back to 16.2.15, then retrying an upgrade
to a newer version, either 17.2.8 or 18.2.4. I haven't checked the
snap_schedule error yet, though. Maybe someone else knows that already.












