For posterity: cct->_conf->osd_fast_shutdown_timeout OSD errors / Run Full Recovery from ONodes (might take a while) during Reef 18.2.1 to 18.2.7 upgrade

Posting this for posterity, in case someone runs into it down the line and finds it in the archives when trying to figure out what the heck is going on.

On a Reef 18.2.1 cluster, when nodes reboot, some OSDs hit the assertion failure below:


    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 7f1286b61700 time 2025-08-24T19:57:36.343925+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc: 4495: FAILED ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout)\n”,


which in turn seems to lead to the following at the next OSD startup:

2025-08-25T00:04:29.669+0000 7faa68c20740  1 freelist _read_cfg
2025-08-25T00:04:29.881+0000 7faa68c20740  1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
2025-08-25T00:04:29.881+0000 7faa68c20740  0 bluestore(/var/lib/ceph/osd/ceph-29) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) …
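If you want to confirm you are hitting the same thing, the crash module and the OSD log are good places to look.  A rough sketch — osd.29 / ceph-29 is just the example OSD from the log above, and the exact log path depends on how your OSDs are deployed:

# List crashes that have not been archived yet, then inspect one
ceph crash ls-new
ceph crash info <crash-id>    # look for the osd_fast_shutdown_timeout assert

# Look for the allocator recovery messages in the OSD log
grep -E 'restore_allocator|Full Recovery from ONodes' /var/log/ceph/ceph-osd.29.log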


After some research, my understanding is that the root cause is addressed in 18.2.6.

My sense is that once this 18.2.1 cluster is completely on 18.2.7, the problem should fade.  During the upgrade, however, each occurrence complicates the process: archiving crashes, waiting an hour or two (20TB spinners) for the above-described full recovery at OSD startup, etc.
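For reference, the per-reboot cleanup looks roughly like this (a sketch; osd.29 is again just the example OSD, and "ceph osd tree down" simply filters the tree to down OSDs):

# Review and then archive the shutdown-assert crashes so the health warning clears
ceph crash archive-all

# Watch the affected OSD finish the full recovery and come back up
ceph osd tree down
ceph -s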

I’ve found that doubling the default timeout (15 seconds):

ceph config set global osd_fast_shutdown_timeout 30

is making a dramatic difference.  After setting the above, the upgrade is progressing nicely, as expected.
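In case it is useful, the override is easy to verify, and easy to drop again once everything is on 18.2.7 if you want to return to the default (a sketch, assuming no other per-OSD overrides are in play):

# Confirm the OSDs pick up the new value
ceph config get osd osd_fast_shutdown_timeout

# After the upgrade, remove the override to fall back to the default
ceph config rm global osd_fast_shutdown_timeout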





