Hi,
We have a production cluster made of 3 mon+mgr nodes, 18 OSD servers and
~500 OSDs, configured with ~50 pools, half EC (9+6) and half replica 3.
It also hosts 2 CephFS filesystems with 1 MDS each.
Two days ago, over a period of 16 hours, 13 OSDs crashed with OOM errors.
The OSDs were first restarted, but it was then decided to reboot the
servers hosting the crashed OSDs, and "by mistake" (it was at least
useless) the noout and norebalance flags were set for the OSDs of each
rebooted server before the reboot. The flags were removed after the
reboot.
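For completeness, the flags were set and removed roughly like this (the
host placeholder is illustrative):

[root@ijc-mon1 ~]# ceph osd set-group noout,norebalance <rebooted-host>
... reboot of the server ...
[root@ijc-mon1 ~]# ceph osd unset-group noout,norebalance <rebooted-host>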
After all of this, 'ceph -s' started to report a lot of misplaced PGs
and recovery started. All the PGs but one were successfully reactivated;
one stayed in the activating+remapped state (it belongs to a pool used
for tests). 'ceph health' (I am not including the details here to keep
this mail short, but I can share them) says:
HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2 MDSs
report slow metadata IOs; Reduced data availability: 1 pg inactive; 13
daemons have recently crashed
and reports one of the filesystems as degraded, even though the only PG
reported inactive is not part of a pool related to that FS.
The recovery was slow until we realized we should change the mclock
profile to high_recovery_ops. Then it completed in a few hours.
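For the record, the profile change was done with something like:

[root@ijc-mon1 ~]# ceph config set osd osd_mclock_profile high_recovery_ops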
Unfortunately the degraded filesystem remains degraded without an
obvious reason... and the inactive PG is still in the
activating+remapped state. We have not been able to identify a relevant
error in the logs so far (but we may have missed something...).
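In case the OOM crash reports themselves are useful, we can pull them
with:

[root@ijc-mon1 ~]# ceph crash ls
[root@ijc-mon1 ~]# ceph crash info <crash-id>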
So far we have avoided restarting too many things until we have a better
understanding of what happened and of the current state. We only
restarted the mgr, which was using a lot of CPU, and the MDS of the
degraded FS, without any improvement.
We are looking for advice about where to start... It seems we have (at
least) 2 independent problems:
- A PG that cannot be reactivated because its remap operation doesn't
proceed: would stopping osd.17 help (so that osd.460 is used again)?
See the commands sketched after this list.
[root@ijc-mon1 ~]# ceph pg dump_stuck
PG_STAT  STATE                UP            UP_PRIMARY  ACTING         ACTING_PRIMARY
32.7ef   activating+remapped  [100,154,17]  100         [100,154,460]  100
- A degraded filesystem: where should we look for the cause? See the
second sketch below.
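For the stuck PG, unless advised otherwise, we were thinking of
something along these lines (the cluster is cephadm-managed, hence the
orch command):

[root@ijc-mon1 ~]# ceph pg 32.7ef query
[root@ijc-mon1 ~]# ceph orch daemon stop osd.17

i.e. first look at the peering/recovery state reported by the query,
then stop osd.17 so that the PG falls back to osd.460. Does that sound
reasonable, or is it risky?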
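For the degraded filesystem, would the output of the following help
pinpoint the problem? We can post it if so.

[root@ijc-mon1 ~]# ceph fs status
[root@ijc-mon1 ~]# ceph health detail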
Thanks in advance for any help!
Cheers,
Michel