OSD failed: still recovering


 



Hello,

We had a drive (OSD) fail in our 5-node cluster three days ago (late afternoon of Mar 20).  The PGs have sorted themselves out, but the cluster has been recovering with backfill since then.  Every time I run 'ceph -s' it shows a little over 5% misplaced objects, with several PGs in backfill_wait and some scrubbing.

What is sort of weird is that if I run 'ceph -s' a few times in a row, I can see the percentage of misplaced objects go down a bit, but if I leave it for a while and run 'ceph -s' again, it is still just over 5% misplaced objects and has typically slightly increased.

For example, it might be 5.364% when I check it, and after checking it several times in a row it might go down to 5.276%, but if I check again after a few hours it might be something like 5.478% (so still in the 5% range, but slightly higher than the last check).
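In case it helps, rather than spot-checking I've been logging the trend with something like the following (plain `watch` and `grep`; the exact wording of the misplaced line in 'ceph -s' output can vary slightly between Ceph releases):

```
# Print the misplaced-objects line from 'ceph -s' once a minute,
# so the up/down drift is visible over time
watch -n 60 "ceph -s | grep misplaced"
```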

The cluster is on a 10 Gbit network, and I have increased osd_max_backfills to 4 while the recovery runs, but it just doesn't seem to be making much progress.
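For reference, I raised the backfill limit at runtime with something like the commands below (the `ceph config set` form assumes a reasonably recent release; on older clusters the injectargs form does the same thing):

```
# Raise concurrent backfills per OSD from the default (1) to 4
ceph config set osd osd_max_backfills 4

# Or, on older releases, inject it into the running daemons:
ceph tell 'osd.*' injectargs '--osd-max-backfills 4'
```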

I know the failed drive needs to be replaced, but I believe it is recommended to wait until the cluster has finished recovering?

Your thoughts/advice (as usual) are greatly appreciated.

Sent from my mobile device.  Please excuse brevity and typos.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


