Reducing the OSD Heartbeat Grace & Interval

Alexander Hussein-Kershaw <alexander.husseinkershaw@xxxxxxxxxxx> · Tue, 9 Sep 2025 12:56:07 +0000

Hi Folks,

I'm running Ceph on VMs in Azure. Occasionally, a maintenance event will take a VM down (typically it will freeze the VM for 30s). There are some limited controls over this on Azure, but they are pretty lackluster.

I'm nervous about the default 15s heartbeat grace period and the heartbeat interval. As I understand it, an OSD is marked down after 15s without responding to heartbeats, but until that point I expect the unresponsive OSD to block all writes to the PGs involved, for the duration of the grace period.

I'm considering lowering the 15s heartbeat grace to reduce the impact of this, with the aim of removing an unresponsive OSD from the cluster faster.
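To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is only a model under stated assumptions, not how Ceph computes anything: it assumes writes to the affected PGs stall until either the frozen VM resumes or the grace period expires, whichever comes first, and it ignores monitor processing, peering time, and any grace adjustment the monitors may apply. The function names are hypothetical.

```python
def write_stall_seconds(freeze_s: float, grace_s: float) -> float:
    """Rough stall window for writes to PGs on the frozen OSD.

    Assumption: the stall ends when the VM resumes (freeze_s) or when
    peers give up on the OSD (grace_s), whichever happens first.
    """
    return min(freeze_s, grace_s)


def osd_marked_down(freeze_s: float, grace_s: float) -> bool:
    """Assumption: the OSD is marked down only if the freeze outlasts the grace."""
    return freeze_s > grace_s


if __name__ == "__main__":
    freeze_s = 30.0  # typical Azure maintenance freeze mentioned above
    for grace_s in (15.0, 5.0, 1.0):
        stall = write_stall_seconds(freeze_s, grace_s)
        down = osd_marked_down(freeze_s, grace_s)
        print(f"grace={grace_s:>4}s  stall~{stall:>4}s  marked_down={down}")
```

Under this model a shorter grace shrinks the stall for a 30s freeze, but it also means that shorter hiccups (anything longer than the grace) would now get an OSD marked down, which is the cost to weigh.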

Hoping to get a feel for this before I consider it further.

  * Is this a totally stupid idea?
  * Has anyone had any experience tweaking these parameters?
  * How low is reasonable to go? I set the interval and grace to 1 sec and the cluster seems stable, but I've not run a load test yet.
  * Does the heartbeat have a dedicated thread, or is it prone to being blocked behind other traffic?
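On the "how low" question, one way to reason about a candidate setting is to count how many heartbeat intervals fit inside the grace window. The sketch below is plain arithmetic, not Ceph logic: it assumes pings go out at a fixed interval with no jitter, and the helper names and the non-1s example values are hypothetical.

```python
import math


def pings_within_grace(interval_s: float, grace_s: float) -> int:
    """Number of whole heartbeat intervals that fit in the grace window."""
    if interval_s <= 0 or grace_s <= 0:
        raise ValueError("interval and grace must be positive")
    return math.floor(grace_s / interval_s)


def spare_pings(interval_s: float, grace_s: float) -> int:
    """Pings that could be lost or delayed before the grace expires (never negative)."""
    return max(pings_within_grace(interval_s, grace_s) - 1, 0)


if __name__ == "__main__":
    # (interval, grace) pairs in seconds; only the 1s/1s pair comes from the post.
    for interval_s, grace_s in ((1.0, 1.0), (1.0, 5.0), (2.0, 10.0)):
        print(
            f"interval={interval_s}s grace={grace_s}s "
            f"pings={pings_within_grace(interval_s, grace_s)} "
            f"spare={spare_pings(interval_s, grace_s)}"
        )
```

With interval and grace both at 1 sec there is no spare ping at all in this model, so a single delayed heartbeat under load could be enough to trip the grace. That is a reason to load test before trusting the setting.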

Many thanks,
Alex