On 4/9/25 4:13 PM, Christoph Hellwig wrote:
> On Tue, Apr 08, 2025 at 07:37:48PM +0530, Nilay Shroff wrote:
>>>> +         * For non-fabric controllers we support delayed removal of head disk
>>>> +         * node. If we reached up to here then it means that head disk is still
>>>> +         * alive and so we assume here that even if there's no path available
>>>> +         * maybe due to the transient link failure, we could queue up the IO
>>>> +         * and later when path becomes ready we re-submit queued IO.
>>>> +         */
>>>> +        if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
>>>> +                return true;
>>>
>>> Why is this conditional on fabrics or not? The same rationale should
>>> apply as much if not more for fabrics controllers.
>>>
>> For fabrics we already have options like "reconnect_delay" and
>> "max_reconnects". So in case of fabric link failures, we delay
>> the removal of the head disk node based on those options.
>
> Yes. But having entirely different behavior for creating a multipath
> node and removing it still seems like a rather bad idea.
>
Yes, agreed. But as you know, for NVMeoF the connection is retried (for
instance when the fabric link goes down) in the respective fabric driver
layer. For NVMe over PCIe, however, our proposal is to handle the delayed
removal of the multipath head node in the NVMe multipath (stacked) driver
layer.

If we want to unify this behavior across PCIe and fabric controllers, we
could allow setting the delayed-removal-of-head-node option for fabric
connections as well. However, since the fabrics drivers already support
such behavior at the driver layer (by virtue of the reconnect_delay and
max_reconnects options), IMO it may not be worthwhile to add this (delayed
removal of the multipath head node) support for fabric connections.

If the currently proposed implementation seems too conditional on fabrics,
we could change it so that nvme_available_path() has no explicit fabrics
check, and it could then be rewritten as follows:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0f54889bd483..47a3723605f6 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -428,16 +428,6 @@ static bool nvme_available_path(struct nvme_ns_head *head)
         if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
                 return NULL;
 
-        /*
-         * For non-fabric controllers we support delayed removal of head disk
-         * node. If we reached up to here then it means that head disk is still
-         * alive and so we assume here that even if there's no path available
-         * maybe due to the transient link failure, we could queue up the IO
-         * and later when path becomes ready we re-submit queued IO.
-         */
-        if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
-                return true;
-
         list_for_each_entry_srcu(ns, &head->list, siblings,
                                  srcu_read_lock_held(&head->srcu)) {
                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
@@ -451,6 +441,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
                         break;
                 }
         }
+
+        if (head->delayed_shutdown_sec)
+                return true;
+
         return false;
 }
 
 static void nvme_ns_head_submit_bio(struct bio *bio)

Please note that head->delayed_shutdown_sec is exported to user space via
sysfs. Its default value is zero, so delayed removal of the head node is
only allowed when head->delayed_shutdown_sec is set to a non-zero value.
Moreover, as discussed above, we may not want to export this option to
user space when the head node refers to a fabric connection, so it would
be available exclusively for non-fabric multipath connections.

Thanks,
--Nilay
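
P.S. For reference, below is a rough sketch of how the proposed
delayed_shutdown_sec sysfs attribute on the head disk could be wired up.
This is illustrative only and not the actual patch: the attribute name,
the field in struct nvme_ns_head, the way the head is looked up from the
device, and the -EOPNOTSUPP return for fabric heads are all assumptions
based on the discussion above.

/* Illustrative sketch only, not the actual patch. */
static ssize_t delayed_shutdown_sec_show(struct device *dev,
                struct device_attribute *attr, char *buf)
{
        struct nvme_ns_head *head = dev_to_disk(dev)->private_data;

        return sysfs_emit(buf, "%u\n", head->delayed_shutdown_sec);
}

static ssize_t delayed_shutdown_sec_store(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t count)
{
        struct nvme_ns_head *head = dev_to_disk(dev)->private_data;
        unsigned int sec;
        int ret;

        ret = kstrtouint(buf, 0, &sec);
        if (ret < 0)
                return ret;

        /* Only meaningful for non-fabric (PCIe) multipath head nodes. */
        if (test_bit(NVME_NSHEAD_FABRICS, &head->flags))
                return -EOPNOTSUPP;

        head->delayed_shutdown_sec = sec;
        return count;
}
static DEVICE_ATTR_RW(delayed_shutdown_sec);

With something like the above added to the head disk's attribute group, a
user could then opt in with, for example (hypothetical device name and
value):

  echo 60 > /sys/block/nvmeXnY/delayed_shutdown_sec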