Hi Rafael, what's your opinion on this race condition in ACPI hotplug?

On Fri, Feb 28, 2025 at 11:16:01AM -0600, Bjorn Helgaas wrote:
> On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> > Ignore the eject notification when the previous ejection is still in
> > progress to prevent multiple _EJ0 invocations when the ejection
> > completes.
> >
> > The first _EJ0 informs the platform to actually eject the device and
> > frees the slot for other devices. So a subsequent _EJ0 may
> > accidentally eject the new device that has just been plugged into
> > this slot. We need to avoid this.
> >
> > For each acpi_device, this patch introduces a new field `ejecting`,
> > which is set before queuing work on `kacpi_hotplug_wq` and reset
> > before invoking _EJ0. Every notification we receive before invoking
> > _EJ0 is targeted at the old device. This ensures we don't miss any
> > notifications for a newly plugged device. And a new flag
> > `should_dedup_eject` allows each driver to implement this feature
> > gradually. A driver should set this flag on initialization if it
> > will reset `ejecting`.
>
> Which drivers do you have in mind when you say "each driver can
> implement this feature gradually"?  You set should_dedup_eject in
> acpiphp, so I guess you mean other ACPI hotplug drivers?  I see
> acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
> those are the only other current ones?

Yes, I mean other ACPI hotplug drivers. I searched for callers of
acpi_evaluate_ej0(), because `ejecting` should be reset just before it,
and found dock.c and scan.c. All of them are called from
acpi_device_hotplug(), so maybe I can change all of them at once. Should
we also consider out-of-tree drivers?

As for surface3-wmi.c, its s3_wmi_hp_notify() function seems to only
probe the lid state; it does not actually support hot unplug.
But ata_acpi_detach_device() does not seem to call acpi_evaluate_ej0().
From f730ae183863 ("libata: remove functions now handed by ACPI dock
driver"), it seems that libata relies on the dock driver to call _EJ0
for it.

> > This fix is not perfect. If we receive an eject notification just
> > after resetting `ejecting` but before _EJ0, we will still invoke
> > _EJ0 twice. However, this seems to be the best solution available,
> > and it strictly improves the current situation.
> >
> > Another potential fix is to add an `ejected` flag to each device and
> > not invoke _EJ0 for already ejected devices. However, this approach
> > risks losing synchronization with the platform if something else
> > goes wrong, potentially preventing the device from ever being
> > ejected. And we would need to check with the bus driver to make sure
> > the device was really ejected. But that check is still racy, so we
> > cannot rule out extra _EJ0 invocations either.
>
> I see the problem.  Thanks for the detailed explanation and details
> about reproducing it and the trace.
>
> I'm not sure whether the platform should reissue the Bus Check
> notification based on the fact that the OS hasn't invoked _EJ0 in some
> arbitrary time.  That seems a little bit presumptuous because, for
> example, the platform can't know how long it will take to write out
> the dirty page cache.  The raciness of the workaround seems
> troublesome to me.
>
> But this is all really an ACPI issue, not a PCI issue, so I'd like to
> defer to the ACPI experts here.
>
> > Signed-off-by: Weiwen Hu <huweiwen@xxxxxxxxxxxxxxxxx>
> > ---
> >  drivers/acpi/osl.c                 |  6 ++++++
> >  drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
> >  include/acpi/acpi_bus.h            |  4 +++-
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > We observed that umount can take an extremely long time if there is
> > a lot of dirty page cache.
> > umount will take the s_umount semaphore, which will block the
> > ejecting process:
> >
> > __schedule+0x1e0/0x630
> > ? kernfs_put.part.0+0xd4/0x1a0
> > schedule+0x46/0xb0
> > rwsem_down_read_slowpath+0x16b/0x490
> > __get_super.part.0+0xc1/0xe0
> > fsync_bdev+0x11/0x60
> > invalidate_partition+0x5c/0xa0
> > del_gendisk+0x103/0x2e0
> > virtblk_remove+0x27/0xa0
> > virtio_dev_remove+0x36/0x90
> > __device_release_driver+0x172/0x260
> > device_release_driver+0x24/0x30
> > bus_remove_device+0xf6/0x170
> > device_del+0x19b/0x450
> > device_unregister+0x16/0x60
> > unregister_virtio_device+0x11/0x20
> > virtio_pci_remove+0x31/0x60
> > pci_device_remove+0x38/0xa0
> > __device_release_driver+0x172/0x260
> > device_release_driver+0x24/0x30
> > pci_stop_bus_device+0x6c/0x90
> > pci_stop_and_remove_bus_device+0xe/0x20
> > disable_slot+0x49/0x90
> > acpiphp_disable_and_eject_slot+0x15/0x90
> >
> > While the OS is not invoking _EJ0 in a timely manner, the user (or
> > hypervisor) may retry by issuing another notification, which will be
> > queued in kacpi_hotplug_wq. After the umount is finally done, _EJ0
> > will be invoked. Then, if there are devices pending attach, the
> > hypervisor may choose to attach one immediately to the same slot.
> > The new device can then be ejected unintentionally by the queued
> > ejecting process.
> >
> > On Alibaba Cloud, we re-issue the notification around every 10s if
> > the OS does not respond. (BTW, do you think the platform is allowed
> > to re-issue the notification on timeout?)
> > We can easily reproduce this issue on Alibaba Cloud ECS:
> >
> > WRITE_SIZE=2300M # tune this value so that the umount takes 20s
> > # replace these disk serial numbers
> > DISK_DETACH=bp142xxxxxxxxxxxxxxx # pre-formatted
> > DISK_ATTACH=bp109xxxxxxxxxxxxxxx # any
> > # start from these two disks detached
> >
> > INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> > echo "instance id: $INSTANCE_ID"
> > DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> >
> > echo "attaching disk d-$DISK_DETACH"
> > aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 2
> > mkdir -p /mnt/slow
> > mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> > echo "mounted d-$DISK_DETACH to /mnt/slow"
> >
> > rm -f /mnt/slow/zero
> > echo "populating dirty cache"
> > head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero
> >
> > echo umounting
> > (
> >     umount /mnt/slow
> >     echo umounted
> > )&
> >
> > sleep 2
> > echo "detaching disk d-$DISK_DETACH"
> > aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 10
> > echo "attaching disk d-$DISK_ATTACH"
> > aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> >
> > sleep 7
> > wait
> > for _ in {1..10}; do
> >     sleep 1
> >     if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> >         echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> >         exit 0
> >     fi
> >     echo "disk d-$DISK_ATTACH not found yet"
> > done
> >
> > echo "issue reproduced"
> > exit 1
> >
> > And here is the trace we got from `perf trace` while running the
> > above script on an unpatched kernel:
> >
> > [starting detach]
> > 48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > 48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > [blocked, retrying detach]
> > 58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > 58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > [detach done]
> > 62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > [another device attaching]
> > 62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> > 62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> > 63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > [the new device is ejected unintentionally]
> > 63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > [the actual attach task, which scanned the bus but found nothing]
> > 63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> >
> > With this patch, the acpi_hotplug_schedule at 58023.881 will be
> > skipped, suppressing the acpi_evaluate_ej0 at 63042.520.
> >
> > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > index 5ff343096ece..f041c4db10f7 100644
> > --- a/drivers/acpi/osl.c
> > +++ b/drivers/acpi/osl.c
> > @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
> >  {
> >  	struct acpi_hp_work *hpw;
> >
> > +	if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> > +	    && atomic_xchg(&adev->hp->ejecting, 1)) {
> > +		put_device(&adev->dev);
> > +		return AE_OK;
> > +	}
> > +
> >  	acpi_handle_debug(adev->handle,
> >  			  "Scheduling hotplug event %u for deferred handling\n",
> >  			  src);
> > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > index 5b1f271c6034..3c50f2af1584 100644
> > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
> >  	context->hp.notify = acpiphp_hotplug_notify;
> >  	context->hp.fixup = acpiphp_post_dock_fixup;
> >  	acpi_set_hp_context(adev, &context->hp);
> > +	adev->flags.should_dedup_eject = true;
> >  	return context;
> >  }
> >
> > @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
> >  	}
> >  }
> >
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
> >
> >  static void hotplug_event(u32 type, struct acpiphp_context *context)
> >  {
> > @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
> >  	case ACPI_NOTIFY_EJECT_REQUEST:
> >  		/* request device eject */
> >  		acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> > -		acpiphp_disable_and_eject_slot(slot);
> > +		acpiphp_disable_and_eject_slot(&context->hp, slot);
> >  		break;
> >  	}
> >
> > @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
> >
> >  /**
> >   * acpiphp_disable_and_eject_slot - power off and eject slot
> > + * @hp: the context that received eject notification, can be NULL
> >   * @slot: ACPI PHP slot
> >   */
> > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > +static int
> > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
> >  {
> >  	struct acpiphp_func *func;
> >
> > @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> >  	/* unconfigure all functions */
> >  	disable_slot(slot);
> >
> > +	if (hp)
> > +		atomic_set(&hp->ejecting, 0);
> > +
> >  	list_for_each_entry(func, &slot->funcs, sibling)
> >  		if (func->flags & FUNC_HAS_EJ0) {
> >  			acpi_handle handle = func_to_handle(func);
> > @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
> >  	 */
> >  	acpi_scan_lock_acquire();
> >  	pci_lock_rescan_remove();
> > -	ret = acpiphp_disable_and_eject_slot(slot);
> > +	ret = acpiphp_disable_and_eject_slot(NULL, slot);
> >  	pci_unlock_rescan_remove();
> >  	acpi_scan_lock_release();
> >  	return ret;
> > diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> > index aad1a95e6863..870c1ffe47c9 100644
> > --- a/include/acpi/acpi_bus.h
> > +++ b/include/acpi/acpi_bus.h
> > @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
> >
> >  struct acpi_hotplug_context {
> >  	struct acpi_device *self;
> > +	atomic_t ejecting;
> >  	acpi_hp_notify notify;
> >  	acpi_hp_uevent uevent;
> >  	acpi_hp_fixup fixup;
> > @@ -215,7 +216,8 @@ struct acpi_device_flags {
> >  	u32 cca_seen:1;
> >  	u32 enumeration_by_parent:1;
> >  	u32 honor_deps:1;
> > -	u32 reserved:18;
> > +	u32 should_dedup_eject:1;
> > +	u32 reserved:17;
> >  };
> >
> >  /* File System */
> > --
> > 2.48.1
> >