Re: [PATCH] ACPI/PCI: deduplicate concurrent eject notify

On Fri, Apr 11, 2025 at 11:12 AM Weiwen Hu <huweiwen@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Rafael, what's your opinion on this race condition in ACPI hotplug?

In the first place, the platform firmware should wait for an _OST
confirmation of an eject or failure.

I cannot spend more time on this today, though; I'll get to it next week.

> On Fri, Feb 28, 2025 at 11:16:01AM -0600, Bjorn Helgaas wrote:
> > On Mon, Feb 24, 2025 at 03:00:34PM +0800, Weiwen Hu wrote:
> > > Ignore the eject notification when the previous ejection is still in
> > > progress to prevent multiple _EJ0 invocations when the ejection completes.
> > >
> > > The first _EJ0 informs the platform to actually eject the device and frees
> > > the slot for other devices.  So a subsequent _EJ0 may accidentally eject a
> > > new device that has just been plugged into this slot.  We need to avoid this.
> > >
> > > For each acpi_device, this patch introduces a new field `ejecting`, which
> > > is set before enqueuing work on `kacpi_hotplug_wq` and cleared just before
> > > invoking _EJ0.  Every notification received before _EJ0 is invoked targets
> > > the old device, so we do not miss any notifications for a newly plugged
> > > device.  A new flag `should_dedup_eject` allows each driver to adopt this
> > > feature gradually.  A driver should set this flag at initialization if it
> > > clears `ejecting`.
> >
> > Which drivers do you have in mind when you say "each driver can
> > implement this feature gradually"?  You set should_dedup_eject in
> > acpiphp, so I guess you mean other ACPI hotplug drivers?  I see
> > acpi_hotplug_context mentioned in libata-acpi.c, surface3-wmi.c; maybe
> > those are the only other current ones?
>
> Yes, I mean other ACPI hotplug drivers.  I've searched for acpi_evaluate_ej0(),
> because `ejecting` should be cleared just before it is called.  I found dock.c
> and scan.c.  Both are called from acpi_device_hotplug(), so maybe I can change
> all of them at once.  Should we also consider out-of-tree drivers?
>
> As for surface3-wmi.c, its s3_wmi_hp_notify function seems to only probe the
> lid state and does not actually support hot unplug.
>
> But ata_acpi_detach_device() does not seem to call acpi_evaluate_ej0().
> Judging from f730ae183863 ("libata: remove functions now handed by ACPI dock
> driver"), libata relies on the dock driver to call _EJ0 on its behalf.
>
> > > This fix is not perfect.  If we receive an eject notification just after
> > > clearing `ejecting` but before _EJ0 runs, we will still invoke _EJ0 twice.
> > > However, this seems to be the best solution available, and it strictly
> > > improves on the current situation.
> > >
> > > Another potential fix is to add an `ejected` flag to each device and not
> > > invoke _EJ0 for already-ejected devices.  However, that approach risks
> > > losing synchronization with the platform if something else goes wrong,
> > > potentially leaving the device permanently un-ejectable.  We would also
> > > need to check with the bus driver that the device was really ejected
> > > successfully, but that check is itself racy, so it cannot guarantee the
> > > absence of extra _EJ0 invocations either.
> >
> > I see the problem.  Thanks for the detailed explanation and details
> > about reproducing it and the trace.
> >
> > I'm not sure whether the platform should reissue the Bus Check
> > notification based on the fact that the OS hasn't invoked _EJ0 in some
> > arbitrary time.  That seems a little bit presumptuous because, for
> > example, the platform can't know how long it will take to write out
> > the dirty page cache.  The raciness of the workaround seems
> > troublesome to me.
> >
> > But this is all really an ACPI issue, not a PCI issue, so I'd like to
> > defer to the ACPI experts here.
> >
> > > Signed-off-by: Weiwen Hu <huweiwen@xxxxxxxxxxxxxxxxx>
> > > ---
> > >  drivers/acpi/osl.c                 |  6 ++++++
> > >  drivers/pci/hotplug/acpiphp_glue.c | 15 +++++++++++----
> > >  include/acpi/acpi_bus.h            |  4 +++-
> > >  3 files changed, 20 insertions(+), 5 deletions(-)
> > >
> > > We observed that umount can take an extremely long time if there is a lot
> > > of dirty page cache.  umount takes the s_umount semaphore, which blocks
> > > the ejecting process:
> > >
> > >     __schedule+0x1e0/0x630
> > >     ? kernfs_put.part.0+0xd4/0x1a0
> > >     schedule+0x46/0xb0
> > >     rwsem_down_read_slowpath+0x16b/0x490
> > >     __get_super.part.0+0xc1/0xe0
> > >     fsync_bdev+0x11/0x60
> > >     invalidate_partition+0x5c/0xa0
> > >     del_gendisk+0x103/0x2e0
> > >     virtblk_remove+0x27/0xa0
> > >     virtio_dev_remove+0x36/0x90
> > >     __device_release_driver+0x172/0x260
> > >     device_release_driver+0x24/0x30
> > >     bus_remove_device+0xf6/0x170
> > >     device_del+0x19b/0x450
> > >     device_unregister+0x16/0x60
> > >     unregister_virtio_device+0x11/0x20
> > >     virtio_pci_remove+0x31/0x60
> > >     pci_device_remove+0x38/0xa0
> > >     __device_release_driver+0x172/0x260
> > >     device_release_driver+0x24/0x30
> > >     pci_stop_bus_device+0x6c/0x90
> > >     pci_stop_and_remove_bus_device+0xe/0x20
> > >     disable_slot+0x49/0x90
> > >     acpiphp_disable_and_eject_slot+0x15/0x90
> > >
> > > While the OS has not yet invoked _EJ0, the user (or hypervisor) may retry
> > > by issuing another notification, which will be queued on kacpi_hotplug_wq.
> > > After the umount finally completes, _EJ0 will be invoked.  Then, if a
> > > device is pending attach, the hypervisor may choose to attach it
> > > immediately to the same slot.  The new device can then be ejected
> > > unintentionally by the queued eject work.
> > >
> > > On Alibaba Cloud, we re-issue the notification around every 10s if the OS
> > > does not respond.  (BTW, do you think the platform is allowed to re-issue
> > > the notification on timeout?)
> > > We can easily reproduce this issue on Alibaba Cloud ECS:
> > >
> > >     WRITE_SIZE=2300M  # tune this value so that the umount takes 20s
> > >     # replace these disk serial numbers
> > >     DISK_DETACH=bp142xxxxxxxxxxxxxxx  # pre-formatted
> > >     DISK_ATTACH=bp109xxxxxxxxxxxxxxx  # any
> > >     # start from these two disks detached
> > >
> > >     INSTANCE_ID=$(curl -sS http://100.100.100.200/latest/meta-data/instance-id)
> > >     echo "instance id: $INSTANCE_ID"
> > >     DISK_PATH=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_
> > >
> > >     echo "attaching disk d-$DISK_DETACH"
> > >     aliyun ecs AttachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 2
> > >     mkdir -p /mnt/slow
> > >     mount "$DISK_PATH$DISK_DETACH" /mnt/slow
> > >     echo "mounted d-$DISK_DETACH to /mnt/slow"
> > >
> > >     rm -f /mnt/slow/zero
> > >     echo "populating dirty cache"
> > >     head -c $WRITE_SIZE /dev/zero > /mnt/slow/zero;
> > >
> > >     echo umounting
> > >     (
> > >             umount /mnt/slow
> > >             echo umounted
> > >     )&
> > >
> > >     sleep 2
> > >     echo "detaching disk d-$DISK_DETACH"
> > >     aliyun ecs DetachDisk --DiskId "d-$DISK_DETACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 10
> > >     echo "attaching disk d-$DISK_ATTACH"
> > >     aliyun ecs AttachDisk --DiskId "d-$DISK_ATTACH" --InstanceId "$INSTANCE_ID"
> > >
> > >     sleep 7
> > >     wait
> > >     for _ in {1..10}; do
> > >             sleep 1
> > >             if [ -e "$DISK_PATH$DISK_ATTACH" ]; then
> > >                     echo "disk d-$DISK_ATTACH attached, issue not reproduced"
> > >                     exit 0
> > >             fi
> > >             echo "disk d-$DISK_ATTACH not found yet"
> > >     done
> > >
> > >     echo "issue reproduced"
> > >     exit 1
> > >
> > > And here is the trace we got from `perf trace` while running the above script on an unpatched kernel:
> > >
> > >     [starting detach]
> > >      48202.244 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > >      48202.314 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > >      48203.690 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > >     [blocked, retrying detach]
> > >      58023.813 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 3)
> > >      58023.881 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 3)
> > >     [detach done]
> > >      62834.048 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > >     [another device attaching]
> > >      62954.686 kworker/0:0-ev/5 probe:acpi_ev_queue_notify_request(__probe_ip: -1452149680, notify_value: 1)
> > >      62954.828 kworker/0:0-ev/5 probe:acpi_hotplug_schedule(__probe_ip: -1452297040, src: 1)
> > >      63042.506 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 3)
> > >     [the new device is ejected unintentionally]
> > >      63042.520 kworker/u8:0-e/1946 probe:acpi_evaluate_ej0(__probe_ip: -1452291632)
> > >     [the actual attach task, scanned the bus but got nothing]
> > >      63266.555 kworker/u8:0-e/1946 probe:acpi_device_hotplug(__probe_ip: -1452251424, src: 1)
> > >
> > > With this patch, the acpi_hotplug_schedule at 58023.881 will be skipped to
> > > suppress the acpi_evaluate_ej0 at 63042.520.
> > >
> > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > index 5ff343096ece..f041c4db10f7 100644
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -1193,6 +1193,12 @@ acpi_status acpi_hotplug_schedule(struct acpi_device *adev, u32 src)
> > >  {
> > >     struct acpi_hp_work *hpw;
> > >
> > > +   if (src == ACPI_NOTIFY_EJECT_REQUEST && adev->flags.should_dedup_eject
> > > +                   && atomic_xchg(&adev->hp->ejecting, 1)) {
> > > +           put_device(&adev->dev);
> > > +           return AE_OK;
> > > +   }
> > > +
> > >     acpi_handle_debug(adev->handle,
> > >                       "Scheduling hotplug event %u for deferred handling\n",
> > >                        src);
> > > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > > index 5b1f271c6034..3c50f2af1584 100644
> > > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > > @@ -68,6 +68,7 @@ static struct acpiphp_context *acpiphp_init_context(struct acpi_device *adev)
> > >     context->hp.notify = acpiphp_hotplug_notify;
> > >     context->hp.fixup = acpiphp_post_dock_fixup;
> > >     acpi_set_hp_context(adev, &context->hp);
> > > +   adev->flags.should_dedup_eject = true;
> > >     return context;
> > >  }
> > >
> > > @@ -778,7 +779,8 @@ void acpiphp_check_host_bridge(struct acpi_device *adev)
> > >     }
> > >  }
> > >
> > > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
> > > +static int
> > > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot);
> > >
> > >  static void hotplug_event(u32 type, struct acpiphp_context *context)
> > >  {
> > > @@ -825,7 +827,7 @@ static void hotplug_event(u32 type, struct acpiphp_context *context)
> > >     case ACPI_NOTIFY_EJECT_REQUEST:
> > >             /* request device eject */
> > >             acpi_handle_debug(handle, "Eject request in %s()\n", __func__);
> > > -           acpiphp_disable_and_eject_slot(slot);
> > > +           acpiphp_disable_and_eject_slot(&context->hp, slot);
> > >             break;
> > >     }
> > >
> > > @@ -999,9 +1001,11 @@ int acpiphp_enable_slot(struct acpiphp_slot *slot)
> > >
> > >  /**
> > >   * acpiphp_disable_and_eject_slot - power off and eject slot
> > > + * @hp: the context that received eject notification, can be NULL
> > >   * @slot: ACPI PHP slot
> > >   */
> > > -static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > > +static int
> > > +acpiphp_disable_and_eject_slot(struct acpi_hotplug_context *hp, struct acpiphp_slot *slot)
> > >  {
> > >     struct acpiphp_func *func;
> > >
> > > @@ -1011,6 +1015,9 @@ static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> > >     /* unconfigure all functions */
> > >     disable_slot(slot);
> > >
> > > +   if (hp)
> > > +           atomic_set(&hp->ejecting, 0);
> > > +
> > >     list_for_each_entry(func, &slot->funcs, sibling)
> > >             if (func->flags & FUNC_HAS_EJ0) {
> > >                     acpi_handle handle = func_to_handle(func);
> > > @@ -1034,7 +1041,7 @@ int acpiphp_disable_slot(struct acpiphp_slot *slot)
> > >      */
> > >     acpi_scan_lock_acquire();
> > >     pci_lock_rescan_remove();
> > > -   ret = acpiphp_disable_and_eject_slot(slot);
> > > +   ret = acpiphp_disable_and_eject_slot(NULL, slot);
> > >     pci_unlock_rescan_remove();
> > >     acpi_scan_lock_release();
> > >     return ret;
> > > diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> > > index aad1a95e6863..870c1ffe47c9 100644
> > > --- a/include/acpi/acpi_bus.h
> > > +++ b/include/acpi/acpi_bus.h
> > > @@ -151,6 +151,7 @@ typedef void (*acpi_hp_fixup) (struct acpi_device *);
> > >
> > >  struct acpi_hotplug_context {
> > >     struct acpi_device *self;
> > > +   atomic_t ejecting;
> > >     acpi_hp_notify notify;
> > >     acpi_hp_uevent uevent;
> > >     acpi_hp_fixup fixup;
> > > @@ -215,7 +216,8 @@ struct acpi_device_flags {
> > >     u32 cca_seen:1;
> > >     u32 enumeration_by_parent:1;
> > >     u32 honor_deps:1;
> > > -   u32 reserved:18;
> > > +   u32 should_dedup_eject:1;
> > > +   u32 reserved:17;
> > >  };
> > >
> > >  /* File System */
> > > --
> > > 2.48.1
> > >
>




