On Wed, 2 Apr 2025 at 21:19, Mario Limonciello <superm1@xxxxxxxxxx> wrote: > > On 4/1/2025 5:06 PM, Antheas Kapenekakis wrote: > > On Tue, 1 Apr 2025 at 22:54, Mario Limonciello <superm1@xxxxxxxxxx> wrote: > >> > >> On 4/1/2025 1:39 PM, Antheas Kapenekakis wrote: > >>> On Tue, 1 Apr 2025 at 17:24, Mario Limonciello <superm1@xxxxxxxxxx> wrote: > >>>> > >>>> On 4/1/2025 10:03 AM, Antheas Kapenekakis wrote: > >>>>> On Tue, 1 Apr 2025 at 16:09, Mario Limonciello <superm1@xxxxxxxxxx> wrote: > >>>>>> > >>>>>> On 4/1/2025 7:45 AM, Antheas Kapenekakis wrote: > >>>>>>> On Tue, 1 Apr 2025 at 14:30, Mario Limonciello <superm1@xxxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>>>> Here are tags for linking to your patch development to be picked up. > >>>>>>>>>> > >>>>>>>>>> Link: > >>>>>>>>>> https://github.com/bazzite-org/patchwork/commit/95b93b2852718ee1e808c72e6b1836da4a95fc63 > >>>>>>>>>> Co-developed-by: Antheas Kapenekakis <lkml@xxxxxxxxxxx> > >>>>>>>>>> Signed-off-by: Antheas Kapenekakis <lkml@xxxxxxxxxxx> > >>>>>>>>> > >>>>>>>> > >>>>>>>> I don't believe that b4 will pick these up, so I will send out a v2 with > >>>>>>>> them and mark this patch as superceded in patchwork so that Rafael > >>>>>>>> doesn't have to pull everything out of this thread manually. > >>>>>> > >>>>>> FTR I don't have permission on patchwork for linux-acpi. > >>>>>> > >>>>>> I sent out v2 though. > >>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> And to avoid having this conversation again, there is another Legion > >>>>>>>>> Go S [3] patch you nacked and froze the testing for, so you could go > >>>>>>>>> on the manhunt for the real cause of this one. But it will probably be > >>>>>>>>> needed and you will find that as you get TDP controls going. So if you > >>>>>>>>> want me to prepare that in a timely manner, because that one actually > >>>>>>>>> needs rewriting to be posted, now is the time to say so. > >>>>>>>> > >>>>>>>> Can you please propose what you have in mind on the mailing lists to > >>>>>>>> discuss? It's relatively expensive (in the unit of tech debt) to add > >>>>>>>> quirk infrastructure and so we need to make sure it is the right solution. > >>>>>>>> > >>>>>>>> Derek is working on CPU coefficient tuning in a completely separate > >>>>>>>> driver. If there are issues with that, I would generally prefer the > >>>>>>>> fixes to be in that driver. > >>>>>>> > >>>>>>> CPU coefficient tuning? If you mean the lenovo-wmi-driver, yes I will > >>>>>>> try to make sure the quirk can be potentially added there, or in any > >>>>>>> driver*. > >>>>>> > >>>>>> Yes things like fPPT, sPPT, STAPM, STT limits. > >>>>>> > >>>>>>> > >>>>>>> The idea is to rewrite the patch series to just add a simple delay > >>>>>>> field on the s2idle quirk struct. Then the biggest delay wins and gets > >>>>>>> placed in ->begin. We have been using that series for ~6 months now, > >>>>>>> and it turns out that having a delay system for every call is quite > >>>>>>> pointless. But there are also situations where you might have a device > >>>>>>> such as the Z13 Folio which looks like a USB device but listens to > >>>>>>> s2idle notifications through ACPI, so the hid subsystem might need to > >>>>>>> be able to inject a small delay there. > >>>>>> > >>>>>> So the "general" problem with injecting delays is they are typically not > >>>>>> scalable as they're usually empirically measured and there is no > >>>>>> handshake with the firmware. > >>>>>> > >>>>>> Say for example the EC has some hardcoded value of 200ms to wait for > >>>>>> something. IIRC the Linux timer infrastructure can be off by ~13%. So > >>>>>> if you put 175ms it might work sometimes. You get some reports of this, > >>>>>> so you extend it to 200ms. Great it works 100% of the time because the > >>>>>> old hardcoded value in the EC was 200ms. > >>>>>> > >>>>>> Now say a new EC firmware comes out that for $REASONS changes it to > >>>>>> 250ms. Your old empirically measured value stops working, spend a bunch > >>>>>> of cycles debugging it, measure the new one. You change it to 250ms, > >>>>>> but people with the old one have a problem now because the timing changed. > >>>>>> > >>>>>> So now you have to add infrastructure to say what version of the > >>>>>> firmware gets what delay. > >>>>>> > >>>>>> Then you find out there is another SKU of that model which needs a > >>>>>> different delay, so your complexity has ballooned. > >>>>>> > >>>>>> What if all these "delays" were FW timeouts from failing to service an > >>>>>> interrupt? Or what if they were a flow problem like the device expected > >>>>>> you to issue a STOP command before a RESET command? > >>>>>> > >>>>>> So we need to be /incredibly careful/ with delays and 100% they are the > >>>>>> right answer to a problem. > >>>>> > >>>>> I do get your points. In this case though we sideskirt through a lot > >>>>> of the points because of where the delay is placed. > >>>>> > >>>>> If the instrumentation is in-place, this delay happens before sleep > >>>>> after the screen of the device has turned off (due to early DPMS), the > >>>>> keyboard backlight has turned off (DIsplay off call), and the suspend > >>>>> light pulses (Sleep Entry). So it does not affect device behavior and > >>>>> you can be quite liberal. The user has left the device alone. > >>>>> > >>>>> If the device needs e.g., 250ms you will not put 250ms, you will put > >>>>> 500ms. Still unsure, you bump it to 750ms. Also, even if the > >>>>> manufacturer comes up with a new firmware that fixes this issue, you > >>>>> can keep the delay for the life of the product, because keeping it > >>>>> does not affect device behavior, and writing kernel patches takes time. > >>>>> > >>>>> This is how I think about it, at least. A universal delay might be > >>>>> needed eventually. But for now, limiting the scope to some devices and > >>>>> seeing how that goes should be enough. > >>>>> > >>>>> Antheas > >>>> > >>>> My take is that "universal" delays are never popular. IE hardware that > >>>> "previously" worked perfectly is now slower. So if there /must/ be a > >>>> delay it should be as narrow as possible and justified. > >>>> > >>>> Let me give you an example of another case I'm *actively considering* a > >>>> delay. > >>>> > >>>> I have an OEM's system that if you enter and exit s0i3 too quickly you > >>>> can trigger the over voltage protection (OVP) feature of the VR module. > >>>> When OVP is tripped the system is forced off immediately. This *only > >>>> happens* on the VR module in that vendor's systems. "Normal" Linux > >>>> userspace suspend/resume can't trip it. But connecting a dock "does" > >>>> trip it. > >>>> > >>>> If you look on a scope you can see SLP_S3# pin is toggling faster than > >>>> spec says it should. Naïvely you would say well the easy solution is to > >>>> add a delay somewhere so that SLP_S3# stays in spec. I have a patch > >>>> that does just that. > >>>> > >>>> diff --git a/drivers/platform/x86/amd/pmc/pmc.c > >>>> b/drivers/platform/x86/amd/pmc/pmc.c > >>>> index e6124498b195f..97387ddb281e1 100644 > >>>> --- a/drivers/platform/x86/amd/pmc/pmc.c > >>>> +++ b/drivers/platform/x86/amd/pmc/pmc.c > >>>> @@ -724,10 +724,20 @@ static void amd_pmc_s2idle_check(void) > >>>> struct smu_metrics table; > >>>> int rc; > >>>> > >>>> - /* CZN: Ensure that future s0i3 entry attempts at least 10ms > >>>> passed */ > >>>> - if (pdev->cpu_id == AMD_CPU_ID_CZN && !get_metrics_table(pdev, > >>>> &table) && > >>>> - table.s0i3_last_entry_status) > >>>> - usleep_range(10000, 20000); > >>>> + if (!get_metrics_table(pdev, &table) && > >>>> table.s0i3_last_entry_status) { > >>>> + switch (pdev->cpu_id) { > >>>> + /* CZN: Ensure that future s0i3 entry attempts at least > >>>> 10ms passed */ > >>>> + case AMD_CPU_ID_CZN: > >>>> + usleep_range(10000, 20000); > >>>> + break; > >>>> + /* PHX/HPT: Ensure enough time to avoid VR OVP */ > >>>> + case AMD_CPU_ID_PS: > >>>> + msleep(2500); > >>>> + break; > >>>> + default: > >>>> + break; > >>>> + } > >>>> + } > >>>> > >>>> This stops all the failures, but it also has an impact that any time the > >>>> EC SCI is raised for any reason (like plug in power adapter) the system > >>>> will take 2.5s to go back into s0i3. > >>>> > >>>> Digging further - the intended behavior by the EC and BIOS was to wake > >>>> the system when the dock is connected. > >>>> > >>>> That is the reason this happens is because the EC SCI is raised when the > >>>> dock is connected, but the Notify() the EC sent wasn't received by any > >>>> driver. I've got a patch I'll be sending out soon that adds support for > >>>> the correct driver to wake up on this event. > >>>> > >>>> This prevents the case of the OVP and now we don't *need* to penalize > >>>> everyone to wait 2.5s after EC SCI events and going back to s0i3. If I > >>>> find out there are other ways to trip the problem I still have that > >>>> option though. > >>> > >>> So you are talking about missing the AC/DC burst feature of Windows > >>> here right? I do agree with you that yeah for most devices it is not > >>> necessary. > >> > >> No; I wasn't talking about that, my point was that timing delays are a > >> tempting to solution to a problem, but they're very often papering over > >> something else and a hint to dive deeper. > > > > What I gleaned from what you said is that X manufacturer has a problem > > due to missing AC/DC bursts in linux, where all AC/DC burst is is a 5s > > delay. > > > > The intended behavior of AC/DC bursts is to fully wake up the kernel > > for 5 seconds, and then sleep again. In windows, if a power supply is > > connected, userspace wakes up too, and then the Windows power manager > > sleeps the system again if there is no user activity for 5 seconds. > > However, this should not affect device drivers, so we may consider it > > optional on the Linux side until DEs get support for it and enable it > > themselves I would say. > > > > So in effect, AC/DC bursts are Windows' solution to problems like the > > one you faced. > > > > I am not saying penalize everyone. If I do make a patch for AC/DC it > > will be device specific. But after a point, if random devices start > > getting issues and the quirk list starts to grow, it might become > > inevitable to force it for all of them. > > > > I do get what you are saying with delays though. We had to merge one > > of the initial SOF delay patch variants for the Steam Deck which > > prevents audio crashing on resume, and that was definitely a bandage. > > > > Maybe I'm failing at my search-engine-foo, could you point me at some > docs about this AC/DC burst stuff? AC/DC Burst/AC/DC Burst Suppresed are the events in Sleep Study https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/modern-standby-sleepstudy You can see those when running a sleep study and unplugging a connector. I think suppressed is unplugging Then here is the description for plugging in a charger https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/modern-standby-wake-sources#environmental-context-changes-1 > The Windows power manager will turn on the display when the battery subsystem has indicated > AC power has been connected. The GPIO interrupt for power source changes must cause the > ACPI _PSR method under the power supply device to be executed. The power subsystem must > wake the SoC any time the power source changes, including when the system is attached or > removed from a dock that has a battery or AC power source. After AC power is connected, > the display will remain on for five seconds, unless there is input to the system during this five-second window. And here for unplugging: https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/modern-standby-wake-sources#environmental-context-changes-2 > The GPIO interrupt for power source changes must cause the ACPI _PSR method under > the power supply device to be executed. The power subsystem must wake the SoC any time > the power source changes, including when the system is attached or removed from a dock > that has a battery or AC power source. I guess from the description it is not clear that the device stays on for 5 seconds when unplugging, but from empirical testing I want to say it does. It has been a while. I left 3 devices like an hour ago on Windows and none of them managed to sleep, so I cannot verify this at the moment though. Antheas > [ > And FTR unfortunately it's seeming that my proposal for the alternate > wake source has negative side effects to other machines, so I'm going > back to exploring a timing based quirk tied to SMBIOS data again too :( ] > > >>> > >>> But Microsoft guarantees 5 seconds. We already have the original Ally > >>> unit which gets stuck in prochot due to this so it would be nice to > >>> fix. For the Ally X I am unsure what Asus did but it stays awake for a > >>> nice three seconds after you plug/unplug the charger so it has no > >>> issues. > >>> > >>> So if devices keep getting issues like we will have to eat it and do > >>> AC/DC bursts with all of them. > >>> > >>> And it is the same with entering s0i3 too fast. Some devices just need > >>> a tiny amount of time to do whatever it is their manufacturer > >>> programmed them to do after the Modern Standby notifications. For > >>> handhelds, it is to turn off the controllers because XInput. Asus put > >>> the fade animation so that takes 300ms and if you do it earlier the > >>> controller gets cut before it saves its state and starts to do weird > >>> RGB stuff. Other manufacturers typically do not malfunction but they > >>> still use the notification. > >>> > >>> Only MSI does not, but that controller is quirky before/after sleep > >>> and they released a firmware update today saying something about > >>> controller S3/S4 improvements so they probably do that too now, I need > >>> to check. > >>> > >>> For the Go S, it sets itself to 5W after sleep entry and turns off the > >>> fan. A little delay went a long way in fixing the hang there, which I > >>> suspect is due to aggressive tuning. But I do not know if you guys get > >>> that. We did when we did the initial testing for it and carry the > >>> delay now so I cannot tell you either way. So you should max out the > >>> TDP, run stress -c 16, and make the device sleep 100-200 times to make > >>> sure that is not an issue. > >>> > >>> I do have a plan for trying to rework AC/DC bursts, but first the > >>> s2idle ordering needs to be fixed and I need to rewrite the series for > >>> that. The series we have for that works _fine_ so it is not a priority > >>> to rework but it is not upstreamable in its current state so if you > >>> need that (for the Go S) I need to know now. > >>> > >>> For ACDC my idea would be after the reordering is done to have a quirk > >>> that makes the kernel resume, fire the sleep exit notification, loop > >>> for 5 (maybe 3?) seconds inside the device suspend section prior to > >>> userspace resume, and then as long as a wakeup did not arrive restart > >>> the suspend sequence to sleep again. I would also combine that with > >>> the little s2idle wakeup device you made, so that userspace can enable > >>> wakeups for that if it wants to do resume on dock connection. But that > >>> has a lot of moving parts, including moving the DPMS action to happen > >>> even earlier than your patch does and making sure display on/off does > >>> not fire so that the keyboard backlight does not do weird stuff. > >>> > >>> Antheas > >> > >> I think a good start for what you're talking about would be to rebase > >> your series that reworked s2idle flow on 6.15 code (maybe it's a clean > >> rebase, idk) and then if/when all of us on LKML are happy with it we can > >> layer other concepts on top of that. > > > > Yeah, I will try to do that. However, I have around 30 submitted > > patches in the air right now, and we are about to add another 5-6 to > > the list for the Claw. So it will probably be after a bunch of those > > merge. For the interest of sustainability, if nothing else. So let's > > put a dot on this and pick up the discussion again mid 6.16 in a month > > or so. > > > > Unless you need this series for the Go S, in which case I can try to > > re-order stuff around. So, one of you should use the red light TDP > > mode with an artificial load (or actual, such as a game) and see if > > sleep works properly. Do that on battery. > > Take your time and get to it when you get to it. I just want to make > sure we build a clean foundation for big changes like you have in mind. > > > > > I would do at least 100 suspends with this, as most users do 50-70 > > suspends per reboot. I think I did around 300 to validate the Go S > > quirk. > > > > Antheas >