Dear Bjorn,

So, any interest in sending your patch upstream? Have you tested it on
other devices? I have no other laptops I can boot Linux on, but I
upgraded the SSD a month ago, and the new one works fine, with no
errors in SMART or DevSta, at width x4 (the maximum of my PCIe 3.0
port) and with the same LTR1.2_Threshold of 81920 ns.

Thanks.
Sergey.

On Tue, May 6, 2025 at 12:57 PM Sergey Dolgov <sergey.v.dolgov@xxxxxxxxx> wrote:
>
> Dear David,
>
> I've seen only the following SOUTHPORT LTR values in different
> combinations depending on device activity:
>
> ######## No disk activity but varying workload otherwise (idle CPU,
> NVIDIA GPU sleeping or not, glxgears on either the Intel or NVIDIA GPU,
> audio playback or not): LTR values fluctuate between
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x90039003  Non-Snoop(ns): 3145728  Snoop(ns): 3145728  LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
> and
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x90039003  Non-Snoop(ns): 3145728  Snoop(ns): 3145728  LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x88b688b6  Non-Snoop(ns): 186368   Snoop(ns): 186368   LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
> SOUTHPORT_D is probably connected to the wifi:
>
> ######## ping -A `ip route list default | awk '{print $3}'` : LTR
> values are constant at
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x90039003  Non-Snoop(ns): 3145728  Snoop(ns): 3145728  LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x88b688b6  Non-Snoop(ns): 186368   Snoop(ns): 186368   LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
> In contrast, with RFKILL'd wifi and bluetooth, the SOUTHPORT_D LTRs
> are all 0.
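>
> As a cross-check of these numbers: the RAW words appear to decode per
> the standard PCIe LTR format (per 16-bit half: bit 15 = requirement
> valid, bits 12:10 = scale, bits 9:0 = value, latency = value *
> 2^(5*scale) ns). A minimal decoding sketch -- which half is snoop
> vs. non-snoop is my guess, but both halves are equal in every dump
> here anyway:
>
>   #include <stdio.h>
>   #include <stdint.h>
>
>   static uint64_t ltr_ns(uint16_t half)
>   {
>           if (!(half & 0x8000))   /* requirement bit clear */
>                   return 0;
>           return (uint64_t)(half & 0x3ff) << (5 * ((half >> 10) & 0x7));
>   }
>
>   int main(void)
>   {
>           uint32_t raw = 0x90039003;      /* SOUTHPORT_C above */
>
>           printf("Non-Snoop(ns): %llu Snoop(ns): %llu\n",
>                  (unsigned long long)ltr_ns(raw >> 16),
>                  (unsigned long long)ltr_ns((uint16_t)raw));
>           return 0;       /* prints 3145728 for both halves */
>   }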
>
> SOUTHPORT_C is probably connected to the NVMes:
>
> ######## du -ch /
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x88468846  Non-Snoop(ns): 71680    Snoop(ns): 71680    LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x880a880a  Non-Snoop(ns): 10240    Snoop(ns): 10240    LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x88b688b6  Non-Snoop(ns): 186368   Snoop(ns): 186368   LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
>  0 PMC0:SOUTHPORT_A  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  1 PMC0:SOUTHPORT_B  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>  8 PMC0:SOUTHPORT_C  LTR: RAW: 0x880a880a  Non-Snoop(ns): 10240    Snoop(ns): 10240    LTR_IGNORE: 0
> 12 PMC0:SOUTHPORT_D  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
> 13 PMC0:SOUTHPORT_E  LTR: RAW: 0x0         Non-Snoop(ns): 0        Snoop(ns): 0        LTR_IGNORE: 0
>
> These values are independent of the kernel (the original 6.14.0 or the
> one with Bjorn's patch that skips the L1.2 config).
>
> Thanks.
> Sergey.
>
> On Fri, May 2, 2025 at 10:30 PM David E. Box
> <david.e.box@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > On Thu, 2025-04-10 at 17:09 -0500, Bjorn Helgaas wrote:
> > > On Thu, Apr 10, 2025 at 02:59:41PM +0100, Sergey Dolgov wrote:
> > > > Dear Bjorn,
> > > >
> > > > one (probably the main) power user is the CPU at shallow C states
> > > > post 7afeb84d14ea. Even at some load (like web browsing) the CPU
> > > > spends most time in C7 after reverting 7afeb84d14ea, in contrast
> > > > to C3 even at idle with the original 6.14.0. So the main question
> > > > is what can make the CPU busy with larger LTR_L1.2_THRESHOLDs?
> > >
> > > That's a good question and I have no idea what the answer is.
> > > Obviously a larger LTR_L1.2_THRESHOLD means less time in L1.2, but I
> > > don't know how that translates to CPU C states.
> > >
> > > These bugs:
> > >
> > >   https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > >   https://bugzilla.kernel.org/show_bug.cgi?id=215832
> > >
> > > mention C states and ASPM. I added some of those folks to cc.
> > >
> > > > I do have Win10 too, but neither the Windows binaries of pciutils
> > > > nor Device Manager shows LTR_L1.2_THRESHOLD. lspci -vv run as
> > > > Administrator does report some "latencies", though. Some of them
> > > > are significantly smaller than the LTR_L1.2_THRESHOLDs calculated
> > > > by Linux, e.g. "Exit Latency L0s <1us, L1 <16us" for the bridge
> > > > 00:1d.6, while others are significantly larger, e.g. "Exit Latency
> > > > L1 unlimited" for the NVMe 6e:00.0. The full log is attached.
> > >
> > > I think I'm missing your point here. The L0s/L1 Acceptable Latencies
> > > and the L0s/L1 Exit Latencies I see in your Win10 lspci are the same
> > > as in Linux, which is what I would expect, because these are
> > > read-only Device and Link Capability registers and the OS can't
> > > influence them:
> > >
> > >   00:1d.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #15
> > >     LnkCap: Port #15, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
> > >
> > >   6e:00.0 Non-Volatile memory controller: Intel Corporation Optane NVME SSD H10
> > >     DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> > >     LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 unlimited
> > >
> > > The DevCap L0s and L1 Acceptable Latencies are "the acceptable total
> > > latency that an Endpoint can withstand due to transition from L0s or
> > > L1 to L0. It is essentially an indirect measure of the Endpoint's
> > > internal buffering."
> > >
> > > The LnkCap L0s and L1 Exit Latencies are the "time the Port requires
> > > to complete transitions from L0s or L1 to L0."
> > >
> > > I don't know how to relate LTR_L1.2_THRESHOLD to L1. I do know that
> > > L0s and L1 were part of PCIe r2.1, but it wasn't until r3.1 that the
> > > L1.1 and L1.2 substates were added and L1 was renamed to L1.0. So I
> > > expect the L1 latencies to be used to enable L1.0 by itself, and I
> > > assume LTR and LTR_L1.2_THRESHOLD are used separately to further
> > > enable L1.2.
> > >
> > > > But do we need to care about precise values? At least we know now
> > > > that 7afeb84d14ea has only increased the thresholds, slightly. What
> > > > happens if they are underestimated? Can this lead to severe
> > > > problems, e.g. data corruption on NVMes?
> > >
> > > IIUC, LTR messages are essentially a way for the device to say "I have
> > > enough local buffer space to hold X ns worth of traffic while I'm
> > > waiting for the link to return to L0."
> > >
> > > Then we should only put the link in L1.2 if we can get to L1.2 and
> > > back to L0 in X ns or less, and LTR_L1.2_THRESHOLD is basically the
> > > minimum L0 -> L1.2 -> L0 time.
> > >
> > > If we set LTR_L1.2_THRESHOLD lower than it should be, it seems like
> > > we're at risk of overrunning the device's local buffer. Maybe that's
> > > OK and the device needs to be able to tolerate that, but it does feel
> > > risky to me.
> > >
> > > There's also the LTR Capability that "specifies the maximum latency a
> > > device is permitted to request. Software should set this to the
> > > platform's maximum supported latency or less." Evidently drivers can
> > > set this (only amdgpu does, AFAICS), but I have no idea how to use it.
> > >
> > > I suppose setting it to something less than LTR_L1.2_THRESHOLD might
> > > cause L1.2 to be used more? This would be writable via setpci, and it
> > > looks like it can be updated at any time. If you want to play with it,
> > > the value and scale are encoded the same way as in
> > > encode_l12_threshold(), and PCI_EXT_CAP_ID_LTR and the related
> > > #defines show the register layouts.
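> > >
> > > For illustration only (a sketch of the value/scale arithmetic, not
> > > kernel code): with the layout from PCI_LTR_VALUE_MASK /
> > > PCI_LTR_SCALE_MASK / PCI_LTR_SCALE_SHIFT in
> > > include/uapi/linux/pci_regs.h, where latency = value * 2^(5*scale)
> > > ns, the 16-bit word to write with setpci could be computed like this:
> > >
> > >   #include <stdio.h>
> > >   #include <stdint.h>
> > >
> > >   static uint16_t ltr_encode(uint64_t ns)
> > >   {
> > >           unsigned int scale;
> > >           uint64_t value;
> > >
> > >           for (scale = 0; scale <= 5; scale++) {
> > >                   /* round ns up to this scale's granularity */
> > >                   value = (ns + (1ULL << (5 * scale)) - 1) >> (5 * scale);
> > >                   if (value <= 0x3ff)     /* fits PCI_LTR_VALUE_MASK */
> > >                           return value | (scale << 10);
> > >           }
> > >           return 0;                       /* out of range (> ~34 s) */
> > >   }
> > >
> > >   int main(void)
> > >   {
> > >           /* 90000 ns -> value 0x58, scale 2: 88 * 1024 = 90112 ns */
> > >           printf("0x%04x\n", ltr_encode(90000));
> > >           return 0;
> > >   }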
> > >
> > > > If not (and I've never seen such a problem while using 5.15
> > > > kernels for 4 years), can we reprogram the LTR_L1.2_THRESHOLDs at
> > > > runtime? Like for the CPU, introduce 'performance' and 'powersave'
> > > > governors for PCI, which set the thresholds to, say, 2x and 0.5x of
> > > > (2 + 4 + t_common_mode + t_power_on), respectively.
> > >
> > > I don't think I would support a sysfs or similar interface to tweak
> > > this. Right now computing LTR_L1.2_THRESHOLD already feels like a bit
> > > of black magic, and tweaking it would be farther down the road of
> > > "well, it seems to help this situation, but we don't really know why."
> > >
> > > > On Wed, Apr 9, 2025 at 12:18 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Apr 08, 2025 at 09:02:46PM +0100, Sergey Dolgov wrote:
> > > > > > Dear Bjorn,
> > > > > >
> > > > > > here are both dmesg from the kernels with your info patch.
> > > > >
> > > > > Thanks again! Here's the difference:
> > > > >
> > > > >   - pre 7afeb84d14ea
> > > > >   + post 7afeb84d14ea
> > > > >
> > > > >    pci 0000:02:00.0: parent CMRT 0x28 child CMRT 0x00
> > > > >    pci 0000:02:00.0: parent T_POWER_ON 0x2c usec (val 0x16 scale 0)
> > > > >    pci 0000:02:00.0: child T_POWER_ON 0x0a usec (val 0x5 scale 0)
> > > > >    pci 0000:02:00.0: t_common_mode 0x28 t_power_on 0x2c l1_2_threshold 0x5a
> > > > >   -pci 0000:02:00.0: encoded LTR_L1.2_THRESHOLD value 0x02 scale 3
> > > > >   +pci 0000:02:00.0: encoded LTR_L1.2_THRESHOLD value 0x58 scale 2
> > > > >
> > > > > We computed LTR_L1.2_THRESHOLD == 0x5a == 90 usec == 90000 nsec.
> > > > >
> > > > > Prior to 7afeb84d14ea, we computed *scale = 3, *value = (90000 >> 15)
> > > > > == 0x2. But per PCIe r6.0, sec 6.18, this is a latency value of only
> > > > > 0x2 * 32768 == 65536 ns, which is less than the 90000 ns we requested.
> > > > >
> > > > > After 7afeb84d14ea, we computed *scale = 2, *value =
> > > > > roundup(threshold_ns, 1024) / 1024 == 0x58, which is a latency value
> > > > > of 90112 ns, almost exactly what we requested.
> > > > >
> > > > > In essence, before 7afeb84d14ea we tell the Root Port that it can
> > > > > enter L1.2 and get back to L0 in 65536 ns or less; after
> > > > > 7afeb84d14ea, we tell it that it may take up to 90112 ns.
> > > > >
> > > > > It's possible that the calculation of LTR_L1.2_THRESHOLD itself in
> > > > > aspm_calc_l12_info() is too conservative, and we don't actually need
> > > > > 90 usec, but I think the encoding done by 7afeb84d14ea itself is more
> > > > > correct. I don't have any information about how to improve the
> > > > > 90 usec estimate. (If you happen to have Windows on that box, it
> > > > > would be really interesting to see how it sets LTR_L1.2_THRESHOLD.)
> > > > >
> > > > > If the device has sent LTR messages indicating a latency requirement
> > > > > between 65536 ns and 90112 ns, the pre-7afeb84d14ea kernel would
> > > > > allow L1.2 while post-7afeb84d14ea would not. I don't think we can
> > > > > actually see the LTR messages sent by the device, but my guess is
> > > > > they must be in that range. I don't know if that's enough to account
> > > > > for the major difference in power consumption you're seeing.
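> > > > >
> > > > > To recap the two encodings above in one standalone sketch (the
> > > > > arithmetic only, not the actual kernel code):
> > > > >
> > > > >   #include <stdio.h>
> > > > >
> > > > >   int main(void)
> > > > >   {
> > > > >           unsigned int threshold_ns = 90000;  /* l1_2_threshold 0x5a usec */
> > > > >
> > > > >           /* pre-7afeb84d14ea: scale 3 (32768 ns units), truncating */
> > > > >           unsigned int old_val = threshold_ns >> 15;           /* 0x02 */
> > > > >           /* post-7afeb84d14ea: scale 2 (1024 ns units), rounding up */
> > > > >           unsigned int new_val = (threshold_ns + 1023) >> 10;  /* 0x58 */
> > > > >
> > > > >           printf("pre:  %u ns\n", old_val << 15);  /* 65536, below the request */
> > > > >           printf("post: %u ns\n", new_val << 10);  /* 90112, just above it */
> > > > >           return 0;
> > > > >   }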
> >
> > If the Root Port is attached to a controller in the South Complex,
> > which would be the case on a Cannon Lake based platform, you can
> > observe the resulting LTR value sent from the Port using the pmc_core
> > driver:
> >
> >   cat /sys/kernel/debug/pmc_core/ltr_show | grep SOUTHPORT
> >
> > This needs CONFIG_INTEL_PMC_CORE, which the major distros set.
> >
> > The SOUTHPORTs correspond to Root Ports. Unfortunately, we don't
> > currently have a mapping between the internal PMC SOUTHPORT_X
> > designation and the PCI bus enumeration. However, since this behavior
> > clearly affects C-state entry, you should be able to narrow it down by
> > monitoring this file, ideally capturing several snapshots, as the
> > values can change depending on device activity.
> >
> > Note that the value shown may not exactly match what was sent in the
> > LTR message, but it won't be smaller. My current assumption (pending
> > confirmation) is that simply entering L1.2 increases the effective LTR
> > value observed by the CPU, since it's unlikely that the LTR message
> > value itself changes solely as a result of modifying the threshold.
> >
> > Incidentally, you can also ignore the LTR from a Port by writing its
> > bit value (the first column) to the ltr_ignore file in the same
> > directory. This is for testing only, as it ignores device activity.
> > But you should see deeper C-state residency after ignoring the problem
> > Port, which would be another way to narrow down the SOUTHPORT. The LTR
> > consideration can be restored by writing the same bit value to the
> > ltr_restore file.
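> >
> > For example, if SOUTHPORT_C (first column 8 in Sergey's listings)
> > turned out to be the problem Port, that would be something like:
> >
> >   echo 8 > /sys/kernel/debug/pmc_core/ltr_ignore
> >   # ... re-check C-state residency, e.g. with turbostat ...
> >   echo 8 > /sys/kernel/debug/pmc_core/ltr_restore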
> >
> > David
> >
> > > > > The AX200 at 6f:00.0 is in exactly the same situation as the
> > > > > Thunderbolt bridge at 02:00.0 (LTR_L1.2_THRESHOLD 90 usec, RP set
> > > > > to 65536 ns before 7afeb84d14ea and 90112 ns after).
> > > > >
> > > > > For the NVMe devices at 6d:00.0 and 6e:00.0, LTR_L1.2_THRESHOLD is
> > > > > 3206 usec (!), and we set the RP to 3145728 ns (slightly too
> > > > > small) before, and 3211264 ns after.
> > > > >
> > > > > For the RTS525A at 70:00.0, LTR_L1.2_THRESHOLD is 126 usec, and we
> > > > > set the RP to 98304 ns before, and 126976 ns after.
> > > > >
> > > > > Sorry, no real answers here yet, still puzzled.
> > > > >
> > > > > Bjorn