Hi,

> -----Original Message-----
> From: Christian Loehle <christian.loehle@xxxxxxx>
> Sent: 23 March 2025 09:19
> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>;
> Rafael J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>;
> linux-block@xxxxxxxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
>
> On 3/17/25 10:03, King, Colin wrote:
> > Hi Christian,
> >
> > Follow-up below:
> >
> >> -----Original Message-----
> >> From: Christian Loehle <christian.loehle@xxxxxxx>
> >> Sent: 03 March 2025 22:25
> >> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>;
> >> Rafael J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano
> >> <daniel.lezcano@xxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx;
> >> linux-pm@xxxxxxxxxxxxxxx
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >> prevention for fast I/O devices
> >>
> >> On 3/3/25 16:43, Colin Ian King wrote:
> >>> Modern processors can drop into deep sleep states relatively quickly
> >>> to save power. However, coming out of deep sleep states takes a
> >>> small amount of time and this is detrimental to performance for I/O
> >>> devices such as fast PCIe NVME drives when servicing completed I/O
> >>> transactions.
> >>>
> >>> Testing with fio with read/write RAID0 PCIe NVME devices on various
> >>> modern SMP based systems (such as a 96 thread Granite Rapids Xeon
> >>> 6741P) has shown that 85-90% of read/write transactions issued on
> >>> a CPU are completed by the same CPU, so it makes some sense to
> >>> prevent the CPU from dropping into a deep sleep state to help reduce
> >>> I/O handling latency.
> >>
> >> For the platform you tested on that may be true, but even if we
> >> constrain ourselves to pci-nvme there's a variety of queue/irq
> >> mappings where this doesn't hold, I'm afraid.
> >
> > This code is optional, one can enable it or disable it via the config
> > option. Also, even when it is built in one can disable it by writing 0
> > to the sysfs file
> > /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >
> >>
> >>>
> >>> This commit introduces a simple, lightweight and fast power sleep
> >>> demotion mechanism that provides the block layer a way to inform the
> >>> menu governor to prevent a CPU from going into a deep sleep when an
> >>> I/O operation is requested. While it is true that some I/Os may not
> >>
> >> s/requested/completed is the full truth, isn't it?
> >>
> >>> be serviced on the same CPU that issued the I/O request and hence is
> >>> not 100% perfect, the mechanism does work well in the vast majority
> >>> of I/O operations and there is very small overhead with the sleep
> >>> demotion prevention.
> >>>
> >>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME
> >>> md array using fio 3.35 performing random read and read-write tests
> >>> on a 512GB file with 8 concurrent I/O jobs. Tested with the
> >>> NHM_C1_AUTO_DEMOTE bit set in MSR_PKG_CST_CONFIG_CONTROL in
> >>> the BIOS.
> >>>
> >>> Test case: random reads, results based on geometric mean of results
> >>> from 5 test runs:
> >>>                Bandwidth         IO-ops    Latency   Bandwidth
> >>>                read (bytes/sec)  per sec   (ns)      % Std.Deviation
> >>> Baseline:      21365755610       20377     390105    1.86%
> >>> Patched:       25950107558       24748     322905    0.16%
> >>
> >> What is the baseline?
> >> Do you mind trying with Rafael's recently posted series?
> >> Given the IOPS I'd expect good results from that alone already.
> >> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@xxxxxxxxxxxxx/
> >>
> >> (Happy to see teo as comparison too, which you don't modify).
> >
> > OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is
> > 6.14-rc5 without Rafael's patch. I also tested the baseline with C6
> > and C6P disabled as this stops deeper C-state sleeps and in theory
> > should provide "best performance". I also benchmarked with Rafael's
> > patch and just my patch, and finally Rafael's patch AND my patch:
> >
> > Reads
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            25689182477  0.15       326177    0.15
> > C6, C6P disabled    25839580014  0.19       324349    0.19
> > Rafael's patch:     25695523747  0.06       326150    0.06
> > My patch:           25782011833  0.07       324999    0.07
> > Rafael + my patch:  25792551514  0.10       324924    0.10
>
> So these are mostly equal right?

Not really. In this case we have a system under a lot of random read I/O.
Disabling C6 increases the read rate since we remove the wake-up latency of
the deeper sleep states, and with my patch we approach the C6-disabled
rate. The results are an average of 5 runs and the metrics have fairly low
jitter, so I believe the read rates are reliable. We're getting a better
read rate with my patch than the baseline, without the power penalty of C6
being permanently disabled.

> >
> > Writes
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            15220468898  3.33       550290    3.36
> > C6, C6P disabled    13405624707  0.66       625424    0.66
> > Rafael's patch:     14017625200  1.15       598049    1.16
> > My patch:           15444417488  3.73       467818    29.10
> > Rafael + my patch:  14037711032  1.13       597143    1.13
>
> These don't make sense to me, why would C6 / C6P be this bad?

I double checked; it's because I got the row labels mixed up when pasting
the results into the email. Doh! It should be:

Baseline            13405624707  0.66       625424    0.66
C6, C6P disabled    15220468898  3.33       550290    3.36

> Why would Rafael's patch make it worse?
> Are these just noise?
>
> >
> > Combined Read+Writes, Reads
> >
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            10132229433  0.41       484919    0.25
> > C6, C6P disabled    10567536346  0.60       515260    1.16
> > Rafael's patch:     10171044817  0.37       486937    0.20
> > My patch:           10468953527  0.07       504797    0.07
> > Rafael + my patch:  10174707546  1.26       488263    1.13
> >
> > Combined Read+Writes, Writes
> >
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            10139393169  0.44       342132    1.23
> > C6, C6P disabled    10583264662  0.63       277052    3.87
> > Rafael's patch:     10178275035  0.39       336989    0.94
> > My patch:           10482766569  1.28       294803    6.87
> > Rafael + my patch:  10183837235  0.38       330657    3.39
>
> The above two indicate a +3% from (now mainline) Rafael's patch to yours.
> There's no reason why Rafael + your patch should be worse than just your
> patch.
> If this is statistically significant we would need to look into why.
>
> FWIW I think the energy cost of essentially remaining in polling even for
> quite sporadic IO can't be overstated IMO.
> Comparison with teo, which might be slightly better than menu while not
> outright disabling idle, would still be interesting.
>
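
For clarity, below is a minimal sketch of the kind of mechanism described
above. This is purely illustrative and is not the actual patch code: the
block layer stamps a per-CPU timestamp on I/O activity, and the menu
governor avoids deep C-states while fewer than psd_cpu_lat_timeout_ms
milliseconds have elapsed since that stamp (writing 0 to the sysfs file
disables it). The function names, the per-CPU variable and the default
timeout here are made up for the example.

/*
 * Illustrative sketch only -- not the code from the patch.  It shows one
 * way the block layer could hint to the menu governor that deep C-states
 * should be avoided for a short window after I/O activity on this CPU.
 */
#include <linux/cache.h>
#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/types.h>

/*
 * Timeout in milliseconds; 0 disables the mechanism.  In the patch this is
 * exposed as /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms.
 * The default of 3 ms here is arbitrary for the example.
 */
static unsigned int psd_cpu_lat_timeout_ms __read_mostly = 3;

/* Last time (in jiffies) this CPU saw block I/O activity */
static DEFINE_PER_CPU(unsigned long, psd_last_io_jiffies);

/* Hypothetical hook called from the block layer on I/O submission/completion */
void cpuidle_psd_note_io(void)
{
	if (psd_cpu_lat_timeout_ms)
		this_cpu_write(psd_last_io_jiffies, jiffies);
}

/*
 * Hypothetical check used by the menu governor: while this returns true the
 * governor would demote its C-state selection to a shallow state.
 */
bool cpuidle_psd_deep_sleep_blocked(void)
{
	unsigned long last = this_cpu_read(psd_last_io_jiffies);

	return psd_cpu_lat_timeout_ms &&
	       time_before(jiffies,
			   last + msecs_to_jiffies(psd_cpu_lat_timeout_ms));
}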