Hi,

> -----Original Message-----
> From: Christian Loehle <christian.loehle@xxxxxxx>
> Sent: 23 March 2025 09:19
> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>;
> Rafael J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>;
> linux-block@xxxxxxxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
>
> On 3/17/25 10:03, King, Colin wrote:
> > Hi Christian,
> >
> > Follow-up below:
> >
> >> -----Original Message-----
> >> From: Christian Loehle <christian.loehle@xxxxxxx>
> >> Sent: 03 March 2025 22:25
> >> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>;
> >> Rafael J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano
> >> <daniel.lezcano@xxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx;
> >> linux-pm@xxxxxxxxxxxxxxx
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >> prevention for fast I/O devices
> >>
> >> On 3/3/25 16:43, Colin Ian King wrote:
> >>> Modern processors can drop into deep sleep states relatively quickly
> >>> to save power. However, coming out of deep sleep states takes a
> >>> small amount of time and this is detrimental to performance for I/O
> >>> devices such as fast PCIe NVME drives when servicing completed I/O
> >>> transactions.
> >>>
> >>> Testing with fio with read/write RAID0 PCIe NVME devices on various
> >>> modern SMP based systems (such as a 96 thread Granite Rapids Xeon
> >>> 6741P) has shown that 85-90% of read/write transactions issued on
> >>> a CPU are completed by the same CPU, so it makes some sense to
> >>> prevent the CPU from dropping into a deep sleep state to help reduce
> >>> I/O handling latency.
> >>
> >> For the platform you tested on that may be true, but even if we
> >> constrain ourselves to pci-nvme there's a variety of queue/irq
> >> mappings where this doesn't hold, I'm afraid.
> >
> > This code is optional, one can enable it or disable it via the config
> > option. Also, even when it is built in one can disable it by writing 0
> > to the sysfs file
> > /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >
> >>
> >>>
> >>> This commit introduces a simple, lightweight and fast power sleep
> >>> demotion mechanism that provides the block layer a way to inform the
> >>> menu governor to prevent a CPU from going into a deep sleep when an
> >>> I/O operation is requested. While it is true that some I/Os may not
> >>
> >> s/requested/completed is the full truth, isn't it?
> >>
> >>> be serviced on the same CPU that issued the I/O request and hence is
> >>> not 100% perfect, the mechanism does work well in the vast majority
> >>> of I/O operations and there is very small overhead with the sleep
> >>> demotion prevention.
> >>>
> >>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME
> >>> md array using fio 3.35 performing random read and read-write tests
> >>> on a 512GB file with 8 concurrent I/O jobs. Tested with the
> >>> NHM_C1_AUTO_DEMOTE bit set in MSR_PKG_CST_CONFIG_CONTROL in
> >>> the BIOS.
> >>>
> >>> Test case: random reads, results based on geometric mean of results
> >>> from 5 test runs:
> >>>                Bandwidth         IO-ops    Latency   Bandwidth
> >>>                read (bytes/sec)  per sec   (ns)      % Std.Deviation
> >>> Baseline:      21365755610       20377     390105    1.86%
> >>> Patched:       25950107558       24748     322905    0.16%
> >>
> >> What is the baseline?
> >> Do you mind trying with Rafael's recently posted series?
> >> Given the IOPS I'd expect good results from that alone already.
> >> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@xxxxxxxxxxxxx/
> >>
> >> (Happy to see teo as comparison too, which you don't modify).
> >
> > OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is
> > 6.14-rc5 without Rafael's patch. I also tested the baseline with C6
> > and C6P disabled as this stops deeper C-state sleeps and in theory
> > should provide "best performance". I also benchmarked with Rafael's
> > patch and just my patch, and finally Rafael's patch AND my patch:
> >
> > Reads
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            25689182477  0.15       326177    0.15
> > C6, C6P disabled    25839580014  0.19       324349    0.19
> > Rafael's patch:     25695523747  0.06       326150    0.06
> > My patch:           25782011833  0.07       324999    0.07
> > Rafael + my patch:  25792551514  0.10       324924    0.10
>
> So these are mostly equal right?

Not really. In this case we have a system under a lot of random read I/O.
Disabling C6 increases the read rate since we remove the wake-up latency of
the deeper sleep states, and with my patch we approach the C6-disabled
rate. The results are an average of 5 runs and the metrics have fairly low
jitter, so I believe the read rates are reliable. We're getting a better
read rate with my patch than the baseline, without the power penalty of C6
being permanently disabled.

> >
> > Writes
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            15220468898  3.33       550290    3.36
> > C6, C6P disabled    13405624707  0.66       625424    0.66
> > Rafael's patch:     14017625200  1.15       598049    1.16
> > My patch:           15444417488  3.73       467818    29.10
> > Rafael + my patch:  14037711032  1.13       597143    1.13
>
> These don't make sense to me, why would C6 / C6P be this bad?

I double checked; it's because I got the row labels mixed up when pasting
the results into the email. Doh! It should be:

Baseline            13405624707  0.66       625424    0.66
C6, C6P disabled    15220468898  3.33       550290    3.36

> Why would Rafael's patch make it worse?
> Are these just noise?
>
> >
> > Combined Read+Writes, Reads
> >
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            10132229433  0.41       484919    0.25
> > C6, C6P disabled    10567536346  0.60       515260    1.16
> > Rafael's patch:     10171044817  0.37       486937    0.20
> > My patch:           10468953527  0.07       504797    0.07
> > Rafael + my patch:  10174707546  1.26       488263    1.13
> >
> > Combined Read+Writes, Writes
> >
> >                     Bandwidth    Bandwidth  latency   latency
> >                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline            10139393169  0.44       342132    1.23
> > C6, C6P disabled    10583264662  0.63       277052    3.87
> > Rafael's patch:     10178275035  0.39       336989    0.94
> > My patch:           10482766569  1.28       294803    6.87
> > Rafael + my patch:  10183837235  0.38       330657    3.39
>
> The above two indicate a +3% from (now mainline) Rafael's patch to yours.
> There's no reason why Rafael + your patch should be worse than just your
> patch.
> If this is statistically significant we would need to look into why.
>
> FWIW I think the energy cost of essentially remaining in polling even for
> quite sporadic IO can't be overstated IMO.
> Comparison with teo, which might be slightly better than menu while not
> outright disabling idle, would still be interesting.
>
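
For clarity, below is a minimal sketch of the kind of mechanism described
above. This is purely illustrative and is not the actual patch code: the
block layer stamps a per-CPU timestamp on I/O activity, and the menu
governor avoids deep C-states while fewer than psd_cpu_lat_timeout_ms
milliseconds have elapsed since that stamp (writing 0 to the sysfs file
disables it). The function names, the per-CPU variable and the default
timeout here are made up for the example.

/*
 * Illustrative sketch only -- not the code from the patch.  It shows one
 * way the block layer could hint to the menu governor that deep C-states
 * should be avoided for a short window after I/O activity on this CPU.
 */
#include <linux/cache.h>
#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/types.h>

/*
 * Timeout in milliseconds; 0 disables the mechanism.  In the patch this is
 * exposed as /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms.
 * The default of 3 ms here is arbitrary for the example.
 */
static unsigned int psd_cpu_lat_timeout_ms __read_mostly = 3;

/* Last time (in jiffies) this CPU saw block I/O activity */
static DEFINE_PER_CPU(unsigned long, psd_last_io_jiffies);

/* Hypothetical hook called from the block layer on I/O submission/completion */
void cpuidle_psd_note_io(void)
{
	if (psd_cpu_lat_timeout_ms)
		this_cpu_write(psd_last_io_jiffies, jiffies);
}

/*
 * Hypothetical check used by the menu governor: while this returns true the
 * governor would demote its C-state selection to a shallow state.
 */
bool cpuidle_psd_deep_sleep_blocked(void)
{
	unsigned long last = this_cpu_read(psd_last_io_jiffies);

	return psd_cpu_lat_timeout_ms &&
	       time_before(jiffies,
			   last + msecs_to_jiffies(psd_cpu_lat_timeout_ms));
}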