RE: [PATCH v3 1/8] EDAC: Update documentation for the CXL memory patrol scrub control feature

>-----Original Message-----
>From: Dave Jiang <dave.jiang@xxxxxxxxx>
>Sent: 28 April 2025 18:45
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-cxl@xxxxxxxxxxxxxxx;
>dan.j.williams@xxxxxxxxx; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>; dave@xxxxxxxxxxxx;
>alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx; bp@xxxxxxxxx;
>tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx; leo.duran@xxxxxxx;
>Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx; nifan.cxl@xxxxxxxxx;
>Linuxarm <linuxarm@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>;
>Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto Sassu
><roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; wanghuiqiang
><wanghuiqiang@xxxxxxxxxx>
>Subject: Re: [PATCH v3 1/8] EDAC: Update documentation for the CXL memory
>patrol scrub control feature
>
>
>
>On 4/7/25 10:49 AM, shiju.jose@xxxxxxxxxx wrote:
>> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
>>
>> Update the Documentation/edac/scrub.rst to include usecases and
>> policies for CXL memory device-based, CXL region-based patrol scrub
>> control and CXL Error Check Scrub (ECS).
>>
>> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> ---
>>  Documentation/edac/scrub.rst | 75 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 75 insertions(+)
>>
>> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
>> index daab929cdba1..6132853a02fe 100644
>> --- a/Documentation/edac/scrub.rst
>> +++ b/Documentation/edac/scrub.rst
>> @@ -264,3 +264,78 @@ Sysfs files are documented in
>> `Documentation/ABI/testing/sysfs-edac-scrub`
>>
>>  `Documentation/ABI/testing/sysfs-edac-ecs`
>> +
>> +Examples
>> +--------
>> +
>> +The usage takes the form shown in these examples:
>> +
>> +1. CXL memory Patrol Scrub
>> +
>> +The following are the usecases identified why we might increase the scrub rate.
>> +
>> +- Scrubbing is needed at device granularity because a device is showing
>> +  unexpectedly high errors, the scrub control needs to be at device granularity
>
>Not sure what the second part of the sentence has to do with defining the use
>case.
>When the per device control is detailed in 1.1, you can refer to the first use case.

Hi Dave,

Thanks for the comments.
Sure. I will correct.
>
>> +
>> +- Scrubbing may apply to memory that isn't online at all yet.Likely this
>Space after period.
>
>> +  is setting system wide defaults on boot.
>
>is a system wide default setting on boot.

Will update.
>
>> +
>> +- Scrubbing at higher rate because software has decided that we want
>> +  more reliability for particular data, calling this Differentiated
>> +  Reliability.  That data sits in a region which may cover part of multiple
>> +  devices. The region interfaces are about supporting this use case.
>
>Please consider:
>Scrubbing at a higher rate because the monitor software has determined that
>more reliability is necessary for a particular data set. This is called
>Differentiated Reliability.
Will update.
>
>The last sentence is not needed. When describing region scrubbing in 1.2, the
>third use case can be referred to.

Will do.
>
>> +
>> +1.1. Device based scrubbing
>> +
>> +CXL memory is exposed to memory management subsystem and ultimately
>> +userspace via CXL devices.
>> +
>> +When combining control via the device interfaces and region interfaces see
>> +1.2 Region bases scrubbing.
>
>"see section 1.2 ..."
Ok.
>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-scrub`
>> +
>> +1.2. Region based scrubbing
>> +
>> +CXL memory is exposed to memory management subsystem and ultimately
>> +userspace via CXL regions. CXL Regions represent mapped memory
>> +capacity in system physical address space. These can incorporate one
>> +or more parts of multiple CXL memory devices with traffic interleaved
>> +across them. The user may want to control the scrub rate via this
>> +more abstract region instead of having to figure out the constituent
>> +devices and program them separately. The scrub rate for each device
>> +covers the whole device. Thus if multiple regions use parts of that
>> +device then requests for scrubbing of other regions may result in a higher scrub rate than requested for this specific region.
>> +
>> +Userspace must follow below set of rules on how to set the scrub
>> +rates for any mixture of requirements.
>> +
>> +1. Taking each region in turn from lowest desired scrub rate to highest and set
>> +   their scrub rates. Later regions may override the scrub rate on individual
>> +   devices (and hence potentially whole regions).
>> +
>> +2. Take each device for which enhanced scrubbing is required (higher rate) and
>> +   set those scrub rates. This will override the scrub rates of individual devices
>> +   leaving any that are not specifically set to scrub at the maximum rate required
>> +   for any of the regions they are involved in backing.
>
>I'm having trouble understanding what the second part of this sentence is
>attempting to convey.
Will rephrase the sentence.
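
For illustration only, the ordering in the two rules quoted above can be
modelled as a small simulation. This is a minimal sketch, not kernel code and
not the sysfs interface: plan_scrub_rates(), the region and device names, and
the unitless rate values are all hypothetical. It assumes that setting a
region's rate programs every device backing that region, and that per-device
overrides are only used to raise a rate.

```python
def plan_scrub_rates(region_rates, region_devices, device_overrides=None):
    """Return the final per-device scrub rate after applying both rules.

    region_rates:     {region: desired rate} (hypothetical, unitless)
    region_devices:   {region: [devices backing that region]}
    device_overrides: {device: enhanced rate}, assumed higher than any
                      region rate already applied to that device
    """
    device_rate = {}

    # Rule 1: walk the regions from lowest to highest desired rate. A later
    # (higher-rate) region overwrites the rate on any device it shares with
    # an earlier region, so each device ends at the maximum rate required.
    for region in sorted(region_rates, key=region_rates.get):
        for dev in region_devices[region]:
            device_rate[dev] = region_rates[region]

    # Rule 2: apply the per-device enhanced rates last so they win.
    for dev, rate in (device_overrides or {}).items():
        device_rate[dev] = rate

    return device_rate


# Hypothetical layout: mem1 backs both regions.
region_rates = {"region0": 1, "region1": 4}
region_devices = {"region0": ["mem0", "mem1"], "region1": ["mem1", "mem2"]}
device_overrides = {"mem0": 8}

final = plan_scrub_rates(region_rates, region_devices, device_overrides)
print(final)  # {'mem0': 8, 'mem1': 4, 'mem2': 4}
```

Here mem1 ends up at rate 4 even though region0 only asked for 1, which is the
"higher scrub rate than requested" effect described in 1.2.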

>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-scrub`
>> +
>> +2. CXL memory Error Check Scrub (ECS)
>> +
>> +The Error Check Scrub (ECS) feature enables a memory device to
>> +perform error checking and correction (ECC) and count single-bit
>> +errors. The associated memory controller triggers the ECS mode with a
>> +trigger sent to the memory device. However, CXL ECS control allows
>> +the user to change the attributes for error count mode and threshold
>> +for reporting errors and reset the ECS
>
>CXL ECS control allows the user to change the attributes for error count mode,
>the threshold for reporting errors, and reset the ECS counter.
>
>I think that's where the commas should go to make the sentence clearer.

Will correct.
>
>> +counter only. Thus, the scope of start Error Check Scrub on a memory
>> +device lies within a memory controller or platform when it is
>> +detecting unexpectedly high errors. Userspace allows to control the
>> +error count mode, threshold number of errors for a segment count
>> +indicating a number of segments having at least a threshold number of errors and reset the ECS counter.
>
>Need a comma before 'and', although the middle part is excessively long and
>hard to digest. Please consider rephrasing.
Sure.
>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-ecs`
>

Thanks,
Shiju



