Re: [PATCH v3 1/8] EDAC: Update documentation for the CXL memory patrol scrub control feature

On 4/7/25 10:49 AM, shiju.jose@xxxxxxxxxx wrote:
> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> 
> Update the Documentation/edac/scrub.rst to include usecases and
> policies for CXL memory device-based, CXL region-based patrol scrub
> control and CXL Error Check Scrub (ECS).
> 
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> ---
>  Documentation/edac/scrub.rst | 75 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
> 
> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> index daab929cdba1..6132853a02fe 100644
> --- a/Documentation/edac/scrub.rst
> +++ b/Documentation/edac/scrub.rst
> @@ -264,3 +264,78 @@ Sysfs files are documented in
>  `Documentation/ABI/testing/sysfs-edac-scrub`
>  
>  `Documentation/ABI/testing/sysfs-edac-ecs`
> +
> +Examples
> +--------
> +
> +The usage takes the form shown in these examples:
> +
> +1. CXL memory Patrol Scrub
> +
> +The following are the usecases identified why we might increase the scrub rate.
> +
> +- Scrubbing is needed at device granularity because a device is showing
> +  unexpectedly high errors, the scrub control needs to be at device
> +  granularity

Not sure what the second part of the sentence has to do with defining the use case.
When the per device control is detailed in 1.1, you can refer to the first use case.

> +
> +- Scrubbing may apply to memory that isn't online at all yet.Likely this
Missing space after the period.

> +  is setting system wide defaults on boot.

is a system wide default setting on boot.

> +
> +- Scrubbing at higher rate because software has decided that we want
> +  more reliability for particular data, calling this Differentiated
> +  Reliability.  That data sits in a region which may cover part of multiple
> +  devices. The region interfaces are about supporting this use case.

Please consider:
Scrubbing at a higher rate because the monitor software has determined that
more reliability is necessary for a particular data set. This is called
Differentiated Reliability.

The last sentence is not needed. When describing region scrubbing in 1.2, the third use
case can be referred to.

> +
> +1.1. Device based scrubbing
> +
> +CXL memory is exposed to memory management subsystem and ultimately userspace
> +via CXL devices.
> +
> +When combining control via the device interfaces and region interfaces see
> +1.2 Region bases scrubbing.

"see section 1.2 ..."

> +
> +Sysfs files for scrubbing are documented in
> +`Documentation/ABI/testing/sysfs-edac-scrub`
> +
> +1.2. Region based scrubbing
> +
> +CXL memory is exposed to memory management subsystem and ultimately userspace
> +via CXL regions. CXL Regions represent mapped memory capacity in system
> +physical address space. These can incorporate one or more parts of multiple CXL
> +memory devices with traffic interleaved across them. The user may want to control
> +the scrub rate via this more abstract region instead of having to figure out the
> +constituent devices and program them separately. The scrub rate for each device
> +covers the whole device. Thus if multiple regions use parts of that device then
> +requests for scrubbing of other regions may result in a higher scrub rate than
> +requested for this specific region.
> +
> +Userspace must follow below set of rules on how to set the scrub rates for any
> +mixture of requirements.
> +
> +1. Taking each region in turn from lowest desired scrub rate to highest and set
> +   their scrub rates. Later regions may override the scrub rate on individual
> +   devices (and hence potentially whole regions).
> +
> +2. Take each device for which enhanced scrubbing is required (higher rate) and
> +   set those scrub rates. This will override the scrub rates of individual devices

> +   leaving any that are not specifically set to scrub at the maximum rate required
> +   for any of the regions they are involved in backing.

I'm having trouble understanding what the second part of this sentence is attempting to convey.
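
If I am reading the two rules correctly, the intent may be something like the
sketch below. This is only my interpretation, not anything taken from the
patch: the function name, the region/device names, and the unitless rate
numbers are all hypothetical, and the model assumes that setting a region's
rate simply programs that rate on every device backing the region.

```python
def plan_scrub_rates(region_devices, region_rates, device_overrides=None):
    """Return the per-device scrub rate after applying rule 1, then rule 2.

    All names and rate values here are illustrative assumptions.
    """
    device_rate = {}

    # Rule 1: program regions from lowest to highest desired rate. A later
    # (higher-rate) region overrides shared devices, so each device ends up
    # at the maximum rate of any region it backs.
    for region in sorted(region_rates, key=region_rates.get):
        for dev in region_devices[region]:
            device_rate[dev] = region_rates[region]

    # Rule 2: apply explicit per-device rates (assumed to be higher than
    # anything set in rule 1). Devices not listed keep their rule 1 value.
    for dev, rate in (device_overrides or {}).items():
        device_rate[dev] = rate

    return device_rate


# Hypothetical layout: mem1 is shared by both regions.
region_devices = {"region0": ["mem0", "mem1"], "region1": ["mem1", "mem2"]}
region_rates = {"region0": 2, "region1": 5}
result = plan_scrub_rates(region_devices, region_rates, {"mem0": 8})
print(result)
```

In this model mem1 ends up at region1's rate even though region0 asked for
less, which I assume is what "the maximum rate required for any of the regions
they are involved in backing" is trying to say. If so, stating it that directly
would help.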

> +
> +Sysfs files for scrubbing are documented in
> +`Documentation/ABI/testing/sysfs-edac-scrub`
> +
> +2. CXL memory Error Check Scrub (ECS)
> +
> +The Error Check Scrub (ECS) feature enables a memory device to perform error
> +checking and correction (ECC) and count single-bit errors. The associated
> +memory controller triggers the ECS mode with a trigger sent to the memory
> +device. However, CXL ECS control allows the user to change the attributes
> +for error count mode and threshold for reporting errors and reset the ECS

CXL ECS control allows the user to change the attributes for error count mode,
the threshold for reporting errors, and reset the ECS counter.

I think that's where the commas should go to make the sentence clearer.

> +counter only. Thus, the scope of start Error Check Scrub on a memory device
> +lies within a memory controller or platform when it is detecting unexpectedly
> +high errors. Userspace allows to control the error count mode, threshold
> +number of errors for a segment count indicating a number of segments
> +having at least a threshold number of errors and reset the ECS counter.

Need a comma before 'and'. The middle part is also excessively long and hard to digest.
Please consider rephrasing.

> +
> +Sysfs files for scrubbing are documented in
> +`Documentation/ABI/testing/sysfs-edac-ecs`




