>-----Original Message-----
>From: Dave Jiang <dave.jiang@xxxxxxxxx>
>Sent: 28 April 2025 18:45
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-cxl@xxxxxxxxxxxxxxx;
>dan.j.williams@xxxxxxxxx; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>; dave@xxxxxxxxxxxx;
>alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx; bp@xxxxxxxxx;
>tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx; leo.duran@xxxxxxx;
>Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx; nifan.cxl@xxxxxxxxx;
>Linuxarm <linuxarm@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>;
>Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto Sassu
><roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; wanghuiqiang
><wanghuiqiang@xxxxxxxxxx>
>Subject: Re: [PATCH v3 1/8] EDAC: Update documentation for the CXL memory
>patrol scrub control feature
>
>
>
>On 4/7/25 10:49 AM, shiju.jose@xxxxxxxxxx wrote:
>> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
>>
>> Update the Documentation/edac/scrub.rst to include usecases and
>> policies for CXL memory device-based, CXL region-based patrol scrub
>> control and CXL Error Check Scrub (ECS).
>>
>> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> ---
>> Documentation/edac/scrub.rst | 75
>> ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 75 insertions(+)
>>
>> diff --git a/Documentation/edac/scrub.rst
>> b/Documentation/edac/scrub.rst index daab929cdba1..6132853a02fe 100644
>> --- a/Documentation/edac/scrub.rst
>> +++ b/Documentation/edac/scrub.rst
>> @@ -264,3 +264,78 @@ Sysfs files are documented in
>> `Documentation/ABI/testing/sysfs-edac-scrub`
>>
>> `Documentation/ABI/testing/sysfs-edac-ecs`
>> +
>> +Examples
>> +--------
>> +
>> +The usage takes the form shown in these examples:
>> +
>> +1. CXL memory Patrol Scrub
>> +
>> +The following are the usecases identified why we might increase the scrub
>rate.
>> +
>> +- Scrubbing is needed at device granularity because a device is
>> +showing
>> + unexpectedly high errors, the scrub control needs to be at device
>> + granularity
>
>Not sure what the second part of the sentence has to do with defining the use
>case.
>When the per device control is detailed in 1.1, you can refer to the first use case.

Hi Dave,

Thanks for the comments.
Sure. I will correct.

>
>> +
>> +- Scrubbing may apply to memory that isn't online at all yet.Likely
>> +this
>space after period
>
>> + is setting system wide defaults on boot.
>
>is a system wide default setting on boot.

Will update.

>
>> +
>> +- Scrubbing at higher rate because software has decided that we want
>> + more reliability for particular data, calling this Differentiated
>> + Reliability. That data sits in a region which may cover part of
>> +multiple
>> + devices. The region interfaces are about supporting this use case.
>
>Please consider:
>Scrubbing at a higher rate because the monitor software has determined that
>more reliability is necessary for a particular data set. This is called
>Differentiated Reliability.

Will update.

>
>The last sentence is not needed. When describing region scrubbing in 1.2, the
>third use case can be referred to.

Will do.

>
>> +
>> +1.1. Device based scrubbing
>> +
>> +CXL memory is exposed to memory management subsystem and ultimately
>> +userspace via CXL devices.
>> +
>> +When combining control via the device interfaces and region
>> +interfaces see
>> +1.2 Region bases scrubbing.
>
>"see section 1.2 ..."

Ok.

>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-scrub`
>> +
>> +1.2. Region based scrubbing
>> +
>> +CXL memory is exposed to memory management subsystem and ultimately
>> +userspace via CXL regions. CXL Regions represent mapped memory
>> +capacity in system physical address space.
>> +These can incorporate one
>> +or more parts of multiple CXL memory devices with traffic interleaved
>> +across them. The user may want to control the scrub rate via this
>> +more abstract region instead of having to figure out the constituent
>> +devices and program them separately. The scrub rate for each device
>> +covers the whole device. Thus if multiple regions use parts of that
>> +device then requests for scrubbing of other regions may result in a higher
>scrub rate than requested for this specific region.
>> +
>> +Userspace must follow below set of rules on how to set the scrub
>> +rates for any mixture of requirements.
>> +
>> +1. Taking each region in turn from lowest desired scrub rate to highest and
>set
>> + their scrub rates. Later regions may override the scrub rate on individual
>> + devices (and hence potentially whole regions).
>> +
>> +2. Take each device for which enhanced scrubbing is required (higher rate)
>and
>> + set those scrub rates. This will override the scrub rates of
>> +individual devices
>
>> + leaving any that are not specifically set to scrub at the maximum rate
>required
>> + for any of the regions they are involved in backing.
>
>I'm having trouble understanding what the second part of this sentence is
>attempting to convey.

Will rephrase the sentence.

>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-scrub`
>> +
>> +2. CXL memory Error Check Scrub (ECS)
>> +
>> +The Error Check Scrub (ECS) feature enables a memory device to
>> +perform error checking and correction (ECC) and count single-bit
>> +errors. The associated memory controller triggers the ECS mode with a
>> +trigger sent to the memory device.
>> +However, CXL ECS control allows
>> +the user to change the attributes for error count mode and threshold
>> +for reporting errors and reset the ECS
>
>CXL ECS control allows the user to change the attributes for error count mode,
>the threshold for reporting errors, and reset the ECS counter.
>
>I think that's where the commas should go to make the sentence clearer.

Will correct.

>
>> +counter only. Thus, the scope of start Error Check Scrub on a memory
>> +device lies within a memory controller or platform when it is
>> +detecting unexpectedly high errors. Userspace allows to control the
>> +error count mode, threshold number of errors for a segment count
>> +indicating a number of segments having at least a threshold number of errors
>and reset the ECS counter.
>
>Need a comma before 'and'. Although the middle part is excessively long and
>hard to digest.
>Please consider rephrasing.

Sure.

>
>> +
>> +Sysfs files for scrubbing are documented in
>> +`Documentation/ABI/testing/sysfs-edac-ecs`
>

Thanks,
Shiju
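
P.S. Since the wording of the two rules in the quoted section 1.2 was hard to follow, here is a rough sketch of how I read them. It is only a model and does not touch sysfs. The function name, the dictionary layout and the abstract "rate" unit (a larger number means more frequent scrubbing) are all assumptions made for illustration; the real interface may express this differently.

```python
def apply_scrub_policy(region_rates, region_devices, device_rates):
    """Model the per-device scrub rate left programmed after both rules.

    All names and the abstract "rate" unit are hypothetical.
    region_rates:   {region: desired rate}
    region_devices: {region: iterable of devices backing that region}
    device_rates:   {device: enhanced rate requested for that device}
    """
    programmed = {}

    # Rule 1: set regions from lowest desired rate to highest. A device rate
    # covers the whole device, so a later (higher) region setting overrides
    # an earlier one on any device the two regions share.
    for region in sorted(region_rates, key=region_rates.get):
        for dev in region_devices[region]:
            programmed[dev] = region_rates[region]

    # Rule 2: set the devices that need enhanced scrubbing last, overriding
    # whatever rule 1 left on them. Devices not listed here keep the maximum
    # rate required by the regions they back.
    for dev, rate in device_rates.items():
        programmed[dev] = rate

    return programmed


# Hypothetical layout: two regions sharing device "mem1".
rates = apply_scrub_policy(
    region_rates={"region0": 2, "region1": 5},
    region_devices={"region0": ["mem0", "mem1"], "region1": ["mem1", "mem2"]},
    device_rates={"mem0": 8},
)
print(rates)
```

In this made-up layout mem1 ends up at region1's rate, which is higher than region0 asked for. That is the side effect the paragraph before the rules describes.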