>-----Original Message----- >From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx> >Sent: 25 July 2025 17:36 >To: Shiju Jose <shiju.jose@xxxxxxxxxx>; rafael@xxxxxxxxxx; linux- >edac@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx; >bp@xxxxxxxxx; tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx; leo.duran@xxxxxxx; >Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx >Cc: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; linux- >mm@xxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; rientjes@xxxxxxxxxx; >jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; >naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; >duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx; >wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx; >wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei ><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto >Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; >wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; >vanshikonda@xxxxxxxxxxxxxxxxxxxxxx >Subject: Re: [PATCH v9 2/2] ras: mem: Add memory ACPI RAS2 driver > > >> +static int ras2_hw_scrub_write_addr(struct device *dev, void >> +*drv_data, u64 base) { >> + struct ras2_mem_ctx *ras2_ctx = drv_data; >> + struct acpi_ras2_ps_shared_mem __iomem *ps_sm = >> + TO_ACPI_RAS2_PS_SHMEM(ras2_ctx->comm_addr); >> + bool running; >> + int ret; >> + >> + if (ras2_ctx->bg_scrub) >> + return -EBUSY; >> + >> + guard(mutex)(ras2_ctx->pcc_lock); >> + ps_sm->common.set_caps[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB; >> + >> + if (!ras2_ctx->size) { >> + dev_warn(ras2_ctx->dev, >> + "%s: Invalid address range, base=0x%llx size=0x%llx\n", >> + __func__, base, ras2_ctx->size); >> + return -ERANGE; >> + } >> + >> + ret = ras2_get_patrol_scrub_running(ras2_ctx, &running); >> + if (ret) >> + return ret; >> + >> + if (running) >> + return -EBUSY; >> + >> + ras2_ctx->base = base; >> + ps_sm->params.scrub_params_in &= ~RAS2_PS_SC_HRS_IN_MASK; >> + ps_sm->params.scrub_params_in |= >FIELD_PREP(RAS2_PS_SC_HRS_IN_MASK, >> + ras2_ctx->scrub_cycle_hrs); >> + ps_sm->params.req_addr_range[0] = ras2_ctx->base; >> + ps_sm->params.req_addr_range[1] = ras2_ctx->size; >> + ps_sm->params.scrub_params_in &= ~RAS2_PS_EN_BACKGROUND; >> + ps_sm->params.command = RAS2_START_PATROL_SCRUBBER; >> + >> + ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2); >> + if (ret) { >> + dev_err(ras2_ctx->dev, "Failed to start demand scrubbing >rc(%d)\n", ret); >> + if (ret != -EBUSY) { >> + ps_sm->params.req_addr_range[0] = 0; >> + ps_sm->params.req_addr_range[1] = 0; >> + ras2_ctx->base = 0; >> + ras2_ctx->size = 0; >> + ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE; >> + } >> + return ret; >> + } >> + ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE; >> + >> + return ras2_update_patrol_scrub_params_cache(ras2_ctx); >> +} > >After a lot more discussion within Ampere, I've come to believe that using >base==0 && size==0 as a special case, when invoking >GET_PATROL_PARAMETERS after an error, is not a good idea. > >There are three problems with the current solution. > >1) This is undefined behavior and will likely behave differently on different >implementations. For example, 0 is an invalid physical address of the Ampere >systems, could be a valid address on some systems, or could be device memory >on another. In case it is a valid DDR address, the firmware can fill in the Scrub >Parameters, Extended Data Region (ACPI 6.6) with proper information. But in >case it is an invalid address, or device memory, the firmware has to indicate an >error to prevent the OS from consuming bad data reported through these output >parameters. > >2) Pretend you are using acpi_ras2_mem0/scrub1, which corresponds to the 2nd >NUMA node in the system. If we use the current OS driver implementation as-is >and zero out base and size after an error, then when we issue the next >GET_PATROL_PARAMETERS we will have retrieved the parameters >corresponding to the wrong NUMA node. The driver will have fetched the >parameters for NUMA node 1, even though we are interacting with the scrubber >corresponding to NUMA node 2. >This could be problematic in a situation where patrol parameters differ across >different NUMA nodes. Additionally, and perhaps most importantly, using the >GET_PATROL_PARAMETERS to determine if a Scrubber is running require the >correct address range. The current implementation is not getting the running >status of the correct range. > >3) This is a special case of #2. During driver load, ras2_probe issues a >GET_PATROL_PARAMETERS using a base and size equal to 0. The base and size >are zero because requested_address_range is allocated with kzalloc. In this case, >we may not be getting the initial values from the correct range. In the probe stage, while issue GET_PATROL_PARAMETERS, driver has no information about the 'requested_address_range' supposed to set other than set to 0 because RAS2 ('RAS2 Platform Communication Channel Shared Memory Region' table?) does not define fields for the platform to advertise the supported full address range for a scrubber. > >Proposed Solution: >What we propose, is to instead of zeroing out the base and size after an error, >use the full range of the current NUMA node. We believe that a superset of a >currently active scrub range can properly report all the relevant and correct >information. >To be compliant with the specification, FW should set "Flags" field if there is any >on-demand scrub in progress on any memory range in the NUMA node. Again, >this solution assumes that the driver does not allow more than one scrubber to >run within a single NUMA node. All three problems can be solved in the same >way. > >What do you think? Hi Daniel, Please check the v10 of the RAS2 series sent with changes for your requirements. and request please test in your system and share the feedback. https://lore.kernel.org/all/20250801172040.2175-1-shiju.jose@xxxxxxxxxx/ > >Regards, >~Daniel Thanks, Shiju