>-----Original Message----- >From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx> >Sent: 14 May 2025 03:55 >To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; linux- >acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx >Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx; >leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx; >Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; linux-mm@xxxxxxxxx; >Linuxarm <linuxarm@xxxxxxxxxx>; rientjes@xxxxxxxxxx; >jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; >naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; >duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx; >wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx; >wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei ><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto >Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; >wanghuiqiang <wanghuiqiang@xxxxxxxxxx> >Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver > >> +static int ras2_report_cap_error(u32 cap_status) { >> + switch (cap_status) { >> + case ACPI_RAS2_NOT_VALID: >> + case ACPI_RAS2_NOT_SUPPORTED: >> + return -EPERM; >> + case ACPI_RAS2_BUSY: >> + return -EBUSY; >> + case ACPI_RAS2_FAILED: >> + case ACPI_RAS2_ABORTED: >> + case ACPI_RAS2_INVALID_DATA: >> + return -EINVAL; >> + default: /* 0 or other, Success */ >> + return 0; >> + } >> +} >> + >> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace >> +*pcc_subspace) { >> + struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace- >>comm_addr; >> + u32 cap_status; >> + u16 status; >> + u32 rc; >> + >> + /* >> + * As per ACPI spec, the PCC space will be initialized by >> + * platform and should have set the command completion bit when >> + * PCC can be used by OSPM. >> + * >> + * Poll PCC status register every 3us(delay_us) for maximum of >> + * deadline_us(timeout_us) until PCC command complete bit is >set(cond). >> + */ >> + rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status, >> + status & >PCC_STATUS_CMD_COMPLETE, 3, >> + pcc_subspace->deadline_us); >> + if (rc) { >> + pr_warn("PCC check channel failed for : %d rc=%d\n", >> + pcc_subspace->pcc_id, rc); >> + return rc; >> + } >> + >> + if (status & PCC_STATUS_ERROR) { >> + cap_status = readw_relaxed(&gen_comm_base- >>set_caps_status); >> + rc = ras2_report_cap_error(cap_status); >> + >> + status &= ~PCC_STATUS_ERROR; >> + writew_relaxed(status, &gen_comm_base->status); >> + return rc; >> + } >> + >> + if (status & PCC_STATUS_CMD_COMPLETE) >> + return 0; >> + >> + return -EIO; >> +} > >We still have an outstanding problem. This may sound familiar. > >If a user specifies an invalid address, our firmware will set an error code in the >set_caps_status field of the acpi_ras2_shmem structure. In our case, the error >code is ACPI_RAS2_INVALID_DATA, and the user will observe an EINVAL. This is >expected. > >However, if the user then subsequently attempts to write a VALID address, >ras2_get_patrol_scrub_running will indirectly call ras2_check_pcc_chan using >the previously INVALID address to determine if the scrubber is still running. >Unfortunately, the INVALID address causes ras2_get_patrol_scrub_running to >fail, therefore preventing the user from specifying a VALID address after >specifying an INVALID address. > >The only way to move forward from this inescapable condition is to reboot the >system. > >Here is a demo of the problem as I roughly see it on our system (I've labeled the >line numbers for sake of discussion): >1 [root@myhost scrub0]# echo 0x100000000 > size >2 [root@myhost scrub0]# echo 0x1f00000000 > addr >3 [root@myhost scrub0]# echo 0xcf00000000 > addr >4 write error: Invalid argument >5 [ 214.446338] PCCT PCCT: Failed to start demand scrubbing >6 [root@myhost scrub0]# echo 0x1f00000000 > addr >7 write error: Invalid argument >8 [ 242.263909] PCCT PCCT: failed to read parameters >9 [root@myhost scrub0]# echo 0x100000000 > size >10 write error: Invalid argument >11 [ 246.190196] PCCT PCCT: failed to read parameters > >The upper most memory address on this system is 0xbf00000000. Line 1 and 2 >use valid values, and line 2 produces the expected results. On line 3, I've >specified an INVALID address (outside of valid range). The error on line 5 is >expected after executing the START_PATROL_SCRUBBER command with an >INVALID address. > >Line 6 show how I attempt to specify a VALID address. Unfortunately, >ras2_get_patrol_scrub_running encounters and error after executing >GET_PATROL_PARAMETERS because it used the OLD INVALID values in ps_sm- >>params.req_addr_range. Line 7 and 8 are the result. Since the flow of >execution if aborted at this point, you can never rectify the situation and insert a >valid value into ps_sm->params.req_addr_range, unless you reboot the system. > >One half baked solution to this problem, is to modify >ras2_get_patrol_scrub_running so that if there is a non-zero address or size >specified, AND the last error code we received was INVALID DATA, then assume >the scrubber is NOT running. Hi Daniel, Thanks for reporting the issue. Can you check whether following change fix the issue in your test setup? ============================================= diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c index 4d9cfd3bdf45..ff4aa1b75860 100644 --- a/drivers/ras/acpi_ras2.c +++ b/drivers/ras/acpi_ras2.c @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2); if (ret) { dev_err(ras2_ctx->dev, "Failed to start demand scrubbing\n"); + if (ret == -EPERM || ret == -EINVAL) { + ps_sm->params.req_addr_range[0] = 0; + ps_sm->params.req_addr_range[1] = 0; + ras2_ctx->base = 0; + ras2_ctx->size = 0; + ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE; + } return ret; } ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE; ============================================== Thanks, Shiju > >Regards, >~Daniel