>-----Original Message----- >From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx> >Sent: 16 May 2025 02:49 >To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; linux- >acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx >Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx; >leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx; >Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; linux-mm@xxxxxxxxx; >Linuxarm <linuxarm@xxxxxxxxxx>; rientjes@xxxxxxxxxx; >jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; >naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; >duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx; >wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx; >wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei ><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto >Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; >wanghuiqiang <wanghuiqiang@xxxxxxxxxx> >Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver > > > >On 5/14/2025 4:31 AM, Shiju Jose wrote: >>> -----Original Message----- >>> From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx> >>> Sent: 14 May 2025 03:55 >>> To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; >>> linux- acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx >>> Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx; >>> lenb@xxxxxxxxxx; leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx; >>> mchehab@xxxxxxxxxx; Jonathan Cameron ><jonathan.cameron@xxxxxxxxxx>; >>> linux-mm@xxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; >>> rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; >>> dave.hansen@xxxxxxxxxxxxxxx; naoya.horiguchi@xxxxxxx; >>> james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; >>> erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; duenwen@xxxxxxxxxx; >>> gthelen@xxxxxxxxxx; wschwartz@xxxxxxxxxxxxxxxxxxx; >>> dferguson@xxxxxxxxxxxxxxxxxxx; wbs@xxxxxxxxxxxxxxxxxxxxxx; >>> nifan.cxl@xxxxxxxxx; tanxiaofei <tanxiaofei@xxxxxxxxxx>; Zengtao (B) >>> <prime.zeng@xxxxxxxxxxxxx>; Roberto Sassu <roberto.sassu@xxxxxxxxxx>; >>> kangkang.shen@xxxxxxxxxxxxx; wanghuiqiang ><wanghuiqiang@xxxxxxxxxx> >>> Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver >>> >>>> +static int ras2_report_cap_error(u32 cap_status) { >>>> + switch (cap_status) { >>>> + case ACPI_RAS2_NOT_VALID: >>>> + case ACPI_RAS2_NOT_SUPPORTED: >>>> + return -EPERM; >>>> + case ACPI_RAS2_BUSY: >>>> + return -EBUSY; >>>> + case ACPI_RAS2_FAILED: >>>> + case ACPI_RAS2_ABORTED: >>>> + case ACPI_RAS2_INVALID_DATA: >>>> + return -EINVAL; >>>> + default: /* 0 or other, Success */ >>>> + return 0; >>>> + } >>>> +} >>>> + >>>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace >>>> +*pcc_subspace) { >>>> + struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace- >>>> comm_addr; >>>> + u32 cap_status; >>>> + u16 status; >>>> + u32 rc; >>>> + >>>> + /* >>>> + * As per ACPI spec, the PCC space will be initialized by >>>> + * platform and should have set the command completion bit when >>>> + * PCC can be used by OSPM. >>>> + * >>>> + * Poll PCC status register every 3us(delay_us) for maximum of >>>> + * deadline_us(timeout_us) until PCC command complete bit is >>> set(cond). >>>> + */ >>>> + rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status, >>>> + status & >>> PCC_STATUS_CMD_COMPLETE, 3, >>>> + pcc_subspace->deadline_us); >>>> + if (rc) { >>>> + pr_warn("PCC check channel failed for : %d rc=%d\n", >>>> + pcc_subspace->pcc_id, rc); >>>> + return rc; >>>> + } >>>> + >>>> + if (status & PCC_STATUS_ERROR) { >>>> + cap_status = readw_relaxed(&gen_comm_base- >>>> set_caps_status); >>>> + rc = ras2_report_cap_error(cap_status); >>>> + >>>> + status &= ~PCC_STATUS_ERROR; >>>> + writew_relaxed(status, &gen_comm_base->status); >>>> + return rc; >>>> + } >>>> + >>>> + if (status & PCC_STATUS_CMD_COMPLETE) >>>> + return 0; >>>> + >>>> + return -EIO; >>>> +} >>> >>> We still have an outstanding problem. This may sound familiar. >>> >>> If a user specifies an invalid address, our firmware will set an >>> error code in the set_caps_status field of the acpi_ras2_shmem >>> structure. In our case, the error code is ACPI_RAS2_INVALID_DATA, and >>> the user will observe an EINVAL. This is expected. >>> >>> However, if the user then subsequently attempts to write a VALID >>> address, ras2_get_patrol_scrub_running will indirectly call >>> ras2_check_pcc_chan using the previously INVALID address to determine if >the scrubber is still running. >>> Unfortunately, the INVALID address causes >>> ras2_get_patrol_scrub_running to fail, therefore preventing the user >>> from specifying a VALID address after specifying an INVALID address. >>> >>> The only way to move forward from this inescapable condition is to >>> reboot the system. >>> >>> Here is a demo of the problem as I roughly see it on our system (I've >>> labeled the line numbers for sake of discussion): >>> 1 [root@myhost scrub0]# echo 0x100000000 > size >>> 2 [root@myhost scrub0]# echo 0x1f00000000 > addr >>> 3 [root@myhost scrub0]# echo 0xcf00000000 > addr >>> 4 write error: Invalid argument >>> 5 [ 214.446338] PCCT PCCT: Failed to start demand scrubbing >>> 6 [root@myhost scrub0]# echo 0x1f00000000 > addr >>> 7 write error: Invalid argument >>> 8 [ 242.263909] PCCT PCCT: failed to read parameters >>> 9 [root@myhost scrub0]# echo 0x100000000 > size >>> 10 write error: Invalid argument >>> 11 [ 246.190196] PCCT PCCT: failed to read parameters >>> >>> The upper most memory address on this system is 0xbf00000000. Line 1 >>> and 2 use valid values, and line 2 produces the expected results. On >>> line 3, I've specified an INVALID address (outside of valid range). >>> The error on line 5 is expected after executing the >>> START_PATROL_SCRUBBER command with an INVALID address. >>> >>> Line 6 show how I attempt to specify a VALID address. Unfortunately, >>> ras2_get_patrol_scrub_running encounters and error after executing >>> GET_PATROL_PARAMETERS because it used the OLD INVALID values in >>> ps_sm- >>>> params.req_addr_range. Line 7 and 8 are the result. Since the flow >>>> of >>> execution if aborted at this point, you can never rectify the >>> situation and insert a valid value into ps_sm->params.req_addr_range, unless >you reboot the system. >>> >>> One half baked solution to this problem, is to modify >>> ras2_get_patrol_scrub_running so that if there is a non-zero address >>> or size specified, AND the last error code we received was INVALID >>> DATA, then assume the scrubber is NOT running. >> Hi Daniel, >> >> Thanks for reporting the issue. >> Can you check whether following change fix the issue in your test setup? >> ============================================= >> diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c index >> 4d9cfd3bdf45..ff4aa1b75860 100644 >> --- a/drivers/ras/acpi_ras2.c >> +++ b/drivers/ras/acpi_ras2.c >> @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device >*dev, void *drv_data, u64 base >> ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2); >> if (ret) { >> dev_err(ras2_ctx->dev, "Failed to start demand >> scrubbing\n"); >> + if (ret == -EPERM || ret == -EINVAL) { >> + ps_sm->params.req_addr_range[0] = 0; >> + ps_sm->params.req_addr_range[1] = 0; >> + ras2_ctx->base = 0; >> + ras2_ctx->size = 0; >> + ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE; >> + } >> return ret; >> } >> ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE; >> ============================================== >> Thanks, >> Shiju > >We're happy! with this fix. > >For this to work, we had to no-op the START_PATROL_SCRUBBER and >GET_PATROL_PARAMETERS commands when base and size are equal to zero. >Previously, we returned INVALID DATA when base and size were zero. Thanks Daniel for verifying the changes. For demand scrubbing, kernel does not send START_PATROL_SCRUBBER command when size is zero. However GET_PATROL_PARAMETERS command from ras2_update_patrol_scrub_params_cache() is valid case when base and size are equal to zero. > >Maybe we should amend the ACPI spec with this special knowledge. > >Anyways; as of now, with this fix, this driver will work out of the box on our >systems. Please tag V6 sent with fix. https://lore.kernel.org/all/20250516132205.789-1-shiju.jose@xxxxxxxxxx/ > >Thanks a lot, >~Daniel >> >>> >>> Regards, >>> ~Daniel > Thanks, Shiju