Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver

Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx> · Thu, 15 May 2025 18:49:03 -0700

On 5/14/2025 4:31 AM, Shiju Jose wrote:
>> -----Original Message-----
>> From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx>
>> Sent: 14 May 2025 03:55
>> To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; linux-
>> acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx
>> Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx;
>> leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx;
>> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; linux-mm@xxxxxxxxx;
>> Linuxarm <linuxarm@xxxxxxxxxx>; rientjes@xxxxxxxxxx;
>> jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
>> naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>> somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
>> duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
>> wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
>> wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
>> <tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
>> Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
>> wanghuiqiang <wanghuiqiang@xxxxxxxxxx>
>> Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>>
>>> +static int ras2_report_cap_error(u32 cap_status) {
>>> +	switch (cap_status) {
>>> +	case ACPI_RAS2_NOT_VALID:
>>> +	case ACPI_RAS2_NOT_SUPPORTED:
>>> +		return -EPERM;
>>> +	case ACPI_RAS2_BUSY:
>>> +		return -EBUSY;
>>> +	case ACPI_RAS2_FAILED:
>>> +	case ACPI_RAS2_ABORTED:
>>> +	case ACPI_RAS2_INVALID_DATA:
>>> +		return -EINVAL;
>>> +	default: /* 0 or other, Success */
>>> +		return 0;
>>> +	}
>>> +}
>>> +
>>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace
>>> +*pcc_subspace) {
>>> +	struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace-
>>> comm_addr;
>>> +	u32 cap_status;
>>> +	u16 status;
>>> +	u32 rc;
>>> +
>>> +	/*
>>> +	 * As per ACPI spec, the PCC space will be initialized by
>>> +	 * platform and should have set the command completion bit when
>>> +	 * PCC can be used by OSPM.
>>> +	 *
>>> +	 * Poll PCC status register every 3us(delay_us) for maximum of
>>> +	 * deadline_us(timeout_us) until PCC command complete bit is
>> set(cond).
>>> +	 */
>>> +	rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status,
>>> +					status &
>> PCC_STATUS_CMD_COMPLETE, 3,
>>> +					pcc_subspace->deadline_us);
>>> +	if (rc) {
>>> +		pr_warn("PCC check channel failed for : %d rc=%d\n",
>>> +			pcc_subspace->pcc_id, rc);
>>> +		return rc;
>>> +	}
>>> +
>>> +	if (status & PCC_STATUS_ERROR) {
>>> +		cap_status = readw_relaxed(&gen_comm_base-
>>> set_caps_status);
>>> +		rc = ras2_report_cap_error(cap_status);
>>> +
>>> +		status &= ~PCC_STATUS_ERROR;
>>> +		writew_relaxed(status, &gen_comm_base->status);
>>> +		return rc;
>>> +	}
>>> +
>>> +	if (status & PCC_STATUS_CMD_COMPLETE)
>>> +		return 0;
>>> +
>>> +	return -EIO;
>>> +}
>>
>> We still have an outstanding problem. This may sound familiar.
>>
>> If a user specifies an invalid address, our firmware will set an error code in the
>> set_caps_status field of the acpi_ras2_shmem structure. In our case, the error
>> code is ACPI_RAS2_INVALID_DATA, and the user will observe an EINVAL. This is
>> expected.
>>
>> However, if the user then subsequently attempts to write a VALID address,
>> ras2_get_patrol_scrub_running will indirectly call ras2_check_pcc_chan using
>> the previously INVALID address to determine if the scrubber is still running.
>> Unfortunately, the INVALID address causes ras2_get_patrol_scrub_running to
>> fail, therefore preventing the user from specifying a VALID address after
>> specifying an INVALID address.
>>
>> The only way to move forward from this inescapable condition is to reboot the
>> system.
>>
>> Here is a demo of the problem as I roughly see it on our system (I've labeled the
>> line numbers for sake of discussion):
>> 1  [root@myhost scrub0]# echo 0x100000000 > size
>> 2  [root@myhost scrub0]# echo 0x1f00000000 > addr
>> 3  [root@myhost scrub0]# echo 0xcf00000000 > addr
>> 4  write error: Invalid argument
>> 5  [  214.446338] PCCT PCCT: Failed to start demand scrubbing
>> 6  [root@myhost scrub0]# echo 0x1f00000000 > addr
>> 7  write error: Invalid argument
>> 8  [  242.263909] PCCT PCCT: failed to read parameters
>> 9  [root@myhost scrub0]# echo 0x100000000 > size
>> 10 write error: Invalid argument
>> 11 [  246.190196] PCCT PCCT: failed to read parameters
>>
>> The upper most memory address on this system is 0xbf00000000. Line 1 and 2
>> use valid values, and line 2 produces the expected results. On line 3, I've
>> specified an INVALID address (outside of valid range). The error on line 5 is
>> expected after executing the START_PATROL_SCRUBBER command with an
>> INVALID address.
>>
>> Line 6 show how I attempt to specify a VALID address. Unfortunately,
>> ras2_get_patrol_scrub_running encounters and error after executing
>> GET_PATROL_PARAMETERS because it used the OLD INVALID values in ps_sm-
>>> params.req_addr_range. Line 7 and 8 are the result. Since the flow of
>> execution if aborted at this point, you can never rectify the situation and insert a
>> valid value into ps_sm->params.req_addr_range, unless you reboot the system.
>>
>> One half baked solution to this problem, is to modify
>> ras2_get_patrol_scrub_running so that if there is a non-zero address or size
>> specified, AND the last error code we received was INVALID DATA, then assume
>> the scrubber is NOT running.
> Hi Daniel,
> 
> Thanks for reporting the issue.
> Can you check whether following change fix the issue in your test setup?
> =============================================
> diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c
> index 4d9cfd3bdf45..ff4aa1b75860 100644
> --- a/drivers/ras/acpi_ras2.c
> +++ b/drivers/ras/acpi_ras2.c
> @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base
>         ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
>         if (ret) {
>                 dev_err(ras2_ctx->dev, "Failed to start demand scrubbing\n");
> +               if (ret == -EPERM || ret == -EINVAL) {
> +                       ps_sm->params.req_addr_range[0] = 0;
> +                       ps_sm->params.req_addr_range[1] = 0;
> +                       ras2_ctx->base = 0;
> +                       ras2_ctx->size = 0;
> +                       ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
> +               }
>                 return ret;
>         }
>         ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE;
> ==============================================
> Thanks,
> Shiju 

We're happy! with this fix.

For this to work, we had to no-op the START_PATROL_SCRUBBER and
GET_PATROL_PARAMETERS commands when base and size are equal to zero.
Previously, we returned INVALID DATA when base and size were zero.

Maybe we should amend the ACPI spec with this special knowledge.

Anyways; as of now, with this fix, this driver will work out of the box on our
systems.

Thanks a lot,
~Daniel
> 
>>
>> Regards,
>> ~Daniel