RE: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



>-----Original Message-----
>From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx>
>Sent: 16 May 2025 02:49
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; linux-
>acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx
>Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx; lenb@xxxxxxxxxx;
>leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx; mchehab@xxxxxxxxxx;
>Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; linux-mm@xxxxxxxxx;
>Linuxarm <linuxarm@xxxxxxxxxx>; rientjes@xxxxxxxxxx;
>jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
>naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
>duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
>wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
>wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
>Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
>wanghuiqiang <wanghuiqiang@xxxxxxxxxx>
>Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>
>
>
>On 5/14/2025 4:31 AM, Shiju Jose wrote:
>>> -----Original Message-----
>>> From: Daniel Ferguson <danielf@xxxxxxxxxxxxxxxxxxxxxx>
>>> Sent: 14 May 2025 03:55
>>> To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx;
>>> linux- acpi@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx
>>> Cc: bp@xxxxxxxxx; rafael@xxxxxxxxxx; tony.luck@xxxxxxxxx;
>>> lenb@xxxxxxxxxx; leo.duran@xxxxxxx; Yazen.Ghannam@xxxxxxx;
>>> mchehab@xxxxxxxxxx; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>;
>>> linux-mm@xxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>;
>>> rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx;
>>> dave.hansen@xxxxxxxxxxxxxxx; naoya.horiguchi@xxxxxxx;
>>> james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>somasundaram.a@xxxxxxx;
>>> erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; duenwen@xxxxxxxxxx;
>>> gthelen@xxxxxxxxxx; wschwartz@xxxxxxxxxxxxxxxxxxx;
>>> dferguson@xxxxxxxxxxxxxxxxxxx; wbs@xxxxxxxxxxxxxxxxxxxxxx;
>>> nifan.cxl@xxxxxxxxx; tanxiaofei <tanxiaofei@xxxxxxxxxx>; Zengtao (B)
>>> <prime.zeng@xxxxxxxxxxxxx>; Roberto Sassu <roberto.sassu@xxxxxxxxxx>;
>>> kangkang.shen@xxxxxxxxxxxxx; wanghuiqiang
><wanghuiqiang@xxxxxxxxxx>
>>> Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>>>
>>>> +static int ras2_report_cap_error(u32 cap_status) {
>>>> +	switch (cap_status) {
>>>> +	case ACPI_RAS2_NOT_VALID:
>>>> +	case ACPI_RAS2_NOT_SUPPORTED:
>>>> +		return -EPERM;
>>>> +	case ACPI_RAS2_BUSY:
>>>> +		return -EBUSY;
>>>> +	case ACPI_RAS2_FAILED:
>>>> +	case ACPI_RAS2_ABORTED:
>>>> +	case ACPI_RAS2_INVALID_DATA:
>>>> +		return -EINVAL;
>>>> +	default: /* 0 or other, Success */
>>>> +		return 0;
>>>> +	}
>>>> +}
>>>> +
>>>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace
>>>> +*pcc_subspace) {
>>>> +	struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace-
>>>> comm_addr;
>>>> +	u32 cap_status;
>>>> +	u16 status;
>>>> +	u32 rc;
>>>> +
>>>> +	/*
>>>> +	 * As per ACPI spec, the PCC space will be initialized by
>>>> +	 * platform and should have set the command completion bit when
>>>> +	 * PCC can be used by OSPM.
>>>> +	 *
>>>> +	 * Poll PCC status register every 3us(delay_us) for maximum of
>>>> +	 * deadline_us(timeout_us) until PCC command complete bit is
>>> set(cond).
>>>> +	 */
>>>> +	rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status,
>>>> +					status &
>>> PCC_STATUS_CMD_COMPLETE, 3,
>>>> +					pcc_subspace->deadline_us);
>>>> +	if (rc) {
>>>> +		pr_warn("PCC check channel failed for : %d rc=%d\n",
>>>> +			pcc_subspace->pcc_id, rc);
>>>> +		return rc;
>>>> +	}
>>>> +
>>>> +	if (status & PCC_STATUS_ERROR) {
>>>> +		cap_status = readw_relaxed(&gen_comm_base-
>>>> set_caps_status);
>>>> +		rc = ras2_report_cap_error(cap_status);
>>>> +
>>>> +		status &= ~PCC_STATUS_ERROR;
>>>> +		writew_relaxed(status, &gen_comm_base->status);
>>>> +		return rc;
>>>> +	}
>>>> +
>>>> +	if (status & PCC_STATUS_CMD_COMPLETE)
>>>> +		return 0;
>>>> +
>>>> +	return -EIO;
>>>> +}
>>>
>>> We still have an outstanding problem. This may sound familiar.
>>>
>>> If a user specifies an invalid address, our firmware will set an
>>> error code in the set_caps_status field of the acpi_ras2_shmem
>>> structure. In our case, the error code is ACPI_RAS2_INVALID_DATA, and
>>> the user will observe an EINVAL. This is expected.
>>>
>>> However, if the user then subsequently attempts to write a VALID
>>> address, ras2_get_patrol_scrub_running will indirectly call
>>> ras2_check_pcc_chan using the previously INVALID address to determine if
>the scrubber is still running.
>>> Unfortunately, the INVALID address causes
>>> ras2_get_patrol_scrub_running to fail, therefore preventing the user
>>> from specifying a VALID address after specifying an INVALID address.
>>>
>>> The only way to move forward from this inescapable condition is to
>>> reboot the system.
>>>
>>> Here is a demo of the problem as I roughly see it on our system (I've
>>> labeled the line numbers for sake of discussion):
>>> 1  [root@myhost scrub0]# echo 0x100000000 > size
>>> 2  [root@myhost scrub0]# echo 0x1f00000000 > addr
>>> 3  [root@myhost scrub0]# echo 0xcf00000000 > addr
>>> 4  write error: Invalid argument
>>> 5  [  214.446338] PCCT PCCT: Failed to start demand scrubbing
>>> 6  [root@myhost scrub0]# echo 0x1f00000000 > addr
>>> 7  write error: Invalid argument
>>> 8  [  242.263909] PCCT PCCT: failed to read parameters
>>> 9  [root@myhost scrub0]# echo 0x100000000 > size
>>> 10 write error: Invalid argument
>>> 11 [  246.190196] PCCT PCCT: failed to read parameters
>>>
>>> The upper most memory address on this system is 0xbf00000000. Line 1
>>> and 2 use valid values, and line 2 produces the expected results. On
>>> line 3, I've specified an INVALID address (outside of valid range).
>>> The error on line 5 is expected after executing the
>>> START_PATROL_SCRUBBER command with an INVALID address.
>>>
>>> Line 6 show how I attempt to specify a VALID address. Unfortunately,
>>> ras2_get_patrol_scrub_running encounters and error after executing
>>> GET_PATROL_PARAMETERS because it used the OLD INVALID values in
>>> ps_sm-
>>>> params.req_addr_range. Line 7 and 8 are the result. Since the flow
>>>> of
>>> execution if aborted at this point, you can never rectify the
>>> situation and insert a valid value into ps_sm->params.req_addr_range, unless
>you reboot the system.
>>>
>>> One half baked solution to this problem, is to modify
>>> ras2_get_patrol_scrub_running so that if there is a non-zero address
>>> or size specified, AND the last error code we received was INVALID
>>> DATA, then assume the scrubber is NOT running.
>> Hi Daniel,
>>
>> Thanks for reporting the issue.
>> Can you check whether following change fix the issue in your test setup?
>> =============================================
>> diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c index
>> 4d9cfd3bdf45..ff4aa1b75860 100644
>> --- a/drivers/ras/acpi_ras2.c
>> +++ b/drivers/ras/acpi_ras2.c
>> @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device
>*dev, void *drv_data, u64 base
>>         ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
>>         if (ret) {
>>                 dev_err(ras2_ctx->dev, "Failed to start demand
>> scrubbing\n");
>> +               if (ret == -EPERM || ret == -EINVAL) {
>> +                       ps_sm->params.req_addr_range[0] = 0;
>> +                       ps_sm->params.req_addr_range[1] = 0;
>> +                       ras2_ctx->base = 0;
>> +                       ras2_ctx->size = 0;
>> +                       ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
>> +               }
>>                 return ret;
>>         }
>>         ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE;
>> ==============================================
>> Thanks,
>> Shiju
>
>We're happy! with this fix.
>
>For this to work, we had to no-op the START_PATROL_SCRUBBER and
>GET_PATROL_PARAMETERS commands when base and size are equal to zero.
>Previously, we returned INVALID DATA when base and size were zero.

Thanks Daniel for verifying the changes.

For demand scrubbing, kernel does not send START_PATROL_SCRUBBER command
when size is zero.
However GET_PATROL_PARAMETERS command from ras2_update_patrol_scrub_params_cache()
is valid case when base and size are equal to zero.

>
>Maybe we should amend the ACPI spec with this special knowledge.
>
>Anyways; as of now, with this fix, this driver will work out of the box on our
>systems.
  
Please tag V6 sent with fix.
https://lore.kernel.org/all/20250516132205.789-1-shiju.jose@xxxxxxxxxx/
 
>
>Thanks a lot,
>~Daniel
>>
>>>
>>> Regards,
>>> ~Daniel
>

Thanks,
Shiju






[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux