Re: [PATCH v7 08/19] scsi: detect support for command duration limits

On 4/30/25 15:39, Damien Le Moal wrote:
> On 2025/04/30 7:13, Friedrich Weber wrote:
>> Hi,
>>
>> One of our users reports that, in their setup, hotplugging new disks doesn't
>> work anymore with recent kernels (details below). The issue appeared somewhere
>> between kernels 6.4 and 6.5, and they bisected the change to this patch:
>>
>>   624885209f31 (scsi: core: Detect support for command duration limits)
>>
>> The issue is also reproducible on a mainline kernel 6.14.4 build from [1]. When
>> hotplugging a disk under 6.14.4, the following is logged (I've redacted some
>> identifiers, let me know in case I've been too overzealous with that):
>>
>> Apr 28 16:41:13 pbs-disklab kernel: mpt3sas_cm0: handle(0xa) sas_address(0xREDACTED_SAS_ADDR) port_type(0x1)
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: Direct-Access     WDC      REDACTED_SN  C5C0 PQ: 0 ANSI: 7
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: SSP: handle(0x000a), sas_addr(0xREDACTED_SAS_ADDR), phy(2), device_name(REDACTED_DEVICE_NAME)
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: enclosure logical id (REDACTED_LOGICAL_ID), slot(0) 
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: enclosure level(0x0000), connector name(     )
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
>> Apr 28 16:41:13 pbs-disklab kernel: scsi 5:0:1:0: Power-on or device reset occurred
>> Apr 28 16:41:16 pbs-disklab kernel: mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
> 
> This decodes to:
> 
> Code:     	00110000h	PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
> Sub Code: 	00000E00h	PL_LOGINFO_SUB_CODE_DISCOVERY_SATA_ERR
> 
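For reference, a minimal sketch of how the two log_info words in this log
split into fields. It assumes the bit layout that matches the driver's own
output above (bus type in bits 31:28, originator in bits 27:24, code in bits
23:16, sub-code in bits 15:0); the helper name and the originator table are
illustrative only and not taken from the driver source.

```python
# Assumed layout: [31:28] bus type, [27:24] originator, [23:16] code, [15:0] sub-code.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # assumption: names as printed by mpt3sas


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info word into its fields (hypothetical helper)."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get(originator, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    # The two words logged during the failed hotplug.
    for word in (0x31110E05, 0x31130000):
        f = decode_log_info(word)
        print(
            f"log_info(0x{word:08x}): originator({f['originator']}), "
            f"code(0x{f['code']:02x}), sub_code(0x{f['sub_code']:04x})"
        )
```

The upper byte of the sub-code (0x0e00) is what the decode above matches
against PL_LOGINFO_SUB_CODE_DISCOVERY_SATA_ERR.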
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: log_info(0x31130000): originator(PL), code(0x13), sub_code(0x0000)
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: Attached scsi generic sg1 type 0
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Test Unit Ready failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Sense not available.
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Sense not available.
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] 0 512-byte logical blocks: (0 B/0 B)
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] 0-byte physical blocks
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Test WP failed, assume Write Enabled
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Asking for cache data failed
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Assuming drive cache: write through
>> Apr 28 16:41:18 pbs-disklab kernel:  end_device-5:1: add: handle(0x000a), sas_addr(0xREDACTED_SAS_ADDR)
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: handle(0x000a), ioc_status(0x0022) failure at drivers/scsi/mpt3sas/mpt3sas_transport.c:225/_transport_set_identify()!
>> Apr 28 16:41:18 pbs-disklab kernel: sd 5:0:1:0: [sdb] Attached SCSI disk
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0xREDACTED_SAS_ADDR)
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: removing handle(0x000a), sas_addr(0xREDACTED_SAS_ADDR)
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: enclosure logical id(REDACTED_LOGICAL_ID), slot(0)
>> Apr 28 16:41:18 pbs-disklab kernel: mpt3sas_cm0: enclosure level(0x0000), connector name(     )
>>
>> and the block device isn't accessible afterwards. It does seem to be visible
>> after a reboot.
>>
>> lspci on this host shows:
>>
>> 02:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
>> 	Subsystem: Broadcom / LSI SAS9300-8i [1000:30e0]
>> 	Kernel driver in use: mpt3sas
>> 	Kernel modules: mpt3sas
>>
>> The HBA is placed on a PCIe 3.0 x8 slot (not bifurcated) and connected via
>> SFF-8643 to a simple 2U 12xLFF SAS3 Supermicro box. The user can also reproduce
>> the issue with other HBAs with e.g. the SAS3108 and SAS3816 chipsets.
>>
>> The device doesn't seem to support CDL. So if I see correctly, the only
>> effective change introduced by the patch is the four scsi_cdl_check_cmd (and
>> thus scsi_report_opcode) calls to check for CDL support. Hence we wondered
>> whether these may be the cause of the issue. We ran a few tests to verify:
>>
>> - disabling "REPORT SUPPORTED OPERATION CODES" by passing
>>   `scsi_mod.dev_flags=WDC:REDACTED_SN:536870912` (the flag being
>>   BLIST_NO_RSOC) resolves the issue (hotplug works again), but I imagine
>>   disabling RSOC altogether isn't a good workaround. This test was not done
>>   on a mainline kernel, but I don't think it would make a difference.
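As a side note on the number used in that test: a small sketch of where
536870912 comes from, assuming BLIST_NO_RSOC is bit 29 as defined in
include/scsi/scsi_devinfo.h (worth double-checking against the kernel tree in
use). The helper name is hypothetical, and the vendor and model strings are
the redacted ones from above.

```python
# Assumption: BLIST_NO_RSOC is bit 29 in include/scsi/scsi_devinfo.h.
BLIST_NO_RSOC = 1 << 29


def dev_flags_param(vendor: str, model: str, flags: int) -> str:
    """Format a scsi_mod.dev_flags entry as vendor:model:flags (hypothetical helper)."""
    return f"scsi_mod.dev_flags={vendor}:{model}:{flags}"


if __name__ == "__main__":
    # Reproduces the parameter used in the test described above.
    print(dev_flags_param("WDC", "REDACTED_SN", BLIST_NO_RSOC))
```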
> 
> So it seems that the HBA SAT is choking on the report supported opcode command.
> I have several mpt3sas HBAs and I have never seen this issue running the latest
> FW version for these (EOL) HBAs. So I am tempted to say that an HBA FW update
> should resolve the issue, BUT I do not recall doing any drive hotplug tests.
> This issue may trigger only with hotplug and not with a cold start...
> Can you confirm that?
> 
Yes, a cold boot works. After a hotplug, the disk enters a broken state
that subsequent reboots don't fix. Only removing power clears it.

They mentioned the following tests:

- Get the 20TB disk into a faulty state by booting kernel 6.5 or above
(6.14.X in this case; the disk caddy light on the server keeps blinking,
dmesg shows power-reset)
- Reboot the server into the same kernel (6.14.X)
- Disk remains in the faulty state; it does not attach to the system or
show up under any path (lsblk, df, blkid)

and

- Get the 20TB disk into a faulty state by hotswapping on kernel 6.5 or
above.
- Shut off the machine, remove it from power and reattach.
- Start the machine.
- The 20TB disk mounts during boot and is accessible in the OS as a block
device afterwards.


>>
>> - we patched out the four calls to scsi_cdl_check_cmd and unconditionally set
>>   cdl_supported to 0, see [2] for the patch (on top of 6.14.4). This resolves
>>   the issue.
>>
>> - I suspected that particularly the two latter scsi_cdl_check_cmd calls with a
>>   nonzero service action might be problematic, so we patched them out
>>   specifically but kept the other two calls without a service action, see [3]
>>   for the patch (on top of 6.14.4). But with this patch, hotplug still does
>>   not work.
>>
>> - the RSOC commands themselves don't seem to be problematic per se. We asked
>>   the user to boot a (non-mainline) kernel with the `scsi_mod.dev_flags`
>>   parameter to disable RSOC as above, hotplug the disk (this succeeds), and
>>   then query the four opcodes/service actions using `sg_opcodes`, and this
>>   looks okay [4] (reporting that CDL is not supported).
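To make the four queries concrete, here is a rough sketch of the REPORT
SUPPORTED OPERATION CODES CDBs involved, following the 12-byte CDB layout from
SPC (MAINTENANCE IN, service action 0Ch). The opcode list (READ(16),
WRITE(16), and READ(32)/WRITE(32) as service actions of the variable-length
opcode 7Fh), the reporting options values and the allocation length are my
assumptions about what the kernel sends, not something verified against the
source or a bus trace.

```python
import struct

MAINTENANCE_IN = 0xA3
MI_REPORT_SUPPORTED_OPERATION_CODES = 0x0C

# Assumed set of (opcode, service action) pairs checked for CDL support.
CDL_QUERIES = [
    ("READ(16)", 0x88, 0x0000),
    ("WRITE(16)", 0x8A, 0x0000),
    ("READ(32)", 0x7F, 0x0009),
    ("WRITE(32)", 0x7F, 0x000B),
]


def build_rsoc_cdb(opcode: int, service_action: int = 0, alloc_len: int = 64) -> bytes:
    """Build a 12-byte RSOC CDB for a single command (hypothetical helper).

    Reporting options: 1 = one command, opcode only; 3 = one command with
    service action. Which value the kernel uses is an assumption here.
    """
    reporting_options = 3 if service_action else 1
    return struct.pack(
        ">BBBBHIBB",
        MAINTENANCE_IN,
        MI_REPORT_SUPPORTED_OPERATION_CODES,
        reporting_options,
        opcode,
        service_action,  # bytes 4-5: requested service action
        alloc_len,       # bytes 6-9: allocation length
        0,               # byte 10: reserved
        0,               # byte 11: control
    )


if __name__ == "__main__":
    for name, opcode, sa in CDL_QUERIES:
        print(f"{name:10s} {build_rsoc_cdb(opcode, sa).hex(' ')}")
```

Comparing these bytes with what actually reaches the drive during hotplug
(for example from an HBA or analyzer trace, if one is available) might help
tell a translation problem apart from a timing problem.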
>>
>> I wonder whether these results might suggest the RSOC queries are problematic
>> not in general, but at this particular point (during device initialization) in
>> this particular hardware setup? If this turns out to be the case -- would it be
>> feasible to suppress these RSOC queries if CDL is not enabled via sysfs?
> 
> I would be tempted to say that indeed it is the RSOC command handling in the HBA
> SAT that has issues. But your command line checks [4] tend to indicate
> otherwise. The issue may trigger only with timing differences with hotplug though.
> 
> The other possible problem may be that the RSOC command translation is actually
> fine but ends up generating an ATA command that the drive is not happy about,
> either because of a drive FW bug or because of the timing the drive receives
> that command. Given that this is a WD drive, I can probably check that if you
> can send to me the drive model and FW rev (sending that information off-list is
> fine).
> 
>> If you have any ideas for further troubleshooting, we're happy to gather more
>> data. I'll be AFK for a few weeks, but Mira (in CC) will take over in the
>> meantime.
> 
> Checking the HBA FW version would be a start, and also if you can confirm if
> this issue happens only on hotplug or also during cold boot would be nice. I am
> traveling right now and will not be able to test hot-plugging drives on my
> setups until end of next week.
> 

They provided controller information via `sas3ircu` and `storcli`:

sas3ircu:

  Controller type                         : SAS3008
  BIOS version                            : 8.37.00.00
  Firmware version                        : 16.00.16.00

storcli:

Firmware Package Build = 24.18.0-0021
Firmware Version = 4.670.00-6500
CPLD Version = 26515-00A
Bios Version = 6.34.01.0_4.19.08.00_0x06160200
HII Version = 03.23.06.00
Ctrl-R Version = 5.18-0400
Preboot CLI Version = 01.07-05:#%0000
NVDATA Version = 3.1611.00-0005
Boot Block Version = 3.07.00.00-0003
Driver Name = megaraid_sas
Driver Version = 07.727.03.00-rc1

And the disk information from `smartctl --xall`:

20T:

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WUH722020BL5204
Revision:             C5C0
Compliance:           SPC-5
User Capacity:        20,000,588,955,648 bytes [20.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <id>
Serial number:        <S/N>
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu May  1 15:23:35 2025 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

18T:

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WUH721818AL5204
Revision:             C8C2
Compliance:           SPC-5
User Capacity:        18,000,207,937,536 bytes [18.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <id>
Serial number:        <S/N>
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu May  1 15:25:27 2025 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

The 18T disk is not affected by this issue. Hotplug works as expected
with it.


If you need any additional information, please let us know!





