On 6/3/25 20:28, Friedrich Weber wrote:
>>> They provided controller information via `sas3ircu` and `storcli`:
>>>
>>> sas3ircu:
>>>
>>> Controller type : SAS3008
>>> BIOS version : 8.37.00.00
>>> Firmware version : 16.00.16.00
>>
>> Is this the latest available FW for this HBA ? (see below)
>
> It seems 16.00.16.00 is even newer than the latest version available on
> the Broadcom website, which is a bit strange -- I only found [1] there
> which has an older 16.00.14.00 (3008_FW_PH16.00.14.00.rar).

So this is an old/now EOL 9300 series HBA, right ? Or is this a 3008 controller
chip that is part of the server motherboard (e.g. a Supermicro HBA) ?
Looking at the Broadcom support page for legacy products, the latest FW version
seems to be 16.00.10.00.

>>> storcli:
>>>
>>> Firmware Package Build = 24.18.0-0021
>>> Firmware Version = 4.670.00-6500
>>> CPLD Version = 26515-00A
>>> Bios Version = 6.34.01.0_4.19.08.00_0x06160200
>>> HII Version = 03.23.06.00
>>> Ctrl-R Version = 5.18-0400
>>> Preboot CLI Version = 01.07-05:#%0000
>>> NVDATA Version = 3.1611.00-0005
>>> Boot Block Version = 3.07.00.00-0003
>>> Driver Name = megaraid_sas
>>> Driver Version = 07.727.03.00-rc1
>>
>> Unfortunately, I do not have any megaraid model so I cannot test/recreate. I
>> only have mpt3sas (9300, 9400 and 9500 series HBAs) and mpi3mr models (9600 HBA
>> series).
>
> We just realized this is actually the firmware information for a
> different unrelated controller on the same host (a LSI MegaRAID SAS-3
> 3108 using the megaraid_sas driver). But the megaraid_sas one is not
> used in our tests, so please ignore the storcli output we provided.
> Sorry for the confusion.
>
> The controller we're testing with is the SAS3008 I mentioned initially,
> with firmware version 16.00.16.00 as reported by sas3ircu above.

I do not have this FW... I am not sure what the HBA itself is either. I only
have some Broadcom 9300-XX HBAs that have the 3008 controller.

> FWIW, the user reports they have also seen the same issue with a
> SAS3-9500-8e Tri-mode HBA.

This one had a FW update last month or so. So checking with the latest FW
version is required.

>>> And the disk information from `smartctl --xall`
>>>
>>> 20T:
>>>
>>> === START OF INFORMATION SECTION ===
>>> Vendor: WDC
>>> Product: WUH722020BL5204

...

>>> Product: WUH721818AL5204

I have these. I will try to check. But again, I seriously doubt this has
anything to do with the drives since these do not support CDL, nor do the HBAs
you listed. None of them support CDL, so when calling scsi_report_opcode() to
check for CDL, we should always see the HBA SAT return "CDL not supported".

>> I do not think that the drives are relevant for this issue. How the HBA react
>> to a command error from the drive resulting from the HBA command translation
>> likely is the issue.
>
> I see, but it is certainly strange that 18T vs 20T drives do seem to
> make a difference (hotplug works with 18T and doesn't work with 20T).

Probably a timing difference, since these drives are not the same generation.
They have different timing on scan.

>>> If you need any additional information, please let us know!
>>
>> Adding the Broadcom folks to this thread, since as suspected, this seems to be
>> an HBA issue. I strongly suspect that it relates to a recent very similar issue
>> I have seen with the mpi3mr driver and a 9600 Broadcom HBA: any hotplug of a
>> drive would completely crash the HBA and a full power cycle was needed to
>> recover. A simple reboot would not be sufficient. I think the latest HBA FW
>> version fixes that problem.
>>
>> Broadcom team,
>>
>> Any comment ?

Broadcom ? Would you care to comment ?

At this point, I have no idea what is going on. My hunch is that it is the HBA
SAT misbehaving. But that is only a hunch. To prove it, we would likely need a
bus trace and have Broadcom look at the HBA logs (which can be extracted using
storcli).
All of this likely means involving the technical support of the vendors.

--
Damien Le Moal
Western Digital Research