Re: smartctl failed with error -22

Tim Holloway <timh@xxxxxxxxxxxxx> · Thu, 21 Aug 2025 12:44:01 -0400

I rely on SMART for 2 things:

1. Repeatedly sending me messages about nonfatal bad sectors that no one 
seems to know how to correct for.

2. Not saying anything before a device crashes.

Yeah. But I run it anyway because you never know.

The reported error is too far abstracted from the actual failure and I 
cannot find anything about -22 as a SMART result code. *n*x errno 22 is 
EINVAL, which seems unlikely, but it is possible that smartd got 
misconfigured.
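For reference, a small sketch of how a negative code like -22 can be decoded. It assumes the reporting code follows the common convention of returning -errno; whether Ceph really does that here is an assumption. The helper name describe_negative_errno is made up for illustration.

```python
import errno
import os


def describe_negative_errno(code: int) -> str:
    """Interpret a negative return value as -errno, if it maps to a known errno."""
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(-code)
    if code >= 0 or name is None:
        return f"{code}: not a recognised -errno value"
    # os.strerror gives the platform's human-readable message for the errno.
    return f"{code}: {name} ({os.strerror(-code)})"


print(describe_negative_errno(-22))
```

If the code decodes to EINVAL, that points at the caller rejecting something (arguments, or output it could not parse) rather than at a SMART status reported by the drive itself.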

Run smartctl -t long /dev/sdc to launch an out-of-band long self-test. When 
it is done, use smartctl -l selftest /dev/sdc to report the results and see 
if anything is flagged.
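As a rough sketch, the smartctl invocations usually used for this are -t long to start the self-test, -l selftest to read the self-test log afterwards, and -H for the overall health verdict. The snippet below only builds and prints the command lines; the function name long_selftest_commands and the use of sudo are assumptions for illustration.

```python
import shlex


def long_selftest_commands(device: str) -> dict[str, list[str]]:
    """Return the smartctl command lines for a long self-test on one device."""
    return {
        # Start the extended (long) self-test; it runs inside the drive.
        "start": ["sudo", "smartctl", "-t", "long", device],
        # After the test has finished, read the self-test log.
        "results": ["sudo", "smartctl", "-l", "selftest", device],
        # Overall SMART health assessment (PASSED/FAILED).
        "health": ["sudo", "smartctl", "-H", device],
    }


for step, argv in long_selftest_commands("/dev/sdc").items():
    print(f"{step}: {shlex.join(argv)}")
```

A long test can take a while on a large drive; smartctl prints an estimated completion time when the test is started.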

On 8/21/25 09:10, Anthony D'Atri wrote:

On Aug 21, 2025, at 4:07 AM, Miles Goodhew <ceph@xxxxxxxxx> wrote:

Hi Robert,
  I'm not an expert on the low-level details and "modern" Ceph, so I hope I don't lead you on any wild goose chases, but I might at least give some leads.
  It seems odd that the metrics mention NVMe - I'm guessing that it's just a cross-product test and tries all tools on all devices.
Recent releases of smartctl pass through stats for NVMe devices via the nvme-cli command "nvme".  Whether it invokes that for all devices, in what order, etc., I don't know.

SMART test failure is more of an issue. It's a pity the error message is so nondescript. Some things I can think of from simplest to most complicated are:
* Are smartmontools installed on the drive host?
Does it happen with other drives on the same host?

If you have availability through your chassis vendor, look for a firmware update.

* Does the monitoring UID have sudo access?
* Does a manual "sudo smartctl -a /dev/sdc" give the same or similar result?
* Is the drive managed by a hardware RAID controller or concentrator (like a Dell PERC, a USB adapter or something)?
* (This is a stretch) Is there an OSD for the drive that's given the "NVME" class?
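The first few checks in the list can be sketched as a small script. This is only an illustration: the function name check_smartctl is made up, and it assumes the check is run as the same user the monitoring daemon runs as. "sudo -n" makes sudo fail instead of prompting for a password, which is roughly what a non-interactive daemon would experience.

```python
import shutil
import subprocess


def check_smartctl(device, which=shutil.which, run=subprocess.run):
    """Return a list of findings about running smartctl via sudo on device."""
    findings = []
    # Check 1: is smartmontools installed (is smartctl on PATH)?
    if which("smartctl") is None:
        findings.append("smartctl not found in PATH; is smartmontools installed?")
        return findings
    # Checks 2 and 3: can this user run it through sudo without a prompt,
    # and what does a manual "smartctl -a" report?
    proc = run(["sudo", "-n", "smartctl", "-a", device],
               capture_output=True, text=True)
    if proc.returncode != 0:
        detail = proc.stderr.strip() or proc.stdout.strip()
        findings.append(f"exit status {proc.returncode}: {detail}")
    else:
        findings.append("smartctl ran cleanly via sudo")
    return findings
```

Calling check_smartctl("/dev/sdc") on the affected host should show whether the failure is a missing binary, a sudo problem, or smartctl itself complaining. Note that smartctl's exit status is a bitmask, so a nonzero value does not always mean the command could not run.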

Hope that gives you something.

M0les.

On Thu, 21 Aug 2025, at 17:15, Robert Sander wrote:
Hi,

On a new cluster with version 19.2.3 the device health metrics only show a smartctl error:

{
     "20250821-000313": {
         "dev": "/dev/sdc",
         "error": "smartctl failed",
         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
         "nvme_smart_health_information_add_log_error_code": -22,
         "nvme_vendor": "ata",
         "smartctl_error_code": -22,
         "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
     }
}
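To scan many such reports at once, a minimal sketch along these lines could pull out the failed entries. It assumes the JSON has the shape shown above (timestamp keys mapping to per-device objects); the function name failed_entries is hypothetical, and the sample is copied from this report.

```python
import json

# Sample copied from the report above; the raw string keeps the \n escapes intact for JSON.
REPORT = r"""
{
    "20250821-000313": {
        "dev": "/dev/sdc",
        "error": "smartctl failed",
        "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
        "nvme_smart_health_information_add_log_error_code": -22,
        "nvme_vendor": "ata",
        "smartctl_error_code": -22,
        "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
    }
}
"""


def failed_entries(report_json: str):
    """Return (timestamp, device, error, error_codes) for entries that report an error."""
    failures = []
    for stamp, entry in sorted(json.loads(report_json).items()):
        # Collect every field that looks like a numeric error code.
        codes = {k: v for k, v in entry.items() if k.endswith("_error_code")}
        if "error" in entry or codes:
            failures.append((stamp, entry.get("dev"), entry.get("error"), codes))
    return failures


for stamp, dev, error, codes in failed_entries(REPORT):
    print(stamp, dev, error, codes)
```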

The device in question (like all the others in the cluster) is a Samsung MZ7L37T6 SATA SSD.

What is happening here?

Regards
--
Robert Sander
Linux Consultant

Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: +49 30 405051 - 0
Fax: +49 30 405051 - 19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
