As I said, I rely on SMART for 2 things. Alas.

My servers all run SMART (smartd) on all storage devices and report to
Nagios, which then spreads the joy. I generally run smartctl when I
first install a drive, just to make sure there are no surprises. Some
of these drives have been moved around to accommodate local conditions,
so they're not always factory-fresh.

Traditionally, a hard drive would remap factory-defective sectors
before sale, making them effectively invisible for practical computing
use. Within the Linux OS environment you can run badblocks to map
post-sale defective sectors. As far as I know, there is no standard
utility for adding failing sectors to the hardware remap list, and the
badblocks list is something that has to be consumed by the specific
filesystem driver in question. By all appearances, SMART continues to
test all sectors regardless of OS usage, since it runs internal to the
drive itself and thus knows nothing of the OS. So it would flag sectors
that have been walled off from actual use.
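(For reference, the read-only badblocks scan is the safe variant;
something like

    badblocks -sv /dev/sdX

reports unreadable sectors without writing anything, while the -n and
-w modes exercise writes and -w destroys data. On ext filesystems,
"e2fsck -c" will run badblocks and fold the results into the
filesystem's bad-block list. /dev/sdX is a placeholder for the actual
device.)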
So far, the only SMART errors I've been annoyed by have been in OS
partitions and never on an OSD, so I don't know just how closely SMART
and Ceph interact. My bad-sector messages (usually no more than 3 for a
given drive) typically run for up to 3 years without further
degradation, and since I have both robust data storage (e.g., Ceph) and
robust backups, I'm happy to run drives into the ground. It's not like
I'm being paid a handsome sum to keep everything flawless.

Tim

On Thu, 2025-08-21 at 16:04 -0400, Anthony D'Atri wrote:
> > I rely on SMART for 2 things:
> 
> Bit of nomenclature clarification:
> 
> SMART is a mechanism for interacting with storage devices, mostly but
> not exclusively reading status and metrics
> smartctl is a CLI utility
> smartd is a daemon
> smartmontools is a package that includes smartctl and smartd
> 
> > 1. Repeatedly sending me messages about nonfatal bad sectors that
> > no one seems to know how to correct for.
> 
> That sounds like you have smartd running and configured to send
> email? I personally haven't found value in smartd; ymmv.
> 
> Drives sometimes encounter failed writes that result in a grown
> defect and remapping of the subject LBA. When that LBA is written
> again, it generally succeeds. Since Nautilus the OSD will retry a
> certain number of writes, and as a result we see a lot fewer
> inconsistent PGs than we used to.
> 
> When a drive reports grown errors, those are worth tracking. Every
> drive has some number of factory bad blocks that are remapped out of
> the box. A portion of individual HDDs will develop additional grown
> defects over their lifetimes. A few of these are not cause for alarm;
> a lot of them IMHO is cause to replace the drive. The threshold for
> "a lot" is not clear, perhaps somewhere in the 5-10 range?
> 
> SSDs will report reallocated blocks and in some cases the numbers of
> spares used/remaining. It is worth tracking these and alerting if a
> drive is running short on spares or experiences a high rate of new
> reallocated blocks.
> 
> > 2. Not saying anything before a device crashes.
> > 
> > Yeah. But I run it anyway because you never know.
> 
> The overall health status? Yeah, that usually has limited utility.
> I've seen plenty of drives that are clearly throwing errors yet
> report healthy. Arguably a firmware bug.
> 
> > The reported error is too far abstracted from the actual failure
> > and I cannot find anything about -22 as a SMART result code.
> 
> I'm pretty sure that's an errno number for launching the subprocess,
> nothing to do with SMART itself. I'd check dmesg and
> /var/log/{messages,syslog} to see if anything was reported for that
> drive. If the drive is SAS/SATA and hung off an LSI HBA, also try
> `storcli64 /c0 show termlog >/var/tmp/termlog.txt`
> 
> > *n*x errno 22 is EINVAL, which seems unlikely, but it is possible
> > that smartd got misconfigured.
> 
> Best configuration IMHO is to stop, disable, and mask the service.
> 
> > Run smartctl -t long /dev/sdc to launch an out-of-band long test.
> > When it is done, use smartctl to report the results and see if
> > anything is flagged.
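(Spelling that out, the usual sequence is something like:

    smartctl -t long /dev/sdc      # start the drive-internal extended self-test
    smartctl -l selftest /dev/sdc  # progress while running, pass/fail once done
    smartctl -a /dev/sdc           # full attribute dump; watch the reallocated
                                   # and pending-sector counts

The long test runs inside the drive and can take hours on a large HDD,
but the device keeps servicing normal I/O in the meantime.)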
> > On 8/21/25 09:10, Anthony D'Atri wrote:
> > > > On Aug 21, 2025, at 4:07 AM, Miles Goodhew <ceph@xxxxxxxxx>
> > > > wrote:
> > > > 
> > > > Hi Robert,
> > > > I'm not an expert on the low-level details and "modern" Ceph,
> > > > so I hope I don't lead you on any wild goose chases, but I
> > > > might at least give some leads.
> > > > It seems odd that the metrics mention NVMe - I'm guessing that
> > > > it's just a cross-product test and tries all tools on all
> > > > devices.
> > > Recent releases of smartctl pass through stats for NVMe devices
> > > via the nvme-cli command "nvme". Whether it invokes that for all
> > > devices, ordering, etc. I don't know.
> > > 
> > > > SMART test failure is more of an issue. It's a pity the error
> > > > message is so nondescript. Some things I can think of, from
> > > > simplest to most complicated, are:
> > > > * Are smartmontools installed on the drive host?
> > > Does it happen with other drives on the same host?
> > > 
> > > If you have availability through your chassis vendor, look for a
> > > firmware update.
> > > 
> > > > * Does the monitoring UID have sudo access?
> > > > * Does a manual "sudo smartctl -a /dev/sdc" give the same or
> > > > similar result?
> > > > * Is the drive managed by a hardware RAID controller or
> > > > concentrator (like a Dell PERC or a USB adapter or something)?
> > > > * (This is a stretch) Is there an OSD for the drive that's been
> > > > given the "nvme" device class?
> > > > 
> > > > Hope that gives you something.
> > > > 
> > > > M0les.
> > > > 
> > > > On Thu, 21 Aug 2025, at 17:15, Robert Sander wrote:
> > > > > Hi,
> > > > > 
> > > > > On a new cluster with version 19.2.3 the device health
> > > > > metrics only show a smartctl error:
> > > > > 
> > > > > {
> > > > >     "20250821-000313": {
> > > > >         "dev": "/dev/sdc",
> > > > >         "error": "smartctl failed",
> > > > >         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
> > > > >         "nvme_smart_health_information_add_log_error_code": -22,
> > > > >         "nvme_vendor": "ata",
> > > > >         "smartctl_error_code": -22,
> > > > >         "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
> > > > >     }
> > > > > }
> > > > > 
> > > > > The device in question (like all the others in the cluster)
> > > > > is a Samsung MZ7L37T6 SATA SSD.
> > > > > 
> > > > > What is happening here?
> > > > > 
> > > > > Regards
> > > > > -- 
> > > > > Robert Sander
> > > > > Linux Consultant
> > > > > 
> > > > > Heinlein Consulting GmbH
> > > > > Schwedter Str. 8/9b, 10119 Berlin
> > > > > 
> > > > > https://www.heinlein-support.de
> > > > > 
> > > > > Tel: +49 30 405051 - 0
> > > > > Fax: +49 30 405051 - 19
> > > > > 
> > > > > Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> > > > > Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx