Monitor your hard drive health with SMART daemon

In the last post we covered some of the options we have to back up our data. It all boils down to providing redundancy in case something fails, which gives us some peace of mind in the event of hard drive failure.

But having copies of our data doesn’t mean that we are just going to sit and wait until the hard drive fails. What if the secondary backup drive fails? What if we don’t realize until it is too late and our main drive fails as well?

Drives will fail sooner or later. The question is: when will this happen?

 

Introducing S.M.A.R.T

S.M.A.R.T., which is the acronym for Self-Monitoring, Analysis and Reporting Technology, is a technology that emerged in the mid-1990s for PATA (IDE) devices and has since spread to all kinds of storage devices, such as HDD, SSD or eMMC, and to interfaces such as ATA, SCSI and NVMe. SMART is built into practically any hard drive that we can buy nowadays, and its mission is to let the storage device monitor its own health status and report it to the operating system.

In other words, our hard drives are aware of how old they are getting, and they can detect signs of wear that are strong predictors of imminent failure, even before the drive actually fails. According to Seagate, 60% of spinning drive failures are mechanical in nature and can be predicted before they happen, which gives us an opportunity to save our data and replace the drive before it is too late.

In Linux we can query the SMART system through the tools included in smartmontools.

 

How is my hard drive doing?

 

SMART predictions are based on physical statistics, such as temperature, read errors and seek errors. Bad values for these indicators are a strong red flag, and studying them allows us to assess how worn out our hardware is getting. You can see some descriptions in this list.

In order to check these counters, first make sure that SMART is enabled on the drive (smartctl -i reports it, and smartctl -s on enables it); then you can get all available information from smartctl -a.
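Both steps can be sketched as a small shell helper. This is only a minimal sketch: it assumes an ATA/SATA drive, that smartctl runs as root, and that the device path you pass in (for example /dev/sda) is your own; the name smart_report is made up for illustration.

```shell
# smart_report is a hypothetical helper name; $1 is the device, e.g. /dev/sda.
smart_report() {
    dev="$1"
    # smartctl -i prints identity information, including whether SMART is enabled.
    if ! smartctl -i "$dev" | grep -q 'SMART support is: *Enabled'; then
        # Turn SMART on if the drive does not report it as enabled.
        smartctl -s on "$dev" || return 1
    fi
    # -a prints all SMART information: health, attributes, capabilities and logs.
    smartctl -a "$dev"
}
```

After sourcing it in a root shell, `smart_report /dev/sda` would print the full report for that drive.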

We might have noticed this section

As we can infer from it, SMART-enabled drives are able to run an offline self-test in order to assess their own health. There are mainly two kinds of self-test:

  • Short: uses basic heuristics that can normally predict failure. They take typically less than five minutes.
  • Long: includes a surface scan and can take hours, but the results will be more accurate.

It goes without saying that the performance of the drive will go down while busy running the tests.

To launch a short test, run smartctl -t short on the drive; for a long one, use smartctl -t long. A running test can be cancelled with smartctl -X.

The test results can be obtained with smartctl -a or smartctl -l selftest.
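The self-test commands can be wrapped as follows. This is a hedged sketch rather than a finished tool: the function names are invented for illustration, the device path is whatever you pass in, and smartctl is assumed to run as root.

```shell
# run_selftest is a hypothetical wrapper; $1 is the device, $2 is "short" or "long".
run_selftest() {
    dev="$1"
    kind="${2:-short}"
    case "$kind" in
        # -t starts the requested self-test in the background on the drive.
        short|long) smartctl -t "$kind" "$dev" ;;
        *) echo "usage: run_selftest DEVICE [short|long]" >&2; return 2 ;;
    esac
}

# Abort a self-test that is currently running.
abort_selftest() {
    smartctl -X "$1"
}

# Print the self-test log, which lists the results of past tests.
selftest_log() {
    smartctl -l selftest "$1"
}
```

For example, `run_selftest /dev/sda long` would start the surface scan, and `selftest_log /dev/sda` would show the outcome once it has finished.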

We can compare some stats for a drive in a good state

, with a drive that is starting to fail

The following one is already failing hard, we can even see the reason

You can see some good examples of how to react when bad things happen here.

While some stats will be more significant than others depending on the vendor, the following seem to be good indicators across the board
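As a hedged sketch, the helper below pulls a handful of attributes out of the smartctl -A table. The selection (IDs 5, 187, 188, 197 and 198) is an assumption taken from Backblaze's published drive statistics, and key_attributes is a made-up name. A non-zero raw value is only flagged for a closer look, not treated as proof of failure.

```shell
# key_attributes is a hypothetical helper; $1 is the device, e.g. /dev/sda.
# Assumption: the watched IDs (5, 187, 188, 197, 198) follow Backblaze's
# published analysis; adjust the list for your own vendor.
key_attributes() {
    # smartctl -A prints the vendor attribute table of an ATA drive:
    # column 1 is the ID, column 2 the name, column 10 the raw value.
    smartctl -A "$1" | awk '
        $1 == 5 || $1 == 187 || $1 == 188 || $1 == 197 || $1 == 198 {
            status = ($10 == 0) ? "ok" : "CHECK"
            printf "%-3s %-26s raw=%s %s\n", $1, $2, $10, status
        }'
}
```

Running `key_attributes /dev/sda` as root would print one line per watched attribute that the drive actually reports.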

 

Monitoring drives automatically

 

smartmontools also comes with the smartd daemon. This service monitors the device status and notifies us through a configurable action.

Enable and start the service, then edit smartd.conf according to your needs.
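On a systemd-based distribution, enabling the daemon could look like the sketch below. The unit name smartd is an assumption (some distributions ship it under another name, such as smartmontools), and enable_smartd is a made-up helper name.

```shell
# enable_smartd is a hypothetical helper; $1 optionally overrides the unit name.
# Assumption: a systemd-based distribution where the unit is called "smartd".
enable_smartd() {
    unit="${1:-smartd}"
    # --now starts the unit immediately in addition to enabling it at boot.
    systemctl enable --now "$unit"
}
```

Called as root, `enable_smartd` uses the default unit name, while `enable_smartd smartmontools` targets a differently named unit.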

You can schedule long and short tests during the week. For instance, the following will do a short scan every day at 2 am, and a long one on Sundays at 3 am. See the man page for details.
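A schedule like that can be expressed with the -s directive of smartd.conf. The sketch below only prints the configuration entry for a given device; the helper name smartd_schedule_line is invented, and monitoring everything with -a is an assumption about what you want to watch.

```shell
# smartd_schedule_line is a hypothetical helper that prints one smartd.conf entry.
# $1 is the device to monitor, e.g. /dev/sda.
smartd_schedule_line() {
    # -a  monitor all SMART attributes and logs
    # -s  self-test schedule, a regular expression matched against T/MM/DD/d/HH:
    #     S/../.././02  short test, any month, any day, any weekday, at 02h
    #     L/../../7/03  long test, any month, any day, weekday 7 (Sunday), at 03h
    printf '%s -a -s (S/../.././02|L/../../7/03)\n' "$1"
}
```

The printed line goes into smartd.conf (commonly /etc/smartd.conf). Note that smartd ignores every line after an active DEVICESCAN entry, so a per-device entry has to be placed before it or replace it.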

You can also specify whether you want your hard drive to be woken up for the test if it is idle or sleeping, see the powermode options here.

Adding -m my@email.com will result in an email being sent when problems are detected. We can also have our own scripts called with the -M exec switch, in which case the diagnostic information is available in environment variables such as SMARTD_MESSAGE and SMARTD_FAILTYPE. See the man page for the full list.

For instance, this simple script will warn all logged-in users on their terminals
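A minimal sketch of such a script, assuming the standard wall utility is available and that smartd invokes the script through -M exec; the function name warn_users and the message wording are illustrative only.

```shell
#!/bin/sh
# Hypothetical notification script, meant to be referenced from smartd.conf
# with "-m <nomailer> -M exec /path/to/this/script" (the path is a placeholder).
warn_users() {
    # smartd exports these variables before running the script.
    printf 'SMART warning (%s): %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" "${SMARTD_MESSAGE:-}" | wall
}

# Only broadcast when smartd actually provided a message.
if [ -n "${SMARTD_MESSAGE:-}" ]; then
    warn_users
fi
```

wall reads the message from standard input and writes it to the terminal of every logged-in user, so the script needs no further arguments.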

 

For more details and examples, check the Arch Wiki, and /usr/share/doc/smartmontools/examples.

 

Resources

 

https://www.smartmontools.org/wiki

https://linux.die.net/man/8/smartd

https://linux.die.net/man/5/smartd.conf

Five stats to predict hard drive failure

What stats indicate hard drive failure

Author: nachoparker

