Testing SSD Health

Seek Time

Zero OS uses a seek time metric to determine if disks are HDD or SSD. If your SSD is being detected as HDD, it’s possible that there’s an underlying hardware issue causing poor performance. In this post you’ll find info on a few different diagnostics that can help identify SSDs with such issues.

Before you proceed, it’s a good idea to try some basic troubleshooting:

  • Reboot the node
  • Reseat the cable terminations at the drives and at the mainboard
  • If you have storage controller cards, reseat those as well

If you’ve already tried those steps, the rest of this post can help you see what Zos is seeing and identify which disk is at fault. Once you’ve done that, try moving the affected drive to a different port on the mainboard.

Live Linux

To run these tests, first boot into a live Linux distro like grml (the small image is okay for our purposes here). You can flash the iso file onto a USB stick with a tool like USBImager, attach the USB stick to your 3Node, and select it as a temporary boot device.

All commands in this tutorial should be run as root. Grml will present you with a root shell after boot. On other distros you may need to open a terminal and switch to the root user:

sudo su root

Then get a list of all disks on the system:

fdisk -l

Identify the drives attached to the system, including the USB stick you booted from, according to their models and sizes. Make note of the drive paths, normally /dev/sd... or /dev/nvme... for NVMe.
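The fdisk output can get long on a system with many partitions. Here is a small sketch of a filter that keeps only the lines naming each disk, its size, and its model. The function name summarize_disks is my own invention, and it assumes a reasonably recent fdisk that prints a "Disk model:" line; older versions may only show the path and size.

```shell
# Condense `fdisk -l` output to the lines that name each disk, its size,
# and its model. Reads fdisk output on stdin, so it also works on saved output.
# summarize_disks is a hypothetical helper name, not part of any tool.
summarize_disks() {
    grep -E '^Disk (/dev/|model:)'
}

# Usage (as root):
#   fdisk -l | summarize_disks
```

Loop devices and the USB stick you booted from will show up too, so you still need to match models and sizes by eye.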

Check Seek Times

We can use the same tool that Zero OS uses to check seek times. For convenience, I’ve provided a precompiled copy of the program at a shortened url. Use these commands to download and make the file executable:

wget tinyurl.com/seektime
chmod u+x seektime

To run seektime on a single disk, specify its path (note that for NVMe, a namespace designation like n1 is required):

# SATA
./seektime /dev/sda
./seektime /dev/sdb
...

# NVMe
./seektime /dev/nvme0n1
./seektime /dev/nvme1n1
...

To check all disks of a given type, a wildcard in a short script can be used:

# SATA
for disk in /dev/sd?; do ./seektime "$disk"; done

# NVMe
for disk in /dev/nvme?n1; do ./seektime "$disk"; done
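If a system has no disks of one type, the wildcard is passed through as literal text and seektime gets a path that doesn't exist. A minimal sketch of one way to guard against that, covering both disk types in one go, is below. The helper name candidate_disks and its optional directory argument are assumptions of mine for illustration.

```shell
# Print whole-disk device paths under a directory (default /dev).
# The patterns mirror the wildcards above; partitions such as sda1 or
# nvme0n1p1 do not match. An unmatched pattern is left as literal text by
# the shell, so the -e test skips it.
# candidate_disks is a hypothetical helper name.
candidate_disks() {
    dir=${1:-/dev}
    for disk in "$dir"/sd? "$dir"/nvme?n1; do
        [ -e "$disk" ] && printf '%s\n' "$disk"
    done
    return 0
}

# Usage (as root, from the directory holding seektime):
#   candidate_disks | while read -r disk; do ./seektime "$disk"; done
```

Like the wildcards it copies, this only matches single-character names, so a system with more than 26 SATA disks or ten or more NVMe controllers would need wider patterns.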

The output will include the measured average seek time and a determination about whether the disk is SSD or HDD. Here’s an example output, with seek values in microseconds:

/dev/sda: SSD (103 us)
/dev/sdb: HDD (3336 us)

Zos assumes that anything with a seek time greater than 500 microseconds, or 0.5 milliseconds, is an HDD. If your SSD is only performing at HDD levels, this is a sign that the disk is failing and should be replaced.
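On a node with many disks, it can help to print only the ones over the threshold. The sketch below assumes the output format shown in the example above ("path: TYPE (N us)") and applies the 500 microsecond cutoff described in this post. The function and variable names are made up for illustration; this is not how Zos itself does the comparison.

```shell
# Threshold from the post: above 500 microseconds, Zos treats the disk as HDD.
THRESHOLD_US=500

# Read seektime output lines such as "/dev/sda: SSD (103 us)" on stdin and
# print only the disks whose measured seek time exceeds the threshold.
# Assumes the "(N us)" format shown in the example output.
# slow_disks is a hypothetical helper name.
slow_disks() {
    sed -n 's/^\([^:]*\): .*(\([0-9][0-9]*\) us).*$/\1 \2/p' |
        awk -v limit="$THRESHOLD_US" '$2 > limit { print $1 " " $2 " us" }'
}

# Usage (as root):
#   for disk in /dev/sd?; do ./seektime "$disk"; done | slow_disks
```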

Additional Diagnostics

Smartctl for SATA

For SATA SSDs, we can get access to built in diagnostics using smartctl. This is included with grml, but you might need to install it on other distros.

Initiate short test

First, we’ll initiate a short self test on the disk:

smartctl -t short /dev/sda

Or use a script like above to test all disks:

for disk in /dev/sd?; do smartctl -t short "$disk"; done

Check the results

This will take a minute or two. You can check whether the test is complete with:

smartctl -c /dev/sda

Look for “Self-test execution status”. If it says in progress, you’ll need to wait a bit longer. Otherwise, you can now query the results. A simple pass or fail result can be retrieved with:
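If you'd rather not keep re-running the check by hand, a small polling loop can wait for you. This is a sketch that assumes smartctl reports a running test with the phrase "Self-test routine in progress", which is the wording I'd expect from current smartmontools; check your own output if the loop exits right away. The helper name selftest_running is hypothetical.

```shell
# Succeeds (exit status 0) while the `smartctl -c` output on stdin reports
# a self-test that is still running.
# Assumes the status text contains "Self-test routine in progress".
# selftest_running is a hypothetical helper name.
selftest_running() {
    grep 'Self-test routine in progress' >/dev/null
}

# Usage (as root): poll every 10 seconds until the test finishes,
# then print the pass or fail result.
#   while smartctl -c /dev/sda | selftest_running; do sleep 10; done
#   smartctl -H /dev/sda
```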

smartctl -H /dev/sda

For the full results:

smartctl -a /dev/sda | less

This will pipe the results into less so you can scroll through them. Hit q to exit when you’re done.

Interpreting the results

One common indicator of a failing disk is attribute id #5, the reallocated sector count. Drives are built to tolerate a certain number of reallocated sectors, but if you see a number higher than zero in the raw value here, that can be a sign of trouble.
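To pull out just that number, you can filter the attribute table that smartctl -A prints. The sketch below assumes the usual ATA table layout, where the attribute ID is the first column and the raw value is the tenth. Some drives format their tables differently, so treat this as a convenience and not a replacement for reading the full output. The function name is my own.

```shell
# Print the raw value of SMART attribute 5 (reallocated sector count)
# from `smartctl -A` output on stdin.
# Assumes the usual ATA attribute table: ID in column 1, raw value in column 10.
# reallocated_raw is a hypothetical helper name.
reallocated_raw() {
    awk '$1 == "5" { print $10 }'
}

# Usage (as root):
#   smartctl -A /dev/sda | reallocated_raw
```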

While the SMART data can be useful in identifying failed or failing disks, not every disk with a problem can be spotted this way. Zero OS simply looks at the performance of the disk, and treats it accordingly.

You can consult other resources online for more information about what the output of smartctl means.

Long test

You can also run a full test, which will scan the entire drive and can take hours to complete. Just specify long instead:

smartctl -t long /dev/sda

Smartctl for NVMe

NVMe drives have some support under smartctl. It seems that NVMe drives don’t all support triggering a self test like on SATA drives. However, they do still collect health info that can be queried, for example:

smartctl -a /dev/nvme0
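The NVMe report is fairly long, so here is a sketch of a filter for a few fields that tend to be worth a first look. The field names are the ones I'd expect in smartctl's NVMe health section; which ones matter most is a judgment call on my part, and the helper name is hypothetical.

```shell
# Keep a handful of health fields from `smartctl -a` output for an NVMe drive.
# Reads the report on stdin. The choice of fields is an assumption about what
# is most useful; the full report has more.
# nvme_health_summary is a hypothetical helper name.
nvme_health_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors):'
}

# Usage (as root):
#   smartctl -a /dev/nvme0 | nvme_health_summary
```

A nonzero critical warning or media error count, or a percentage used near 100%, would be a reason to look at the full report.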

Conclusion

If Zero OS is detecting your SSD as HDD, the best course of action is probably to replace the disk. Running the tests in this post can help to identify disks with slow seek times and other issues as reported by onboard diagnostics. A firmware update for the disk could be worth trying, as well as moving it to a different main board port.

I’m not aware of any other troubleshooting for issues like this, but if you are, let me know below and I’ll incorporate it into this post. Please also share your experience in the replies if you’ve had issues with disks, tried any diagnostics, or have additional tips!

Great post and guide. Thanks for this Scott.

I wonder if using the badblocks function could help resolve this issue. The example here is with the disk sda.

sudo badblocks -svw -b 512 -t 0x00 /dev/sda

Perhaps someone tried it before.

Hey @Mik, badblocks is definitely worth a mention here, as another tool that can help identify bad disks.

The -svw option set is a destructive test that will write over all the data on the disk. Without the -t option, it uses the default of four write patterns, which will take quite some time, especially on larger disks.

An alternative is -svn, which will do a non-destructive single pass test. It’s much quicker and should leave the original data intact.

Since I wrote this post, one farmer responded that they solved their issue by rearranging the disks within their system. I’ll add these troubleshooting steps, which maybe should have been obvious: reseat the cables and try different connectors on the mainboard!

The Ubuntu live image offers GUI seek time and read/write benchmarking that is very useful for checking drives. It’s in the Disks utility: just click the gear icon, then Benchmark.

Great point!

Ubuntu really is user-friendly, especially if you don’t want to get into the command line.

I realized this video actually covers getting to ubuntu and the only modification of procedure needed for benchmarking is choosing that instead of format!

Great tip, thanks @ParkerS. I wanted to provide a way to run the exact same test that Zos runs, but the gui tools in Ubuntu should also be helpful in identifying issues with SSDs.

maybe @flowmotion could add it to his boot/backup bootable image?

The whole test stuff from the first post? sounds interesting!

Just adding the seektime executable in a way that can check all the disks and print the output on screen would be a great start. I’d consider everything beyond that to be extra :)