A Few Common Questions About Drive Errors
Here are a few common questions which come into our support department about disk drive errors. In particular this article will discuss SATA disk drives.
There is a common and fairly accurate analogy that a magnetic disk drive (rotating platters with heads “flying” over them) is like a super-fast airplane flying hundreds of miles per hour at a very low height, say 10-20 feet.
If in this analogy there is a 50 foot tall boulder or debris in the path of the airplane there will be a “crash”. There can be damage to the airplane (disk heads), and there can be damage to the boulder (disk platter).
This is a very bad thing: a crash physically scrapes your data off of the drive platter, and the material it knocks loose becomes new debris which can lead to more crashes.
This debris can appear in the drive as the drive ages, or if the drive has been subject to physical abuse or impact.
Another way that physical damage can happen is if the drive is running and is “jarred” or impacted. Think of bumping or even dropping your laptop computer as it is running. The impact can travel to the disk drive and cause the disk heads to “slap” into the media, causing physical damage, pieces of the media to flake off, etc. Remember, a head-slap is similar to an “airplane flying into a boulder” slap – probably a very bad thing!
Bit Rot is the term for areas of the drive which change as the drive media, heads, or circuitry age. These changes may be very slight, but they may make areas of the drive “weaker” or less able to hold the magnetic pattern (your data) written on those areas. Or perhaps there was a very fast and slight power glitch just as the data was being written to or read from the drive. The media may not have any physical damage, but the data on it isn’t correct.
When data is written to a disk it includes error-checking data (ECC) which the drive can use to:
1. check that the data being read is still valid, and
2. possibly correct the data if it is no longer valid.
ECC methods can detect a single bad bit of data out of the entire block (normally 512 bytes or 4096 bits). That’s pretty impressive error detection and correction!
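To make the detect-and-correct idea concrete, here is a minimal sketch using a Hamming(7,4) code, a textbook scheme that protects 4 data bits with 3 check bits. It is an illustration only: real drives use much stronger codes over a whole block, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (list of 0/1) into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # check bit covering positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # check bit covering positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # check bit covering positions 4, 5, 6, 7
    # Codeword positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome names the bad position
    if position:
        c[position - 1] ^= 1          # step 2: correct the single bad bit
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1                    # simulate one bit going bad on the media
    data, position = hamming74_decode(stored)
    print(data, position)             # prints [1, 0, 1, 1] 5
```

A code this small can only repair one bad bit per codeword; two bad bits in the same codeword would be "corrected" to the wrong data, which is one reason real drives use stronger schemes.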
If the data cannot be recovered using ECC methods then the computer or drive will probably retry the read or write operation a few times to see if the error was just a fluke.
If the data still cannot be recovered the drive will try to reallocate the bad block – moving the data from the bad block to one of the drive’s spare blocks.
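The retry-then-reallocate sequence described above can be sketched as a toy model. Everything here is hypothetical (the ToyDrive class, the raw_read callback, the retry count of 3); real drive firmware is considerably more involved and its exact behavior is not documented in this article.

```python
class ToyDrive:
    """Toy model of retry-then-reallocate; not real drive firmware."""

    def __init__(self, spare_blocks, max_retries=3):
        self.spares = list(spare_blocks)   # unused spare physical blocks
        self.remap = {}                    # bad block -> spare block
        self.max_retries = max_retries     # assumed value, for illustration
        self.reallocated_count = 0         # how many blocks were remapped

    def read(self, block, raw_read):
        """raw_read(physical_block) returns True if the data came back valid."""
        physical = self.remap.get(block, block)
        # First attempt plus a few retries, in case the error was a fluke.
        for attempt in range(1 + self.max_retries):
            if raw_read(physical):
                return "ok" if attempt == 0 else "recovered on retry"
        # Still bad: move the block to a spare, if any are left.
        if not self.spares:
            return "failed, no spares left"
        self.remap[block] = self.spares.pop(0)
        self.reallocated_count += 1
        return "reallocated"


if __name__ == "__main__":
    drive = ToyDrive(spare_blocks=[1000, 1001])
    media = lambda physical: physical != 42   # block 42 never reads cleanly
    print(drive.read(7, media))    # ok
    print(drive.read(42, media))   # reallocated
    print(drive.read(42, media))   # ok (now served from spare block 1000)
```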
Differences between SAS/SCSI/FC and SATA
Since this article is specifically about SATA drives we won’t take too much time to discuss SAS/SCSI/FC drives, other than to point out that SAS/SCSI/FC drives allow the user to adjust the retry and ECC process. SATA drives do not allow these changes or adjustments to be made.
Differences Between “Consumer” and “Enterprise” Drives
Enterprise class drives are typically configured in the server into RAID arrays. RAID arrays can deal with errors without help from the drive, and so they don’t need or want the drive to take the time to try to do its own error correction. Drive error correction methods take finite amounts of time.
In fact a SAS/SCSI/FC drive will show you various types of error counts: errors whose correction took a lot of time (probably retries) and those that took less time (probably ECC).
Even non-RAID types of enterprise applications may need the drive to treat errors differently from one case to another. For example, a system that is collecting video data which is coming in very fast and can only be captured once may want to skip all types of error correction and just try to capture all of the data, errors or not.
Compare that with a disk drive holding your bank balance, where speed is not anywhere near as important as data perfection.
As mentioned above, SAS/SCSI/FC drives allow you to adjust these error correction methods while SATA drives do not.
This is why there are “desktop” SATA drives and “enterprise” SATA drives: the drives have different firmware to deal appropriately with errors in the intended application or use of the drive.
To check for drive errors on a SATA drive you need to look at the drive’s SATA SMART DATA ATTRIBUTES, in particular ATTRIBUTES 5 (Reallocated Sector Count: sectors or blocks which have already been reallocated) and 197 (Current Pending Sector Count: sectors waiting to be reallocated). In the STB Suite you can see these ATTRIBUTES by using the top menu ATA/SATA->Commands->View SMART Data function. In multi-drive mode (DMM) you use the SMART test step to record the ATTRIBUTES to the log files and also to screen or fail drives which exceed your chosen thresholds for these counts.
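If you want to pull these two counts out of a text report with a script, here is a minimal sketch. It assumes the attribute table is in the column layout printed by the separate smartmontools utility (smartctl -A); that is an assumption for illustration, since the STB Suite log format is not shown in this article and may differ. The sample values and the function name are made up.

```python
# Sample text in the smartctl -A column layout (values are invented).
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} for every row with a numeric raw value."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split(None, 9)          # 10 columns; the last is RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue                          # skip the header and blank lines
        raw_tokens = fields[9].split()
        if raw_tokens and raw_tokens[0].isdigit():
            attributes[int(fields[0])] = int(raw_tokens[0])
    return attributes


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE_REPORT)
    print("Reallocated (5):", attrs.get(5))
    print("Pending (197):", attrs.get(197))
```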
SATA drives will automatically check for errors every time a READ or a WRITE operation is executed. If the READ/WRITE fails then the drive will try to recover the data and possibly mark the sector as bad and reallocate it to a spare sector. If that happens you should see the SMART ATTRIBUTES 5 and/or 197 increment.
Note: this repair or reallocation is automatic in SATA drives, so a good way to scan a drive to try to get any bad sectors reallocated is to use the STB Suite to do a simple Sequential Read test to the entire drive – all blocks.
In my personal opinion and use I would never use a drive to hold valuable data if it had any reallocated sectors. That means I personally would reject any drive with > 0 reallocated sectors.
That’s just my opinion – here is my reasoning:
If the drive reallocated a sector it will always be because of a “problem”. That problem could be as benign as a power fluctuation during the read or write, where there really isn’t anything in the drive causing the problem. On the other hand, that error could be because the drive has been impacted while running, causing head slap and media damage. You can’t know why a sector was reallocated, only that it was reallocated. I would not take the chance. But of course the threshold settings used to reject a drive for this cause are 100% user decided and defined in the STB Suite.
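The pass/fail decision itself is simple to express. The sketch below is not how the STB Suite implements its thresholds; it is a hypothetical helper that applies user-chosen limits to the two counts, with defaults of 0 to match the strict policy described above.

```python
def screen_drive(attributes, max_reallocated=0, max_pending=0):
    """Apply user-chosen limits to SMART raw counts.

    attributes maps SMART attribute ID -> raw count. A missing attribute is
    treated as 0 here, which is an assumption made to keep the sketch short.
    Returns (passed, reasons).
    """
    reasons = []
    reallocated = attributes.get(5, 0)
    pending = attributes.get(197, 0)
    if reallocated > max_reallocated:
        reasons.append(f"attribute 5 is {reallocated}, limit {max_reallocated}")
    if pending > max_pending:
        reasons.append(f"attribute 197 is {pending}, limit {max_pending}")
    return (not reasons, reasons)


if __name__ == "__main__":
    print(screen_drive({5: 0, 197: 0}))                      # strict policy, clean drive
    print(screen_drive({5: 3, 197: 0}))                      # strict policy, rejected
    print(screen_drive({5: 3, 197: 0}, max_reallocated=10))  # a looser, user-chosen limit
```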
In a word – “no”.
In one sense SATA drives are simple as far as discovering and trying to correct or repair errors goes. There are really no user-accessible adjustments to change. A simple sequential read test will discover and hopefully correct any errors found.
A DMM test sequence to do this would look something like this:
Note: This is a non-destructive test – it will not damage, change, or overwrite any user data.
- Use the SMART test step to record all of the pre-test SMART ATTRIBUTE values to the .log files.
- Do a Sequential Read test of the entire drive, all blocks.
- Then do another SMART test step, so you can compare ATTRIBUTES 5 & 197 to see if any new bad sectors were discovered and remapped.
Your test sequence should therefore be: SMART test step, Sequential Read of the entire drive, SMART test step.
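Comparing the two SMART snapshots comes down to a subtraction per attribute. Here is a minimal sketch, assuming you have already extracted the raw counts from the pre-test and post-test logs into dictionaries; the function name and the example numbers are invented.

```python
def smart_delta(before, after, watched=(5, 197)):
    """Return {attribute_id: change in raw count} between two snapshots."""
    return {attr: after[attr] - before[attr] for attr in watched}


if __name__ == "__main__":
    pre_test = {5: 0, 197: 3}     # recorded by the first SMART step
    post_test = {5: 2, 197: 1}    # recorded after the Sequential Read
    delta = smart_delta(pre_test, post_test)
    print(delta)                  # prints {5: 2, 197: -2}
    if delta[5] > 0:
        print(f"{delta[5]} sector(s) were reallocated during the read test")
    if delta[197] > 0:
        print(f"{delta[197]} new sector(s) are waiting to be reallocated")
```

A rise in attribute 5 means the read pass caused sectors to be remapped. A rise in attribute 197 means new suspect sectors were found that have not been remapped yet.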