Disk Drive Health Issues

In the best case, the drive executes a command in the shortest time possible. In a “good-but-not-best” case, the drive executes the command successfully, but error correction makes it take slightly longer to complete. There may be several levels of “good-but-not-best”, each taking more time to complete than the level before. Finally, there is the worst case, where the drive spends considerable time trying to execute the command and ultimately fails.

To make matters more interesting, a single disk drive may exhibit all of these cases: data retrieval from one area of the drive may complete in the fastest possible time, while transfers from another area are as slow as they can be without actually failing.

In the world of data transfer, time is the enemy. The faster a command executes the better.

How can you quantify the real-world I/O performance of a disk drive? You need a macro view of the entire drive that also provides a micro view of all problem areas. In the case of disk read performance, you need to read all blocks on the drive and record the time each read takes to complete. The shorter the completion times, the better. To visualize this data, graph it as block number versus I/O completion time.
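As a rough illustration of the measurement described above, the following Python sketch reads a device or file sequentially and records how long each read takes. The function name, the `path` argument, and the 512-byte block size are assumptions made for this example, not part of any particular test tool. Note that the operating system's cache can hide the drive's real latency, so a production tool would normally issue READ commands directly to the drive, and reading a raw device usually requires administrator privileges.

```python
import time

def time_block_reads(path, block_size=512, blocks_per_read=1):
    """Read `path` sequentially; return a list of (block_number, seconds).

    `path` is a hypothetical device node or image file. The block size is
    an assumption; use the drive's actual logical block size.
    """
    chunk = block_size * blocks_per_read
    timings = []
    block_number = 0
    # buffering=0 opens an unbuffered file, so each read() is one OS read call.
    with open(path, "rb", buffering=0) as dev:
        while True:
            start = time.perf_counter()
            data = dev.read(chunk)
            elapsed = time.perf_counter() - start
            if not data:  # end of device or file
                break
            timings.append((block_number, elapsed))
            block_number += blocks_per_read
    return timings
```

The returned list can be plotted directly as block number (X) against completion time (Y).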

Look at the graph in Figure 1. The X axis is the block number read and the Y axis is the time the READ I/O took to complete. The Red plot shows the slowest I/O time out of the past 100 reads. The Green plot shows the average completion time of the last 100 reads, and the Blue plot shows the average of all reads.
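The three plots can be derived from the raw per-read times with a sliding window. The sketch below is one possible way to compute them; the function name and return layout are assumptions for illustration, and the window of 100 reads follows the description of the graph.

```python
from collections import deque

def rolling_stats(times, window=100):
    """Return three lists, one value per read:
    slowest time in the last `window` reads (Red),
    average of the last `window` reads (Green),
    average of all reads so far (Blue).
    """
    recent = deque(maxlen=window)  # automatically drops the oldest read
    total = 0.0
    slowest, window_avg, overall_avg = [], [], []
    for count, t in enumerate(times, start=1):
        recent.append(t)
        total += t
        slowest.append(max(recent))
        window_avg.append(sum(recent) / len(recent))
        overall_avg.append(total / count)
    return slowest, window_avg, overall_avg
```

Until the first 100 reads have completed, the window statistics are computed over however many reads are available.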

Figure 1

Notice the large difference between the slowest I/O completion time and the average. In this case something is causing the drive to take much longer to complete some reads. Also notice the correlation between the Red and Green plots. Because the Green plot shows the average completion time across 100 I/Os, comparing it with the Red plot indicates how many of those I/Os are slow.

For example, at block number 400,000 the Red plot shows 0.05 seconds and the Green plot shows around 0.012 seconds. This tells us that only a few blocks within that 100-block range had trouble. On the other hand, at block 600,000 the Red plot peaks again around 0.06 seconds, while the Green average is almost the same. This indicates that most, if not all, of the blocks in this region are having trouble.
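This reasoning can be made approximately quantitative. The sketch below assumes a simple two-level model in which every read in the window takes either a normal baseline time or roughly the slowest observed time. The baseline is not stated in the text; the 0.010-second value used in the usage note is purely a hypothetical figure for illustration.

```python
def estimate_slow_reads(window_max, window_avg, baseline, window=100):
    """Estimate how many reads in a window were slow.

    Assumed model: k reads take about `window_max`, the rest take
    `baseline`, so avg = (k*max + (window - k)*baseline) / window.
    Solving for k gives the expression below.
    """
    if window_max <= baseline:
        return 0.0  # nothing in the window is slower than normal
    fraction = (window_avg - baseline) / (window_max - baseline)
    fraction = min(1.0, max(0.0, fraction))  # clamp against noisy inputs
    return fraction * window
```

With the hypothetical 0.010-second baseline, `estimate_slow_reads(0.05, 0.012, 0.010)` gives about 5 slow reads out of 100, consistent with "just a few blocks", while a window whose average equals its maximum gives 100.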

What causes slow I/O completion times? Usually it stems from the drive having to perform varying degrees of error correction. We recommend looking at two suspect areas: grown defects and error correction counts.

Figure 2 shows the Primary defect list of the drive, and Figure 3 shows that there are no Grown defects.

Figure 2

Figure 3

This implies that the problem is not related to the drive accumulating defects. That leaves the question: is the drive having to do excessive amounts of error correction, which in turn slows command execution? To answer this you must examine the Error Correction logs for the drive. A summary of these logs is shown in Figure 4.

Figure 4

As noted in Figure 4, the drive is recording many read retry efforts.

What causes retries? Why would the drive need to retry the read operation? It could be any of a number of issues: media damage due to a head crash, wear and aging of the media, or servo tracking problems possibly caused by temperature extremes or power loss during writes. Retries over a very large block range may indicate disk head or preamp problems.

To determine whether the problem is permanent, we recommend two steps:

Reformat the drive. This will erase the existing grown defects, and the format media scan will record new ones wherever it finds a problem.
Write to all blocks of the drive. This will force the drive to update both the sector data and the error correction code data.

After performing these two steps, run the Performance/Error graph test again and note any differences. As Figure 5 shows, the problem area still has a few blocks that read slowly, but the average I/O completion time is now low. Again, this indicates that only a few blocks are having problems.

Figure 5
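One way to note the differences between the two test runs is to summarize each run against a fixed slowness threshold. The sketch below is only an illustration; the function name, the dictionary layout, and the choice of threshold are assumptions, and both runs are assumed to contain at least one read.

```python
def compare_runs(before, after, threshold):
    """Summarize two lists of per-read times (seconds).

    `threshold` is a caller-chosen time above which a read counts as slow.
    """
    def summarize(times):
        if not times:
            raise ValueError("each run needs at least one read time")
        return {
            "slow_reads": sum(1 for t in times if t > threshold),
            "average": sum(times) / len(times),
            "slowest": max(times),
        }
    return {"before": summarize(before), "after": summarize(after)}
```

A run in which the slowest time is unchanged but the count of slow reads and the average have dropped matches the pattern described for Figure 5.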

Checking the drive Error Correction logs in Figure 6 now shows that fewer corrections are happening: the format/write process was able to repair the problem. It would still be prudent to watch the drive by occasionally repeating the same test and log examination sequence, to be sure that the problem was indeed temporary.

Figure 6

We have shown how to view the I/O performance of the entire drive, and two methods to try to interpret this information and repair the problems that were found.