Correctable disk errors and Audio/Video applications

In typical A/V applications such as non-linear editing, personal video recorders, or video on demand, disk drives are called upon to play and record one or more streams of video and audio data. Each data stream must be maintained at a certain data rate; otherwise video and audio dropouts may occur, degrading playback or, even worse, permanently losing valuable data. The more data streams there are, the more critical drive throughput becomes. This article addresses one issue that can have a critical effect on drive performance: correctable errors.

A correctable read error occurs when a drive tries to read a block of data and the data cannot be correctly read. When this occurs the drive electronics can use a number of methods to try to deliver the data. These methods can include retrying the read, retrying with slight physical head offsets, or using error correction algorithms. It is a tribute to error correction that a block of data with a relatively large error can be corrected and the data recovered, but this ability comes at a price, and that price is time. It takes time to process an error correction algorithm, and it takes time to physically re-read data. And time takes its toll on data throughput.

To illustrate the impact that read errors can have on performance, we will build a test that reads a contiguous stream of data from a disk. We will measure the baseline data throughput of this drive by reading 10 MB of data while timing how long the read takes to complete. To ensure that we are reading from an area of the disk with no errors, we will check the drive's log pages to confirm that no error correction was applied during the read.
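
The timed read described above can be sketched in a few lines of Python. This is only an illustration, not the tool used for the measurements in this article: the function names and the 1 MB request size are assumptions, and on most operating systems the page cache must be bypassed or flushed before the figure reflects the drive rather than host memory.

```python
import time

CHUNK_SIZE = 1024 * 1024          # assumed request size: 1 MB per read call
TOTAL_BYTES = 10 * 1024 * 1024    # 10 MB = 10,485,760 bytes, as in the test


def timed_read(path, total_bytes=TOTAL_BYTES, chunk_size=CHUNK_SIZE):
    """Read total_bytes sequentially from path; return (bytes_read, seconds)."""
    bytes_read = 0
    # buffering=0 removes Python-level buffering only; the OS page cache
    # may still serve the data unless it is bypassed by other means.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        while bytes_read < total_bytes:
            chunk = f.read(min(chunk_size, total_bytes - bytes_read))
            if not chunk:          # reached end of file or device
                break
            bytes_read += len(chunk)
        elapsed = time.perf_counter() - start
    return bytes_read, elapsed


def throughput_mb_per_s(bytes_read, seconds):
    """Sustained rate in MB/s, taking 1 MB as 1,048,576 bytes."""
    return bytes_read / (1024 * 1024) / seconds
```

Calling `timed_read` on a raw device path (which normally requires administrator rights) and passing the result to `throughput_mb_per_s` gives the kind of sustained-rate figure discussed in the rest of this article.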

Once the drive's baseline read performance is quantified we will introduce a fixed number of correctable data errors to the drive, then re-run the timed read test to measure what effect correcting the errors has on the data transfer rate.

As a further step we will modify how the drive responds to correctable errors, setting it to perform minimal (if any) error correction. With specific Mode Page parameters modified, the drive will ignore the previously correctable errors, transfer the corrupted data to the host, and continue reading. In an A/V application, accepting occasional small amounts of incorrect data in exchange for maintaining the data transfer rate is an acceptable compromise.
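
This article does not name the Mode Page parameters involved. As a hedged illustration, the sketch below assumes a SCSI drive and the standard Read-Write Error Recovery mode page (page code 01h), where byte 2 carries the recovery flags and byte 3 the read retry count. The function name is hypothetical, and the choice of setting the RC (Read Continuous) bit and zeroing the retry count is an assumption about what "minimal error correction" means, not the exact settings used for the measurements here.

```python
# Flag bits in byte 2 of the SCSI Read-Write Error Recovery mode page (01h),
# from bit 7 down to bit 0.
AWRE, ARRE, TB, RC, EER, PER, DTE, DCR = (1 << bit for bit in range(7, -1, -1))


def streaming_recovery_page(page):
    """Return a copy of mode page 01h with read recovery minimised.

    Assumption: 'minimal recovery' is taken to mean RC=1 (transfer data
    without stalling for recovery) and a read retry count of zero.
    """
    if len(page) < 4 or (page[0] & 0x3F) != 0x01:
        raise ValueError("not a Read-Write Error Recovery mode page")
    new_page = bytearray(page)
    new_page[2] |= RC      # Read Continuous: do not delay the transfer
    new_page[3] = 0        # read retry count
    return bytes(new_page)
```

The returned bytes would still have to be sent to the drive with a MODE SELECT command through whatever pass-through interface the host provides; that step is platform specific and is left out here.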

The graph below shows the data transfer rates sustained under the three conditions described. The blue bar shows the drive's baseline performance transferring 10 MB from a completely error-free portion of the drive. The sustained transfer rate is 59.69 MB/s – perfectly acceptable for video use.

The red bar shows the impact that two correctable errors within the 10 MB of data have on transfer rate: the sustained transfer rate falls from the baseline to 18.83 MB/s. The two errors represent approximately 40 bytes out of 10,485,760 bytes of data.
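
To put these figures in proportion, the short calculation below works out what fraction of the data was affected and how much time the two errors added. It is a back-of-the-envelope sketch: it takes the 59.69 MB/s figure quoted for the blue bar as the baseline and assumes the whole slowdown is attributable to the two corrections.

```python
TOTAL_BYTES = 10_485_760                  # 10 MB, as stated in the text
TOTAL_MB = TOTAL_BYTES / (1024 * 1024)    # 10.0
BAD_BYTES = 40                            # approximate size of the two errors
ERROR_COUNT = 2
BASELINE_MB_S = 59.69                     # assumed baseline (blue bar)
DEGRADED_MB_S = 18.83                     # rate with two correctable errors


def read_time_s(rate_mb_s, total_mb=TOTAL_MB):
    """Seconds needed to transfer total_mb at a sustained rate."""
    return total_mb / rate_mb_s


def delay_per_error_s(baseline_mb_s, degraded_mb_s, errors):
    """Extra read time per error, attributing all slowdown to the errors."""
    extra = read_time_s(degraded_mb_s) - read_time_s(baseline_mb_s)
    return extra / errors


bad_fraction = BAD_BYTES / TOTAL_BYTES

if __name__ == "__main__":
    print(f"affected data: {bad_fraction:.6%}")
    print(f"baseline read: {read_time_s(BASELINE_MB_S):.3f} s")
    print(f"degraded read: {read_time_s(DEGRADED_MB_S):.3f} s")
    per_error = delay_per_error_s(BASELINE_MB_S, DEGRADED_MB_S, ERROR_COUNT)
    print(f"delay per error: {per_error:.3f} s")
```

With these inputs the affected bytes are roughly 0.0004% of the transfer, yet the read takes about 0.53 s instead of about 0.17 s, or roughly 0.18 s of added delay per error.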

The green bar shows the effect of setting the drive to ignore correctable errors – the sustained transfer rate is now back in the acceptable range for video applications.

How are correctable errors normally detected when they occur? In a word, they are not. Unless the drive's mode pages are set up specifically to notify the host upon a corrected error – which is almost never the case – the host computer will never know that corrections are occurring. Specialized software can be used to query the drive to ascertain how many times error correction was needed, and what type was applied while reading the drive, but in general neither application software nor operating system software will be aware of this information.
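
As an illustration of what such specialized software does, the sketch below decodes a SCSI Read Error Counters log page (page code 03h) that has already been fetched with a LOG SENSE command. It assumes a SCSI drive and the standard log page layout (a 4-byte header followed by parameters, each with a 2-byte code, a control byte, a length byte, and a big-endian value); the function name is hypothetical, and fetching the page from the drive is platform specific and not shown.

```python
import struct

# Parameter codes defined for the SCSI error counter log pages.
READ_ERROR_COUNTERS = {
    0x0000: "errors corrected without substantial delay",
    0x0001: "errors corrected with possible delays",
    0x0002: "total rereads",
    0x0003: "total errors corrected",
    0x0004: "total times correction algorithm processed",
    0x0005: "total bytes processed",
    0x0006: "total uncorrected errors",
}


def parse_error_counter_page(buf):
    """Decode a raw error counter log page into (page_code, counters)."""
    page_code = buf[0] & 0x3F
    (page_length,) = struct.unpack_from(">H", buf, 2)
    end = min(len(buf), 4 + page_length)
    counters = {}
    offset = 4                               # first parameter follows header
    while offset + 4 <= end:
        code, _control, length = struct.unpack_from(">HBB", buf, offset)
        raw = buf[offset + 4:offset + 4 + length]
        name = READ_ERROR_COUNTERS.get(code, f"parameter 0x{code:04x}")
        counters[name] = int.from_bytes(raw, "big")
        offset += 4 + length
    return page_code, counters
```

Reading this page before and after a timed read, and comparing the counters, is one way to confirm whether any correction was applied during the transfer.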

Similarly, typical drive tests look only for hard failures, so these transparently corrected errors will not be revealed other than as a decrease in transfer rate. This decrease may be gradual, and may not surface until an inopportune time.

A better test methodology for drives in A/V use is to periodically read the entire drive sequentially while tracking the transfer rate and all corrected error information, and to store this information in a historical database. Each time the test is run on the drive, the database can be consulted and the transfer rates, number of correctable errors, and their locations on the drive compared, revealing drives that are ageing and degrading to the point of no longer being usable in A/V applications. This allows drive failure to be predicted in time to prevent loss of important data and delays in projects.
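
A minimal sketch of such a historical database, using SQLite from the Python standard library, might look as follows. The table layout, the function names, and the 10% rate-drop threshold are all assumptions chosen for illustration, and per-error locations are left out for brevity.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the scan history database."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS scans (
               drive_serial     TEXT,
               scanned_at       TEXT,     -- ISO 8601 date, sorts as text
               mb_per_s         REAL,
               corrected_errors INTEGER
           )"""
    )
    return db


def record_scan(db, serial, scanned_at, mb_per_s, corrected_errors):
    """Store the result of one full sequential read of a drive."""
    db.execute(
        "INSERT INTO scans VALUES (?, ?, ?, ?)",
        (serial, scanned_at, mb_per_s, corrected_errors),
    )
    db.commit()


def is_degrading(db, serial, max_rate_drop=0.10):
    """Compare the latest scan of a drive with its first recorded scan."""
    rows = db.execute(
        "SELECT mb_per_s, corrected_errors FROM scans "
        "WHERE drive_serial = ? ORDER BY scanned_at",
        (serial,),
    ).fetchall()
    if len(rows) < 2:
        return False                      # no history to compare against
    first_rate, first_errors = rows[0]
    last_rate, last_errors = rows[-1]
    slower = last_rate < first_rate * (1 - max_rate_drop)
    return slower or last_errors > first_errors
```

A drive flagged by `is_degrading` is a candidate for closer inspection or replacement before it is trusted with another A/V project; suitable thresholds would need to be tuned for the drives and data rates in use.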