Disk Drive Troubleshooting 101

Introduction

Disk drives are complex devices, marvels of mechanical engineering and real-time computing magic.

For example, the heads of a rotating magnetic disk physically fly over the moving platters at a height of as little as 3 nanometers and a speed of over 128 mph! No wonder bumps and drops can do so much damage – it is just like flying an extremely fragile airplane into metal-hard ground at 128 mph! Not only can the airplane (drive head) be damaged, but the ground (platter surface) can be dug up, furrowed, and scarred. The magnetic coating on the platter is like a thin layer of topsoil – scraping it away scrapes away data.

Bottom line – be gentle with your disk drives. You may be under pressure to get a large number of drives tested by the end of the day – but take your time. Move them gently and slowly. Never move a drive while it is spinning. Never drop a drive or bump it into anything. And keep the drive cooled while testing – never power up a disk drive without some kind of fan to move air around the drive to draw away the heat it generates.

All disk drives have a built-in computer to control all the physical operations of the drive as well as dealing with data encoding/decoding, queuing, transferring data in and out – a marvel of real-time computing!

Disk drives can work perfectly, fail to work at all, or “sort of” work, meaning they work marginally or poorly. The goal of basic disk drive testing is to:

determine if the drive is working at all or not
determine if the drive itself can tell us if it has had a problem in the past
determine if the drive can reliably store and retrieve data
determine if drive settings are appropriate for the intended use of the drive, and
determine the performance characteristics of the drive.

Is the drive working or not?

Determining if the drive is alive or not is simply a matter of connecting it to a test system and checking that the drive spins up, is “online”, and can report its capacity. Using the STB Suite Original mode, look at the device window or click the Scan System button to scan all of the storage controllers in your test system. Do you see the drive? Does it report a valid capacity? Does the drive information (manufacturer name, drive part number, firmware version) look reasonable?

If the drive does not show up at all on the test machine you must check all cabling and power for the drive test fixture. See if there are indicator LEDs on the drive that show any activity. Listen to see if you can hear that the drive is spinning, or gently feel if you can sense vibration from the drive.

If the drive does not report a reasonable capacity, for example if it reports that it has zero blocks or a negative number of blocks, then the drive may need a low-level format before you continue testing.
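To make the capacity check concrete, here is a minimal sketch that decodes the 8-byte response to a SCSI READ CAPACITY (10) command and flags an unreasonable result. The function name `check_capacity` and its messages are hypothetical and are not part of the STB Suite, which performs this check for you. Treating a last LBA of 0 as "no usable capacity" is an assumption of this sketch.

```python
import struct


def check_capacity(response: bytes):
    """Decode a READ CAPACITY (10) response and sanity-check it.

    Returns a (ok, message) tuple. Hypothetical helper for illustration.
    """
    if len(response) < 8:
        return False, "response shorter than 8 bytes"

    # Bytes 0-3: last logical block address, bytes 4-7: block length.
    # Both are big-endian and unsigned; reading them as signed values is
    # one way a tool can end up showing a "negative" block count.
    last_lba, block_len = struct.unpack(">II", response[:8])

    if last_lba == 0xFFFFFFFF:
        # The 10-byte command cannot express this capacity.
        return True, "capacity too large for READ CAPACITY (10); use READ CAPACITY (16)"

    # Assumption: a one-block or zero-length-block drive is not usable.
    if last_lba == 0 or block_len == 0:
        return False, "zero-sized capacity; drive may need a low-level format"

    blocks = last_lba + 1
    gigabytes = blocks * block_len / 1e9
    return True, f"{blocks} blocks of {block_len} bytes ({gigabytes:.1f} GB)"
```

For example, `check_capacity(struct.pack(">II", 0, 512))` reports a failure, while a response describing 1953525168 blocks of 512 bytes passes.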

If the drive information is jumbled or wrong – for instance, if the drive is a “SEAGATE” drive but the STB Suite is reporting it as “SEAGGGG” – you may have a dead drive or you may have a cabling/termination problem. Try moving the drive to a different slot, connector, or bus to see if the problem goes away.
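As an illustration of what "reasonable" drive information means, the sketch below pulls the vendor, product, and revision strings out of standard SCSI INQUIRY data (bytes 8-15, 16-31, and 32-35) and checks them. The `KNOWN_VENDORS` list and both function names are assumptions made up for this example; build your own list from the drives you actually handle.

```python
# Hypothetical allow-list; extend it with the vendors you actually test.
KNOWN_VENDORS = {"SEAGATE", "HGST", "HITACHI", "WDC", "TOSHIBA", "IBM"}


def parse_inquiry(data: bytes) -> dict:
    """Extract the identification strings from standard INQUIRY data."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")

    def field(start: int, end: int) -> str:
        # The fields are space-padded ASCII; undecodable bytes become U+FFFD.
        return data[start:end].decode("ascii", errors="replace").strip()

    return {
        "vendor": field(8, 16),
        "product": field(16, 32),
        "revision": field(32, 36),
    }


def identity_looks_sane(info: dict) -> bool:
    """True if every string is printable ASCII and the vendor is recognized."""
    text = "".join(info.values())
    all_printable = all(ch.isascii() and ch.isprintable() for ch in text)
    return all_printable and info["vendor"] in KNOWN_VENDORS
```

A vendor field of “SEAGGGG” is printable but not in the list, so it fails the check in the same way a field full of garbage bytes would.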

Here is what you should see:

Can the drive tell us it has had problems?

Disk drives store historical data which can be retrieved and analyzed. SCSI/FC/SAS drives store this type of information in LOG PAGES. ATA/SATA drives store this information in SMART data.

SCSI/SAS/FC Log Pages

A quick way to see an overview of some of the more important Log Pages is to double-click on the drive in the device selection window to bring up the Device Information display. Select the Error Data tab and you will see historical data describing how much data has been read and written and the number of and type of errors that have happened during reads and writes –

As a general rule – uncorrected errors are always a bad thing to have. Uncorrected errors usually will cause the LBA in question to be marked as bad.

For SCSI/SAS/FC drives this will mean an increase in the drive's grown defect list (G List). On the Device Information display click the Statistics tab and note the number of G List defects –

As another general rule – good disk drives don’t have any grown defects.
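The two general rules above are easy to turn into an automatic check. This is only a sketch: the `DriveCounters` record and the `health_findings` function are hypothetical names, and the counter values are assumed to have been read from the Error Data and Statistics tabs (or from saved log page data).

```python
from dataclasses import dataclass


@dataclass
class DriveCounters:
    """Counters copied from the drive's error and defect data (hypothetical record)."""
    read_uncorrected: int
    write_uncorrected: int
    grown_defects: int


def health_findings(counters: DriveCounters) -> list:
    """Return a list of rule violations; an empty list means no finding."""
    findings = []
    uncorrected = counters.read_uncorrected + counters.write_uncorrected
    if uncorrected > 0:
        findings.append(f"{uncorrected} uncorrected error(s)")
    if counters.grown_defects > 0:
        findings.append(f"{counters.grown_defects} grown defect(s) in the G List")
    return findings
```

A drive with all three counters at zero produces no findings. Whether a nonzero count is an outright reject or just a warning is a policy decision for you or your customer.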

For a detailed “raw” data view of every log page a drive has you can right-click on the drive in the device selection window, then from the Quick Command list choose View Log Pages–

Be sure to use the Browse button to select a log page definition file – the file “default.dat” is usually fine for any disk drive. The available Log Pages are shown on the left of the display. Double-clicking on a Log Page will display that page's parameters on the right.
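If you are curious what the “raw” view is decoding, the sketch below walks the standard SCSI log page layout: a 4-byte page header followed by parameters, each with a 2-byte parameter code, a control byte, a length byte, and a value. It assumes every parameter value is a big-endian binary counter, which holds for the error counter pages (for example page 03h, read errors, where parameter 0006h is the total uncorrected error count) but not for every log page. The function name is hypothetical.

```python
import struct


def parse_log_page(page: bytes) -> dict:
    """Split a raw LOG SENSE page into {parameter_code: integer_value}.

    Assumption: all parameter values are big-endian binary counters.
    """
    if len(page) < 4:
        raise ValueError("log page header is 4 bytes")

    page_code = page[0] & 0x3F                 # low 6 bits of byte 0
    (page_length,) = struct.unpack(">H", page[2:4])
    end = min(4 + page_length, len(page))      # tolerate truncated data

    parameters = {}
    offset = 4
    while offset + 4 <= end:
        code, _control, length = struct.unpack(">HBB", page[offset:offset + 4])
        value = page[offset + 4:offset + 4 + length]
        parameters[code] = int.from_bytes(value, "big")
        offset += 4 + length

    return {"page_code": page_code, "parameters": parameters}
```

With the read error counter page as input, `result["parameters"].get(0x0006, 0)` gives the uncorrected read error count discussed above.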

Note that you can save all of this information to a file. A good thing to do as you test drives is to build a database of the drives you’ve tested.
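One simple way to build such a database is a single SQLite file. The sketch below is an assumption about how you might organize it, not an STB Suite feature: the table name, column names, and pass rule (no grown defects and no uncorrected errors) are all hypothetical.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the drive test database. Pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS drive_tests (
               serial             TEXT NOT NULL,
               model              TEXT,
               tested_at          TEXT DEFAULT CURRENT_TIMESTAMP,
               grown_defects      INTEGER,
               uncorrected_errors INTEGER,
               passed             INTEGER
           )"""
    )
    return conn


def record_result(conn, serial, model, grown_defects, uncorrected_errors) -> bool:
    """Store one test result and return whether the drive passed."""
    passed = grown_defects == 0 and uncorrected_errors == 0
    conn.execute(
        "INSERT INTO drive_tests "
        "(serial, model, grown_defects, uncorrected_errors, passed) "
        "VALUES (?, ?, ?, ?, ?)",
        (serial, model, grown_defects, uncorrected_errors, int(passed)),
    )
    conn.commit()
    return passed
```

Over time a query such as `SELECT model, AVG(passed) FROM drive_tests GROUP BY model` shows which drive models fail most often.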

ATA/SATA SMART Information

This same type of historical error data is found in the SMART data for ATA and SATA drives. To view and save this information, choose ATA/SATA->Commands->View SMART Data from the STB Suite main menu. Select the drive of interest from the lists to the right and the SMART data will be displayed on the left.

Note: to learn about how to interpret SATA SMART data look at

http://en.wikipedia.org/wiki/S.M.A.R.T.

The top of the display shows all SMART attributes and indicates any pending problems. At the bottom of the display you will see attributes which may give actual error counts or other counts, such as Power-On Hours.
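The usual way to read the attribute table is to compare each attribute's normalized value with its threshold, and separately to look at the raw counts of a few sector-health attributes. The sketch below assumes you have already copied the attributes into `SmartAttribute` records (a hypothetical type). It also assumes the common conventions that a threshold of 0 marks an informational attribute and that IDs 5, 197, and 198 carry reallocated, pending, and offline-uncorrectable sector counts. Attribute meanings are vendor-specific, so treat the watch list as a starting point.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    """One row of the SMART attribute table (hypothetical record)."""
    attr_id: int
    name: str
    value: int       # normalized current value
    worst: int       # lowest normalized value seen so far
    threshold: int   # failure threshold set by the vendor
    raw: int         # raw counter, vendor-specific meaning


# Assumed watch list: commonly used IDs for sector-health counters.
RAW_WATCH_IDS = {5, 197, 198}


def failing_attributes(attrs) -> list:
    """IDs whose normalized value has dropped to or below the threshold."""
    return [a.attr_id for a in attrs if a.threshold > 0 and a.value <= a.threshold]


def raw_count_warnings(attrs) -> list:
    """IDs on the watch list whose raw count is nonzero."""
    return [a.attr_id for a in attrs if a.attr_id in RAW_WATCH_IDS and a.raw > 0]
```

An attribute can be above its threshold and still deserve attention, which is why the raw counts are checked separately.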

Can the drive reliably write and read data?

Obviously a disk drive must be able to reliably store and retrieve data. The best way to test these two functions is to run a test which first writes a known data pattern to every block on the drive. Then every block on the drive is read and, using data compare, each block is checked to ensure that it is exactly the same as what was written.
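The write-then-compare idea can be sketched in a few lines. This is an illustration only, not how the STB Suite implements it: the function names are hypothetical, and the example runs against an in-memory buffer standing in for a drive. Stamping each block with its own LBA is a design choice of this sketch, so that a block written to the wrong location is also caught. A real test is destructive and needs unbuffered device access so that reads come from the media rather than a cache.

```python
import io


def make_pattern(lba: int, block_size: int) -> bytes:
    """Known data for one block: the 8-byte LBA repeated to fill the block."""
    stamp = lba.to_bytes(8, "big")
    repeats = -(-block_size // len(stamp))   # ceiling division
    return (stamp * repeats)[:block_size]


def write_pattern(dev, n_blocks: int, block_size: int = 512) -> None:
    """Write the known pattern to every block."""
    for lba in range(n_blocks):
        dev.seek(lba * block_size)
        dev.write(make_pattern(lba, block_size))


def find_miscompares(dev, n_blocks: int, block_size: int = 512) -> list:
    """Read every block back and return the LBAs that do not match."""
    bad = []
    for lba in range(n_blocks):
        dev.seek(lba * block_size)
        if dev.read(block_size) != make_pattern(lba, block_size):
            bad.append(lba)
    return bad


# Stand-in for a drive: 8 blocks of 512 bytes in memory.
fake_drive = io.BytesIO(bytes(8 * 512))
write_pattern(fake_drive, 8)
miscompares = find_miscompares(fake_drive, 8)   # empty list when all blocks match
```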

Obviously writing and reading the entire drive will take some amount of time. Can you reliably determine if the drive's write/read functionality is OK by checking less than the entire drive? Technically, probably not – what if there is a problem with a block which you didn’t test? Statistically – maybe yes. The choice is up to you – balancing the accuracy of your results with the time it takes to complete a test.
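To put a number on “statistically – maybe yes”, the sketch below computes the chance that a partial test touches at least one bad block. It assumes the tested blocks are chosen uniformly at random without repeats and that bad blocks are scattered independently of that choice; real defects often cluster, so treat the result as a rough guide. The function name is hypothetical.

```python
def detection_probability(total_blocks: int, bad_blocks: int, sampled_blocks: int) -> float:
    """Probability that a random sample of blocks includes at least one bad block.

    Uses the hypergeometric "miss" probability written as a product over the
    bad blocks, so it stays cheap even for drives with billions of blocks.
    """
    miss = 1.0
    for i in range(bad_blocks):
        untested_left = total_blocks - sampled_blocks - i
        if untested_left <= 0:
            return 1.0   # not enough untested blocks left to hide every bad one
        miss *= untested_left / (total_blocks - i)
    return 1.0 - miss
```

With a single bad block, testing half the drive finds it only half the time, and testing 10% of the drive finds it only 10% of the time. The odds improve quickly when a failing drive has many bad blocks, which is why partial tests still catch badly damaged drives.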

The good news in this regard is the STB Suite Disk Manufacturing Module (DMM) is extremely efficient and fast – it can test many drives at once. DMM will tell you exactly how long a given test is going to take to complete so as you test more drives you will soon learn how many drives per hour or day you will be able to test to your company’s specification.
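DMM reports the real figure for you, but a back-of-the-envelope estimate is useful for planning. The sketch below assumes a full write pass plus a full read pass at a constant average transfer rate, and assumes that drives tested in parallel do not slow each other down; both are simplifications. The function names and example figures are hypothetical.

```python
import math


def full_test_hours(capacity_gb: float, avg_mb_per_s: float, passes: int = 2) -> float:
    """Hours to transfer the whole drive `passes` times (1 GB = 1000 MB)."""
    seconds = capacity_gb * 1000 / avg_mb_per_s * passes
    return seconds / 3600


def drives_per_day(hours_per_batch: float, drives_per_batch: int,
                   working_hours: float = 24.0) -> int:
    """Whole batches that fit in the working day, times drives per batch."""
    batches = math.floor(working_hours / hours_per_batch)
    return batches * drives_per_batch


# Example: 1000 GB drives at an assumed 100 MB/s average, 16 drives at once.
hours = full_test_hours(1000, 100)      # about 5.6 hours per batch
per_day = drives_per_day(hours, 16)     # 4 batches of 16 drives
```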

Another choice to be made concerning write/read testing is the access method. The most basic access method is sequential – the test starts at LBA 0 and progresses sequentially through to the last block. Another access method is Random. As the name implies, random access moves through the drive's blocks in a random manner. An advantage of random access is that it generates more vibration in the drives under test, which stresses them harder. A new STB Suite access method is CPAM. CPAM creates random access and also guarantees that every block on the drive will be accessed once and only once.
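How CPAM works internally is not described here, so the following is not the STB Suite algorithm. It is just one well-known way to get the same property (a scattered order in which every block is visited exactly once) without holding a shuffled list of every LBA in memory. It uses a full-period linear congruential generator over the next power of two and skips values past the end of the drive. The function name is hypothetical.

```python
import random


def scattered_lba_order(n_blocks: int, seed: int = 0):
    """Yield every LBA in 0..n_blocks-1 exactly once, in a scattered order.

    Illustrative only; this is NOT the STB Suite CPAM implementation.
    """
    # Smallest power of two that covers the LBA range.
    modulus = 1
    while modulus < n_blocks:
        modulus <<= 1

    rng = random.Random(seed)
    # Hull-Dobell conditions for a full period with a power-of-two modulus:
    # the multiplier is 1 mod 4 and the increment is odd.
    multiplier = (rng.randrange(modulus) & ~3) | 1
    increment = rng.randrange(modulus) | 1
    x = rng.randrange(modulus)

    # A full-period generator visits each residue once in `modulus` steps;
    # values beyond the last block are simply skipped.
    for _ in range(modulus):
        if x < n_blocks:
            yield x
        x = (multiplier * x + increment) % modulus
```

The order produced this way is scattered rather than statistically random (the lowest address bit simply alternates), which is acceptable for illustrating coverage but would be a weak choice if you need truly unpredictable seeks.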

Getting started with DMM is covered in earlier articles and in videos on our web site at:

http://www.stbsuite.com/training

The STB Original mode also has a number of canned tests, more appropriate for testing a single drive at a time. Select a drive, then click the top menu Disk-Tests choice pulldown to see a list of available tests – for example, the Quick QC test checks write/read functionality at the beginning, middle, and end of the drive. The Quick Drive Profile Test will show a good overview of the drive.

Is the drive set up appropriately?

SCSI/SAS/FC drive behavior can be specified by Mode Page settings. To see the most common or important Mode Page settings go back to STB Original mode, double-click on your drive, and choose the Mode Page tab.

Settings such as enabling or disabling read ahead and write caching are shown. To change any settings click the Change/Edit Mode Pages button.

Note: in general you will want Write Caching (WCE) to be ON. This will greatly increase the write speed of the drive.
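For reference, WCE is a single bit in the SCSI Caching mode page (page code 08h): bit 2 of byte 2. Bit 0 of the same byte is RCD (Read Cache Disable). The sketch below reads both bits from raw page data. It assumes the bytes you pass in start at the mode page itself, after any mode parameter header and block descriptors have been stripped. The function name is hypothetical.

```python
def caching_flags(page: bytes) -> dict:
    """Report WCE and RCD from a raw Caching mode page (08h)."""
    # The low 6 bits of byte 0 hold the page code; bit 7 is the PS flag.
    if len(page) < 3 or (page[0] & 0x3F) != 0x08:
        raise ValueError("not a Caching mode page (08h)")
    return {
        "WCE": bool(page[2] & 0x04),   # write cache enabled
        "RCD": bool(page[2] & 0x01),   # read cache DISABLED when set
    }
```

Note the inverted sense of RCD: a drive with read caching on reports `RCD` as False.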

There are many settings available via Mode Pages. DMM has a feature whereby you can set up a golden drive with all mode pages set the way you or your customer defines – and then during DMM testing each drive under test will automatically have all of its Mode Pages set to match your golden drive.
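The golden drive idea boils down to comparing each mode page of the drive under test with the same page saved from the golden drive. The sketch below shows only the comparison step, with hypothetical names, and takes each drive's pages as a dictionary of page code to raw page bytes. It is not DMM's implementation; a real tool would also consult the drive's changeable-values mask, since not every mode page byte can be altered.

```python
def mode_page_diffs(golden: dict, candidate: dict) -> dict:
    """Compare mode pages, returning {page_code: differing byte offsets}.

    A page present on the golden drive but missing on the candidate maps to None.
    """
    diffs = {}
    for page_code, gold_bytes in golden.items():
        cand_bytes = candidate.get(page_code)
        if cand_bytes is None:
            diffs[page_code] = None
            continue
        offsets = [i for i, (g, c) in enumerate(zip(gold_bytes, cand_bytes)) if g != c]
        if len(gold_bytes) != len(cand_bytes):
            # Record where the shorter page ends as a difference too.
            offsets.append(min(len(gold_bytes), len(cand_bytes)))
        if offsets:
            diffs[page_code] = offsets
    return diffs
```

An empty result means the drive already matches the golden drive for every page the golden drive reports.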

What is the performance profile of your drive?

For a quick look at the performance of an individual drive you can select it in STB Original mode, then run a test and watch the real-time performance with Drive Watch – here is a view of the write performance of a drive:

DMM will reveal real-time performance metrics as well as logging them to log files. Here is a similar example to the above – a sequential write test:

Note that DMM tells you how long the current test step will take to complete – experiment with this and you will quickly get a feel for estimating how long any given test is going to take to complete.

And the DMM log file for this drive shows:

Summary

Basic testing to confirm overall drive health is easily done with the STB Suite. Use Original mode to examine one drive at a time in depth, and use DMM to test many drives at a time.

STB Original Mode Advantages:

extreme depth of detail available to examine each and every drive setting
versatile set of tests to get a quick snapshot view of drive health

STB Suite DMM Advantages:

multi-threaded high speed multi drive testing
can test multiple HBAs/controllers simultaneously
extremely detailed logs generated for each drive under test
easy to define any type of test sequence
test sequences can be saved and reloaded