Introduction
As discussed in the article “What are you testing”, when you run a DMM test sequence you are actually testing much more than just the drive under test (DUT).
What matters is making everything between the DUT and the HBA as transparent as possible.
That way, when a test fails you can be confident that the problem is actually in the DUT and that you’re not seeing a false failure.
In this article we’ll discuss how to test components which might wear out over time.
What can wear out or change?
As you’d expect, the only things in a test setup that can change over time are the parts that physically move with use: the things that rub, plug, and unplug.
In practice, that boils down to the enclosure slots.
There are two pieces in a typical drive/enclosure setup which can wear out:
- the connector on the drive can, and
- the mating connector in the drive enclosure.
Keeping an eye on these, checking for degradation over time, is possible and should be a part of your test system maintenance.
How to run a checkup
We are going to deal with SAS/SATA enclosures & cans in this article.
Basically, to check on the health of each enclosure slot, you need to periodically run a standard test using a “golden” drive in each slot of your test system enclosure.
The standard test simply generates some I/O traffic using a sequential write and read test, and then records the drive’s Log Pages so we can look at the health of the interface.
This test must be run against a SAS drive, as SAS drives record and can report on the very low-level signal integrity of their interface.
SATA drives do not do this recording/reporting.
And, as in the “What are you testing” line of thought, you want to use a known good or “golden” drive. That way, if you do see any interface errors, you will know that they originate in the slot, not in the drive.
So, find the fastest SAS drive you have (hopefully you are running 12G SAS hardware throughout) and the cleanest one you can find.
By clean we mean you have tested the drive in a known good slot and the drive does not cause any interface errors.
Run the Slot Check test sequence, then, using a plain text editor such as Notepad, open the .log file for each drive/slot.
Look in the log file for the section titled “Page 18 – Protocol Specific Page”.
Here is what a perfect slot will look like:
Page 0x18
Parameter Code = 1
# of phys = 1,
SAS Address = 5000C50096E478BD
Attached SAS Address = 5001422213A8C740
Attached PHY Identifier = 04
Invalid Word Count = 0
Running Disparity Error Count = 0
Loss of DWord Sync = 0
PHY Reset Problem = 0
And here is what a problem slot looks like:
Page 18
Parameter Code = 1
# of phys = 1,
SAS Address = 5000C50058DE5F7D
Attached SAS Address = 500605B00743B063
Attached PHY Identifier = 04
Invalid Word Count = 121E21
Running Disparity Error Count = 121E20
Loss of DWord Sync = 1210B3
PHY Reset Problem = 3
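It is easy: zero error counts = good, lots of errors = bad! Note that values such as 121E21 are hexadecimal, so the counts in the problem slot are well over a million.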
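If you would rather not read every log by eye, the same check can be scripted. What follows is a minimal Python sketch, not part of DMM. It assumes the Page 0x18 section of the .log file uses the "name = value" layout shown in the samples above, and that the counts are printed in hexadecimal. The function names are hypothetical.

```python
# Counter names exactly as they appear in the Page 0x18 samples above.
ERROR_COUNTERS = (
    "Invalid Word Count",
    "Running Disparity Error Count",
    "Loss of DWord Sync",
    "PHY Reset Problem",
)


def parse_phy_counters(section_text):
    """Return {counter name: int} for the error counters found in the text."""
    counters = {}
    for line in section_text.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name in ERROR_COUNTERS:
            # Assumption: counts are hexadecimal (e.g. 121E21).
            counters[name] = int(value.strip().rstrip(","), 16)
    return counters


def slot_is_clean(counters):
    """A slot is clean only if all four counters were found and are zero."""
    return (len(counters) == len(ERROR_COUNTERS)
            and all(value == 0 for value in counters.values()))


# The error counters from the problem slot shown above.
problem_slot = """\
Invalid Word Count = 121E21
Running Disparity Error Count = 121E20
Loss of DWord Sync = 1210B3
PHY Reset Problem = 3
"""

counters = parse_phy_counters(problem_slot)
for name, value in counters.items():
    print(f"{name}: {value}")
print("clean" if slot_is_clean(counters) else "errors present")
```

Treating a section with missing counters as "not clean" is a deliberate choice here: a log that could not be parsed should not be mistaken for a healthy slot.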
Summary
When a test fails you want to know with as much certainty as possible that it was the drive that failed, not the slot.
Running SlotCheck periodically and comparing results with past results will give you that certainty.
Run the Slot Test once a month.
Keep a record of the error counts from Log Page 0x18.
Compare the results of previous test runs with the current run, and note if the error rate is increasing.
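The month-to-month comparison can be sketched in a few lines of Python. This is only an illustration: the record layout (one dictionary of error counts per run, per slot) and the numbers in it are hypothetical, and you would substitute the counts you recorded from Log Page 0x18.

```python
def rising_counters(previous, current):
    """Return {name: (old, new)} for every counter that is higher in the current run."""
    return {
        name: (previous.get(name, 0), value)
        for name, value in current.items()
        if value > previous.get(name, 0)
    }


# Hypothetical monthly records for one slot, oldest first.
history = [
    {"Invalid Word Count": 0, "Running Disparity Error Count": 0},
    {"Invalid Word Count": 4, "Running Disparity Error Count": 0},
    {"Invalid Word Count": 37, "Running Disparity Error Count": 2},
]

# Compare each run with the one before it.
for older, newer in zip(history, history[1:]):
    changes = rising_counters(older, newer)
    print(changes if changes else "no increase")
```

A slot whose counts rise on every comparison is the one to investigate first.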
If you do find a slot where the error counts are climbing you should test to make sure that the errors are slot-based and not drive-based.
To do this, move the drive from the suspect slot to another slot and run the slot test again.
Did the bad result follow the drive?
If so, you may have a problem with that drive rather than with the slot.
If not, the slot is the likely culprit.
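The swap test above reduces to a simple decision, sketched here in Python. The function name and the sample counts are hypothetical; the input is the set of error counts recorded for the same drive after it has been moved to a different slot and retested.

```python
def locate_fault(rerun_counts):
    """Decide where the fault most likely lies after the swap test.

    rerun_counts: error counts for the same drive, retested in a different slot.
    """
    if any(value > 0 for value in rerun_counts.values()):
        # The errors followed the drive to the new slot.
        return "drive"
    # The drive is clean elsewhere, so the original slot is suspect.
    return "slot"


# Hypothetical reruns in a second slot.
print(locate_fault({"Invalid Word Count": 0, "Loss of DWord Sync": 0}))
print(locate_fault({"Invalid Word Count": 12, "Loss of DWord Sync": 0}))
```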
If you do find that you have a bad or failing slot, you can either:
- cover that slot and stop using it, or
- replace the enclosure.