June 2017

STB Suite | The Industry Standard in Peripheral Testing.

In this Issue:


Ask Dr. SCSI – Why don’t I see any devices to test with the STB Suite?

Q. “I recently installed the STB Suite and I don’t see any devices to test. Dr. SCSI, do you know why?”

A. This typically means the STBTrace driver didn’t properly get installed on your system.

The STBTrace driver, which handles some of the basic IO operations, can be installed via the executable InstallSTBTrace_V840.exe, which is included in the STB folder.

Run InstallSTBTrace_v840.exe and verify the status of the driver:

Typically this driver will fail to install if STB was installed without being logged into the test system Locally and as Administrator.


What are you really testing? Drives, Enclosures, Cables, HBAs, or EVERYTHING?

Your goal – test a drive, or a bunch of drives.

So, you connect the drive(s) to your test system, run the STB Suite, and look at results. Some drives pass, others fail. Or all pass. Or fail.

But wait!

What passed? Or, more importantly, in the case of failure, what failed?

The drive you tested? Or something else?

What exactly are you testing when you test that drive? Well, in actuality you are testing every component or cable or device between the STB Suite software and the drive you ultimately need to test. Look at this drawing –

It shows your test environment with these components:

  1. Test System (computer)
  2. HBA
  3. Cable
  4. Enclosure

Four pieces of equipment so far… then within the enclosure you will either have

  1. Backplane with connectors
  2. and disks

Or

  1. RAID/JBOD active circuitry
  2. Enclosure services circuitry
  3. Other monitoring circuits
  4. Backplane
  5. And disks


Then finally

Either item 6 or item 9 in the chain: the disk(s) you want to test.

So – either 6 or 9 “things” in the data path, all “being tested” during your disk test.

In other words – 6 or 9 different pieces of equipment, all of which could fail and make you think that a perfectly good disk drive has a problem when in fact it doesn’t!

When test results show that a drive failed when in fact it didn’t, that is known as a false negative.

And false negative results can cost you time, money, and customer goodwill, so they should be avoided whenever possible.

The way to avoid false negatives, and to ensure that your test results reflect the true state of the disks, is to make certain that every piece of equipment or cable between the disk and the HBA is as transparent and perfect as possible.

Every component or cable needs to be perfect. That includes using cables designed for the speed of the hardware they are connecting.

Scrimping anywhere along the data path is a recipe for false negative results! The first way to avoid them is to simplify! Components or equipment which isn’t there can’t fail!

Simplify! The setup that doesn’t have RAID, Enclosure processors, other monitors, etc. is going to give better results than one with all these non-essential extras.

Q: Why do we recommend that you do NOT have any RAID components in your test system, even if they have a JBOD mode?

A: Because fewer components in the test path is better.

So with that in mind, the test setup above will be inherently more reliable if you set it up as shown with the drives to the right of the enclosure, rather than the more complicated setup where the drives are below the enclosure.

Less is better!

More complexity anywhere along the data path =  a recipe for false negatives.

Keep your test system as simple as possible.

Eliminate as much from the data path as you can.

Keep in mind what you are testing in your test setup – simplify as much as you can for the best consistent results!


Test System Checkup

Introduction

As mentioned in the article “What are you testing” we know that when you run a DMM test sequence you are actually testing much more than just the drive under test (DUT).

What is important is to make as much of the stuff between the DUT and the HBA as transparent as possible.

That way, when a test fails you can be assured that the problem is actually in the DUT and that you’re not seeing a false negative.

In this article we’ll discuss how to test components which might wear out over time.

What can wear out or change?

As you’d figure, the only things in a test setup that can change over time are things that move, rub, plug, and unplug.

Pieces of the test setup which physically move with use.

And what that boils down to is – enclosure slots.

There are two pieces in a typical drive/enclosure setup which can wear out:

  1. the connector on the drive, and
  2. the mating connector in the drive enclosure.

 

Keeping an eye on these, checking for degradation over time, is possible and should be a part of your test system maintenance.

How to run a checkup

We are going to deal with SAS/SATA enclosures & cans in this article.

Basically, to check on the health of each enclosure slot, you need to periodically run a standard test using a “golden” drive in each slot of your test system enclosure.

The standard test will simply generate some I/O traffic using a simple sequential write and read test.

And then record the drives’ Log Pages so we can look at the health of the interface.

This test must be run against a SAS drive, as SAS drives record and can report on the very low-level signal integrity of their interface.

SATA drives do not do this recording/reporting.

And, as in the “What are you testing” line of thought, you want to use a known good or “golden” drive. This is so if you do see any interface errors you will know that they originate in the slot, not in the drive.

So – find a SAS drive, the fastest you have (hopefully you are running with 12G SAS hardware throughout), and the cleanest you can find.

By clean we mean you have tested the drive in a known good slot and the drive does not cause any interface errors.

Run the Slot Check test sequence, then using a plain-old text editor like Notepad open the .log file for each drive/slot.

Look in the log file for the section titled “Page 18 – Protocol Specific Page”.

Here is what a perfect slot will look like –

Page 0x18
Parameter Code = 1
# of phys = 1,
SAS Address = 5000C50096E478BD
Attached SAS Address = 5001422213A8C740
Attached PHY Identifier = 04
Invalid Word Count = 0
Running Disparity Error Count = 0
Loss of DWord Sync = 0
PHY Reset Problem = 0

And here is what a problem slot looks like –

Page 18
Parameter Code = 1
# of phys = 1,
SAS Address = 5000C50058DE5F7D
Attached SAS Address = 500605B00743B063
Attached PHY Identifier = 04
Invalid Word Count = 121E21
Running Disparity Error Count = 121E20
Loss of DWord Sync = 1210B3
PHY Reset Problem = 3

 

It is easy – zero error counts = good, lots of errors = bad!
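This check is easy to script as well. Here is a minimal sketch in Python, assuming the Page 0x18 text layout shown in the log excerpts above (the function names are ours, not part of the STB Suite):

```python
import re

# The four PHY error counters reported in Log Page 0x18; all four
# should stay at zero on a healthy slot.
PHY_COUNTERS = [
    "Invalid Word Count",
    "Running Disparity Error Count",
    "Loss of DWord Sync",
    "PHY Reset Problem",
]

def parse_page18(log_text):
    """Pull the four PHY error counters out of a Page 0x18 section.

    The counts in the log are printed in hex (e.g. 121E21), so they are
    parsed base 16.  Returns a dict of counter name -> integer count.
    """
    counts = {}
    for name in PHY_COUNTERS:
        m = re.search(re.escape(name) + r"\s*=\s*([0-9A-Fa-f]+)", log_text)
        if m:
            counts[name] = int(m.group(1), 16)
    return counts

def slot_is_clean(counts):
    """Zero error counts = good, anything else = bad."""
    return all(v == 0 for v in counts.values())
```

Feed each slot’s .log file through `parse_page18` and any slot where `slot_is_clean` returns False deserves a closer look.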

Summary

When a test fails you want to know with as much certainty as possible that it was the drive that failed, not the slot.

Running SlotCheck periodically and comparing results with past results will give you that certainty.

Run the SlotCheck test once a month.

Keep a record of the error counts from Log Page 0x18.

Compare the results of previous test runs with the current run, and note if the error rate is increasing.
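The month-over-month comparison can also be scripted. A minimal sketch in Python, where the record layout (slot name mapped to a dict of Log Page 0x18 counter values) is our own assumption:

```python
# Hypothetical record layout: one dict per SlotCheck run,
# mapping slot name -> {counter name: count from Log Page 0x18}.
def rising_slots(previous, current):
    """Return the slots whose error counts grew since the last run."""
    flagged = {}
    for slot, counts in current.items():
        prev = previous.get(slot, {})
        grew = {name: count - prev.get(name, 0)
                for name, count in counts.items()
                if count > prev.get(name, 0)}
        if grew:
            flagged[slot] = grew
    return flagged

last_month = {"slot 4": {"Invalid Word Count": 0}}
this_month = {"slot 4": {"Invalid Word Count": 0x121E21}}
print(rising_slots(last_month, this_month))  # flags slot 4
```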

If you do find a slot where the error counts are climbing you should test to make sure that the errors are slot-based and not drive-based.

To do this move the drive from the suspect slot to another slot and run the slot test again.

Did the bad result follow the drive?

If so you may have a problem with that drive rather than with the slot.

If you do find that you have a bad or failing slot, you can either

- cover that slot and don’t use it anymore, or

- replace the enclosure.

 



Using BAM to uncover a difficult error

BAM for troubleshooting

Recently here at STB we were working with a customer on an error while formatting a disk drive – the Format was failing very early in the process (at about the 30-second mark).  We had cases where the Format would fail, and other cases where it would succeed.  At first glance the two scenarios looked identical: the CDB being issued was the same!  That is, the I/O going to the exact same disk drive was identical, yet one time it failed while another time it succeeded.

We took a BAM trace, making sure to capture the “SRB Phase”, and discovered what the problem was – the timeout on the Format command was different in the two cases!  One was set for 108000 seconds (30 hours), while the other was set for 30 seconds.

Let’s take a look at the BAM traces – first for the one that succeeded:

In the above BAM trace, you can see the Format command going to the drive (if you look at the “Ctr” column the Format command has Ctr = 50).  The associated “SRB” phase for the Format command has Ctr = 52.  In your trace, if you use your mouse and select a row, the data for that row will be displayed in the “Raw Data” tab in the bottom half of the BAM main screen.  So we have selected the “SRB” row (i.e. the row with Ctr = 52) and the data is displayed in the “Raw Data” tab.  We know that at offset 0x0014 of the SRB phase is the timeout value – in our case it is “E0 A5 01 00” (decimal 108000 seconds, or 30 hours).  This Format succeeds.

Now let’s look at the BAM trace of the Format command where the Format fails:

In the above BAM trace, the Format has Ctr = 56, and its associated SRB phase has Ctr = 59.  Looking at the “Raw Data” tab, at offset 0x0014 (where the timeout value is) we see the value “1E 00 00 00” (decimal 30).  So, because the timeout value was only 30 seconds, and the Format took much, much longer than 30 seconds, the Format failed.
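Decoding that timeout field by hand is error-prone; it is a 32-bit little-endian value, as this small Python sketch shows (the helper name is ours, not a BAM API):

```python
import struct

def srb_timeout_seconds(srb_raw):
    """Decode the 32-bit little-endian timeout stored at offset 0x14
    of the SRB raw data captured in a BAM trace."""
    return struct.unpack_from("<I", srb_raw, 0x14)[0]

# The two timeout fields from the traces above, preceded by 0x14 pad bytes:
passing = bytes(0x14) + bytes.fromhex("E0A50100")  # "E0 A5 01 00"
failing = bytes(0x14) + bytes.fromhex("1E000000")  # "1E 00 00 00"

print(srb_timeout_seconds(passing))  # 108000 seconds = 30 hours
print(srb_timeout_seconds(failing))  # 30 seconds
```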


SCSI Toolbox’s Endurance/Stress Testing Application

STB Suite JEDEC Application

What is the SCSI Toolbox JEDEC Application?

SCSI Toolbox’s new JEDEC application implements the Endurance/Stress Testing of solid state drives as outlined in the JEDEC documents JESD218A and JESD219.  These documents specify an extremely complex I/O generation pattern: various transfer sizes, each occurring with a specific probability; particular sections of the drive targeted with varying probabilities; and transfers of 4K or larger aligned on 4K boundaries.  In addition to these complex requirements, when testing more than one drive, each drive must have its targeted sections shifted by 5%.  The details of all these requirements are discussed in the sections below.
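The alignment and shift rules can be illustrated with a short Python sketch. The block size and helper names here are our own assumptions for illustration, not STB’s implementation:

```python
import random

BLOCKS_PER_4K = 8  # assuming 512-byte logical blocks

def pick_lba(region_start, region_end, xfer_blocks, rng=random):
    """Pick a start LBA inside [region_start, region_end); transfers of
    4K or more are snapped back to a 4K boundary, per the workload rules.
    Assumes region_start is itself 4K-aligned."""
    lba = rng.randrange(region_start, region_end - xfer_blocks + 1)
    if xfer_blocks >= BLOCKS_PER_4K:
        lba -= lba % BLOCKS_PER_4K
    return lba

def shift_region(region_start, capacity_blocks, drive_index):
    """Shift drive N's target region start by N * 5% of capacity, wrapping
    at the end of the drive."""
    return (region_start + drive_index * capacity_blocks * 5 // 100) % capacity_blocks
```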

 

What does the SCSI Toolbox JEDEC Application do?

The SCSI Toolbox JEDEC Application issues 72 different types of I/Os!  Each type of I/O consists of the following information:

  1. Write or Read
  2. Transfer Size (1 block all the way thru 128 blocks)
  3. Section of drive to target (first 5% of drive, next 15% of drive, final 80% of drive)
  4. Probability of the I/O – that is, how often the I/O must be issued

Below are some examples of what each I/O “looks” like:

Example 1: Write, 1 Block, target first 5% of drive, probability = 1%
Example 2: Read, 4 Blocks, target section 5-to-20% of drive, probability = 0.15%
Example 3: Write, 8 Blocks, target final 80% of drive, probability = 6.7%

How are we getting these probabilities?  In the JEDEC documents they specify that the first 5% of the drive must be targeted 50% of the time, the next 15% of the drive must be targeted 30% of the time, and the final 80% of the drive must be targeted 20% of the time.  In addition to these probabilities, the probabilities assigned to each transfer size are as follows:

Transfers of 1 Block must have 4% probability
Transfers of 2 Blocks must have 1% probability
Transfers of 3 Blocks must have 1% probability
Transfers of 4 Blocks must have 1% probability
Transfers of 5 Blocks must have 1% probability
Transfers of 6 Blocks must have 1% probability
Transfers of 7 Blocks must have 1% probability
Transfers of 8 Blocks must have 67% probability
Transfers of 16 Blocks must have 10% probability
Transfers of 32 Blocks must have 7% probability
Transfers of 64 Blocks must have 3% probability
Transfers of 128 Blocks must have 3% probability

And finally, Writes must have 50% probability while Reads must also have 50% probability.

Putting all of these together, let’s see how we got the probability in our three examples above.

Example 1: The Write must occur with 50% probability, 1 block transfers must occur with 4% probability, and the first 5% of the drive must occur with 50% probability.  Multiplying these out we get

(0.5) * (0.04) * (0.5) = 0.01 (which is 1% probability)

Example 2: The Read must occur with 50% probability, 4 block transfers must occur with 1% probability, and the section of the drive from the 5% mark to the 20% mark of the drive must occur with 30% probability.  Multiplying these out we get

(0.5) * (0.01) * (0.3) = 0.0015 (which is 0.15% probability)

Example 3: The Write must occur with 50% probability, 8 block transfers must occur with 67% probability, and the final 80% of the drive must occur with 20% probability.  Multiplying these out we get

(0.5) * (0.67) * (0.2) = 0.067 (which is 6.7% probability)
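The full 72-entry probability table follows directly from multiplying the three factors in the examples above, and it is easy to sanity-check in Python that the probabilities cover every case and sum to 100%:

```python
from itertools import product

# Probabilities quoted above: operation, transfer size (blocks), drive region.
op_p     = {"Write": 0.50, "Read": 0.50}
size_p   = {1: 0.04, 2: 0.01, 3: 0.01, 4: 0.01, 5: 0.01, 6: 0.01, 7: 0.01,
            8: 0.67, 16: 0.10, 32: 0.07, 64: 0.03, 128: 0.03}
region_p = {"first 5%": 0.50, "next 15%": 0.30, "final 80%": 0.20}

# 2 operations x 12 transfer sizes x 3 regions = 72 distinct I/O types.
table = {(op, size, region): op_p[op] * size_p[size] * region_p[region]
         for op, size, region in product(op_p, size_p, region_p)}

assert len(table) == 72
assert abs(sum(table.values()) - 1.0) < 1e-9   # the 72 probabilities sum to 100%
print(round(table[("Write", 8, "final 80%")], 4))  # ~0.067, matching Example 3
```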

 

How does the SCSI Toolbox JEDEC Application guarantee all the assigned probabilities?

The SCSI Toolbox JEDEC Application issues 72 different types of I/Os, which means the application must guarantee 72 probabilities, one for each I/O!  The engineers at SCSI Toolbox have developed a function that uniquely maps random numbers to each of the 72 I/Os.  It is beyond the scope of this document to describe how this function works, but suffice it to say that each of the 72 I/Os is chosen “randomly” with its assigned probability.  One cannot guess what sequence of I/Os will be generated.  As an example, after running the application for, say, 1,000,000,000 I/Os (1 billion I/Os), approximately 6.7% of them (or 67,000,000 I/Os) will fit the profile described in Example 3.
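STB’s mapping function is their own; as an illustration only, here is one standard way to draw I/O descriptors with assigned probabilities in Python, using the three example I/O types from above:

```python
import random
from collections import Counter

# Illustration only: a generic weighted sampler, NOT STB's mapping function.
io_types = [
    ("Write", 1,  "first 5%"),   # Example 1
    ("Read",  4,  "next 15%"),   # Example 2
    ("Write", 8,  "final 80%"),  # Example 3
]
weights = [0.01, 0.0015, 0.067]  # probabilities from the examples above

rng = random.Random(0)           # seeded so the run is repeatable
draws = Counter(rng.choices(io_types, weights=weights, k=100_000))

# Among these three types, Write/8-block/final-80% should dominate,
# appearing roughly 0.067 / (0.01 + 0.0015 + 0.067) ~ 85% of the time.
print(draws.most_common(1)[0][0])
```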

For more information or if you have questions about the JEDEC Test Software contact SCSI Toolbox.


What is Performa?

Performa is the STB Suite annual support and maintenance plan.

In most cases each purchase of the STB Suite includes 12 months of Performa coverage.

What does that coverage include?

  • Updates to the STB Suite
    • There are typically two major updates to the STB Suite per year. In between these major updates there are typically a number of maintenance updates, which fix bugs and occasionally introduce new features.
    • With Performa coverage you are entitled to all of these.
  • Product Support
    • Performa coverage provides you with contact with our development team, to answer questions, discuss changes or improvements, etc. With decades of storage experience our support team is willing and able to help you.

      Our World-class support typically responds to email support issues within one hour!

  • New License discounts
    • SCSI Toolbox now offers attractive discounts on new licenses when you keep your licenses covered by the Performa program.
        • With 1-3 licenses actively covered, you’ll receive a 10% Performa Discount on New licenses.
        • With 4-10 licenses actively covered, you’ll receive a 15% Performa Discount on New licenses.
        • With 11-20 licenses actively covered, you’ll receive a 20% Performa Discount on New licenses.