STB Suite

Using BAM to uncover a difficult error

Recently here at STB we were working with a customer concerning an error while formatting a disk drive – the error was the format was failing very early on in the process (at about the 30 second mark).  We had cases where the Format would fail, and other cases where the Format would succeed.  At first glance at the two scenarios where the Format failed and the other where it succeeded, the CDB being issued was identical!!  That is the I/O going to the exact same disk drive was identical, but yet one time it was failing, while another time it was succeeding.

We took a BAM trace, making sure to capture the “SRB Phase”, and discovered what the problem was – the timeout on the Format command were different in the two cases!  One was set for 108000 seconds (or 30 hours), while the other was set for 30 seconds.

Let’s take a look at the BAM traces – first for the one that succeeded:

In the above BAM trace, you can see the Format command going to the drive (if you look at the “Ctr” column the Format command has Ctr = 50).  The associated “SRB” phase for the Format command has Ctr = 52.  In your trace, if you use your mouse and select a row, the data for that row will be displayed in the “Raw Data” column in the bottom-half of the BAM main screen.  So we have selected the “SRB” row (i.e. the row with Ctr = 52) and the data is displayed in the “Raw Data” tab.  We know that at offset 0x0014 of the SRB phase is the timeout value – in our case it is “E0 A5 01 00” (which is decimal 108000 seconds or 30 hours).  This Format succeeds.

Now let’s look at the BAM trace of the Format command where the Format fails:

In the above BAM trace, the Format has Ctr = 56, and its associated SRB phase has Ctr = 59.  Looking at the “Raw Data” tab, we see at offset 0x0014 (i.e. where the timeout value is) we see the value “1E 00 00 00” (which is decimal 30).  So, because the timeout value was only 30 seconds, and the Format took much-much longer than 30 seconds, the format failed.