Discussion:
ZOL + SATA + Interposers + SAS Expanders is bad Mkay
Dead Horse
2013-10-10 15:22:19 UTC
Permalink
History has a way of repeating itself, this time with ZFS on Linux instead
of Solaris. In a nutshell, using SATA drives + interposers with a SAS
expander + ZOL ends up in all kinds of fun and games. I have spent a few
hours here and there on a couple of lab servers this past week (J4400
Array + Seagate Moose Drives + LSI SAS 9207-8E). I looked at some ways to
try and mitigate or fix this issue with ZOL. However, digging under the hood
into this not-so-new issue + ZOL shows that it is actually far more
sinister under Linux than it was with Solaris.

Read on for the gory details. (log file attached for the curious as well)

This issue is fundamentally caused by the SATA --> SAS protocol translation
via interposers to a SAS expander. The irony here is that it takes ZFS to
supply the IO loading needed to trip it up easily. Using MD+LVM or HW raid
+ LVM with this type of setup was never able to create the perfect storm.
However, I should also note some irony in the fact that I did try using
BTRFS on this setup and, like ZFS, it was also able to bring this issue out.

Basically what is happening here is that a hardware error occurs, be it a
read error or a drive overwhelmed with cache-sync commands, and a device or
bus reset is issued. What happens next is that the single reset turns into
**MANY** resets. The initial reset on the SAS expander causes all
in-progress IO operations to abort. It should then generate a proper SAS
hardware error/sense code. Here is where the problem starts. The interposers
instead lose things in the SATA --> SAS protocol translation and return a
generic hardware error/sense code. The Linux (sg) driver then steps in and
tries to be helpful by issuing another reset in an effort to right the ship.
Given this scenario, if you have a lot of IO going on, EG: the kind of IO
that ZFS can generate, with a disk subsystem like this the ship never rights
itself and instead makes a quick trip to Davey Jones...
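
(As an aside, and purely as an illustrative sketch rather than anything taken
from these systems: fixed-format SCSI sense data carries the sense key in the
low nibble of byte 2 and the ASC/ASCQ pair in bytes 12 and 13, which is why a
drive or interposer that answers every aborted command with a bare HARDWARE
ERROR and ASC/ASCQ 00h/00h leaves the initiator nothing to act on except
"reset and hope".)

/* Illustrative only: decoding fixed-format SCSI sense data. */
#include <stdint.h>
#include <stdio.h>

#define SENSE_KEY_HARDWARE_ERROR 0x04

static void describe_sense(const uint8_t *sense, int len)
{
    if (len < 14)
        return;

    uint8_t key  = sense[2] & 0x0f;   /* sense key: low nibble of byte 2 */
    uint8_t asc  = sense[12];         /* additional sense code */
    uint8_t ascq = sense[13];         /* additional sense code qualifier */

    if (key == SENSE_KEY_HARDWARE_ERROR && asc == 0x00 && ascq == 0x00)
        printf("generic HARDWARE ERROR: no hint as to what actually failed\n");
    else
        printf("sense key 0x%02x, ASC/ASCQ 0x%02x/0x%02x\n", key, asc, ascq);
}

int main(void)
{
    /* Example: the kind of opaque answer described above. */
    uint8_t generic[18] = { 0x70, 0, SENSE_KEY_HARDWARE_ERROR, 0, 0, 0, 0, 10 };
    describe_sense(generic, sizeof(generic));
    return 0;
}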

In Solaris this could be mitigated or worked around via setting:

allow-bus-device-reset=0; in sd.conf.

Setting this does not prevent resets issued elsewhere, but it specifically
disables the reset that sd issues in response to generic hardware and media
errors.

code snippet from Solaris sd.c:

if ((un->un_reset_retry_count != 0) &&
    (xp->xb_retry_count == un->un_reset_retry_count)) {
    mutex_exit(SD_MUTEX(un));
    /* Do NOT do a RESET_ALL here: too intrusive. (4112858) */
    if (un->un_f_allow_bus_device_reset == TRUE) {

        boolean_t try_resetting_target = B_TRUE;

        /*
         * We need to be able to handle specific ASC when we are
         * handling a KEY_HARDWARE_ERROR. In particular
         * taking the default action of resetting the target may
         * not be the appropriate way to attempt recovery.
         * Resetting a target because of a single LUN failure
         * victimizes all LUNs on that target.
         *
         * This is true for the LSI arrays, if an LSI
         * array controller returns an ASC of 0x84 (LUN Dead) we
         * should trust it.
         */

        if (sense_key == KEY_HARDWARE_ERROR) {
            switch (asc) {
            case 0x84:
                if (SD_IS_LSI(un)) {
                    try_resetting_target = B_FALSE;
                }
                break;
            default:
                break;
            }
        }

        if (try_resetting_target == B_TRUE) {
            int reset_retval = 0;
            if (un->un_f_lun_reset_enabled == TRUE) {
                SD_TRACE(SD_LOG_IO_CORE, un,
                    "sd_sense_key_medium_or_hardware_"
                    "error: issuing RESET_LUN\n");
                reset_retval =
                    scsi_reset(SD_ADDRESS(un),
                    RESET_LUN);
            }
            if (reset_retval == 0) {
                SD_TRACE(SD_LOG_IO_CORE, un,
                    "sd_sense_key_medium_or_hardware_"
                    "error: issuing RESET_TARGET\n");
                (void) scsi_reset(SD_ADDRESS(un),
                    RESET_TARGET);
            }
        }
    }


Accordingly, the corresponding code from the Linux kernel (sg) driver.

code snippet from sg.c:

case SG_SCSI_RESET:
    if (sdp->detached)
        return -ENODEV;
    if (filp->f_flags & O_NONBLOCK) {
        if (scsi_host_in_recovery(sdp->device->host))
            return -EBUSY;
    } else if (!scsi_block_when_processing_errors(sdp->device))
        return -EBUSY;
    result = get_user(val, ip);
    if (result)
        return result;
    if (SG_SCSI_RESET_NOTHING == val)
        return 0;
    switch (val) {
    case SG_SCSI_RESET_DEVICE:
        val = SCSI_TRY_RESET_DEVICE;
        break;
    case SG_SCSI_RESET_TARGET:
        val = SCSI_TRY_RESET_TARGET;
        break;
    case SG_SCSI_RESET_BUS:
        val = SCSI_TRY_RESET_BUS;
        break;
    case SG_SCSI_RESET_HOST:
        val = SCSI_TRY_RESET_HOST;
        break;
    default:
        return -EINVAL;
    }
    if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
        return -EACCES;
    return (scsi_reset_provider(sdp->device, val) ==
        SUCCESS) ? 0 : -EIO;
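
For reference, the switch above services the SG_SCSI_RESET ioctl, so the same
values can be driven from user space against an sg node. A minimal, purely
illustrative sketch of that path follows (the /dev/sg0 node is just an
example); note it requires both CAP_SYS_ADMIN and CAP_SYS_RAWIO, as the
capability check in the snippet shows:

/* Illustrative userspace sketch: ask the SCSI mid-layer for a device
 * reset via the sg driver. The device node is an example only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    int fd = open("/dev/sg0", O_RDWR);  /* example sg node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int val = SG_SCSI_RESET_DEVICE;     /* request a device-level reset */
    if (ioctl(fd, SG_SCSI_RESET, &val) < 0)
        perror("SG_SCSI_RESET");

    return 0;
}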

The difference with Linux is that there appears to be no way to stop (sg)
from resetting a device (or devices) in response to generic hardware/media
errors/unhandled sense codes (EG: via sysfs attributes). Consequently, once
the latter sets in and ends up in a bus-wide reset, it's a lost cause.

At the end of the day the best thing to do is ultimately to use SAS or
NL-SAS and avoid SATA+Interposers like the plague.


- DHC (AKA: GEHC CST)

Marcus Sorensen
2013-10-10 15:35:45 UTC
Permalink
Nice writeup. We just noticed those LSI interposers the other day and
thought "cool!". Although we haven't really had a need for them since
our SAS expanders accept SATA drives, we thought it would be nice to
enable the dual channel at a lower cost than SAS drives.

Dead Horse
2013-10-10 15:39:39 UTC
Permalink
Spring for NL-SAS drives; they will cost less than or the same as Enterprise
SATA + interposer and will ensure you avoid the above rather unfortunate
situation.
- DHC
Andreas Dilger
2013-10-11 02:14:23 UTC
Permalink
History has a way of repeating itself, this time with ZFS on Linux instead of Solaris. In a nutshell using SATA drives + interposers with a SAS expander + ZOL ends up in all kinds of fun and games.
[snip]
Accordingly the code from the Linux kernel (sg) driver
Do you know what error was returned from the SCSI layer by Linux?
The difference with Linux is that there appears to be no apparent way to disable (sg) from resetting a device(s) in response to generic hardware/media errors/unhandled sense codes. (EG: sysfs attributes) Thus consequently once the latter sets in and ends up in a bus wide reset it's a lost cause.
I'm also a bit confused (or clueless, SCSI layer isn't really my thing) why the SG code is under discussion here instead of the SD code? I thought that it was possible (after many years of waiting) to disable all of the SCSI and block layer retries? Without that, things like MD multipath wouldn't be very useful.
At the end of the day the best thing to do is ultimately to use SAS or NL-SAS and avoid SATA+Interposers like the plague.
It might also be useful to see if this could be fixed in the Linux code in a similar manner to Solaris.

Cheers, Andreas

Richard Yao
2013-10-11 03:02:17 UTC
Permalink
Post by Andreas Dilger
I'm also a bit confused (or clueless, SCSI layer isn't really my thing) why the SG code is under discussion here instead of the SD code? I thought that it was possible (after many years of waiting) to disable all of the SCSI and block layer retries? Without that, things like MD multipath wouldn't be very useful.
I believe that the sg code is the Linux equivalent of the Illumos sd code.
Andrew Galloway
2013-10-11 17:12:16 UTC
Permalink
While this is very unfortunate, I can't help but feel a little vindicated.
I can't tell you how many Linux fanbois (and this hurt me especially, as I
like to think of myself as one) tore me apart over the past few years, both
online and in person, when I would suggest they avoid SATA drives behind
SAS expanders/interposers/etc with ZFS "like the plague" (as you say),
telling me that it /must/ be a Solaris bug because, and I'm quoting 100's
of people here, some of them practically screaming angry at me like I just
shot their dog, "Linux never has problems with SATA over SAS, I do it all
the time". :(

Now with my "I told you so" over, something even less useful: For the
record, just adding that allow-bus-device-reset=0 will not fix all the SATA
over SAS issues we've seen in the field on Solaris-based systems. It helps
significantly in some scenarios, and not at all in others. Only one of the
problems that fall under what I generally refer to as the "SATA over SAS
problem", of which there were more than a few, is that one bad disk causes
entire bus resets. There are a number of other weirdo scenarios that crop up
(and
unfortunately I haven't been dealing with them daily like I used to, so my
memory has gone). It got to the point where we stopped trying to pinpoint
the problems and fix them, and started simply denying use of SATA behind
SAS for commercially supported systems altogether (thus why I haven't had
to deal with it in the past year and a half or more). That point was well
after we'd engaged with hardware vendors, firmware coders, various disk
drive manufacturers, and so on. The general gist was that the combination
is just flat out toxic. We got unofficial confirmation of problems from
nearly every manufacturer we talked to, or so I'm led to believe.

Worse yet, we were actually somewhat homogeneous in terms of hardware here,
despite at the time technically supporting all-comers - it was, 99% of the
time, SATA drives behind LSI-based SAS expanders/interposers, usually
within a small subset of chassis hardware from vendors like Dell and
SuperMicro. Even given only a dozen or so JBOD's from only 2-3 vendors, and
3-4 HBA cards all LSI in play, on probably only a dozen motherboards or so
(again from only 2-3 vendors), we still couldn't even fully quantify the
problems, much less fix them all. I can't imagine what we may have run into
on a broader hardware set. It was also incredibly intermittent. There are
/still/ customers running hardware setups that bit other customers and yet
have never seen an issue in 3+ years, and others where the problems have
been manageable -- yet with the same hardware, other customers ended up
having to give up and completely replace with NL-SAS. I have long suspected
this to be related to workload, but not having direct and constant access
to all of the systems, I could not prove that out. (And please, to anyone
about to yell at me or Dead Horse or otherwise that THEIR setup works,
please read that last bit -- some people just never had a problem, more
power to them. That doesn't mean it isn't a problem.)

I mention this not to scare off anyone who wants to tackle this issue - but
to suggest that adding in that one bus reset code fix is not going to
magically resolve all problems and make SATA over SAS with ZFS work fine. I
just wanted to put that out there so nobody gets bit. :(

- Andrew
www.nex7.com
Gordan Bobic
2013-10-11 17:51:55 UTC
Permalink
I've seen enough problems with every denomination of drive, controller and
combination thereof that I wouldn't specifically single any one particular
combination out. The simple fact is that they all suck. I've had issues
with LSI SAS cards and SATA drives, with nothing else between them. I've
seen LSI and Adaptec cards that fail spectacularly when IOMMU support is
enabled because they seem to contain phantom bridges that try to do DMA,
and get blocked by the IOMMU because they don't show up anywhere and thus
don't get set up (they aren't in lspci, for example, yet the IOMMU throws
errors about device IDs that supposedly don't exist).

If you want an easy life, use the on-board SATA ports connected to the SB
(Intel or AMD, never had a single issue with either). Second least
problematic are cheap SATA controllers. Something like SIL3132. Yes, it's
crap on the performance front, but by and large it'll just work, including
with SIL3726 port multipliers. Marvell SATA controllers I've also not had
problems with when driving disks directly. But try plugging in a port
multiplier and things get weirder - with a single port expander it works
fine. Plug in another one into the second port, and only the devices on the
first port show up. Plug in one expander and a disk directly to the second
port, and they all show up.

As soon as you start using SAS controllers (my experience is mainly with
LSI ones) things start to creak in various intermittent and hard to
reproduce ways.

And frequently the problem is an almost-but-not-quite-duff disk. This sort
of issue takes a while to work out, as it can take some time to establish
and verify that instability follows a disk, rather than something else.
Andrew Galloway
2013-10-11 20:07:13 UTC
Permalink
Post by Gordan Bobic
I've seen enough problems with every denomination of drive, controller and
combination thereof that I wouldn't specifically single any one particular
combination out. They simple fact is that they all suck. I've had issue
with LSI SAS cards and SATA drives, with nothing else between them. I've
seen LSI and Adaptec cards that fail spectacularly when IOMMU support is
enabled because they seem to contain phantom bridges that try to do DMA,
and get blocked by the IOMMU because they don't show up anywhere and thus
don't get set up (they aren't in lspci, for example, yet IOMMU throws error
about device IDs that supposedly don't exist).
Hah! I've never seen that one. Nice.
Post by Gordan Bobic
If you want an easy life, use the on-board SATA ports connected to the SB
(Intel or AMD, never had a single issue with either). Second least
problematic are cheap SATA controllers. Something like SIL3132. Yes, it's
crap on the performance front, but by and large it'll just work, including
with SIL3726 port multipliers. Marvell SATA controllers I've also not had
problems with when driving disks directly. But try plugging in a port
multiplier and things get weirder - with a single port expander it works
fine. Plug in another one into the second port, and only the devices on the
first port show up. Plug in one expander and a disk directly to the second
port, and they all show up.
As soon as you start using SAS controllers (my experience is mainly with
LSI ones) things start to creak in various intermittent and hard to
reproduce ways.
And frequently the problem is an almost-but-not-quite-duff disk. This sort
of issue takes a while to work out, as it can take some time to establish
and verify that instability follows a disk, rather than something else.
Agreed, with the caveat that I only see all this nonsense when the disk(s)
in question are SATA. NL-SAS & SAS drives (which are the same thing
really), no problems. They just work. If they break, they break, and you
know it, and they don't take the whole bus with them, for example. Even
when they go especially whacko, they don't generally infect other drives
with their madness.. whereas they nearly always do when it's SATA back
there behind those SAS expanders.

I also can't stress enough your comment on 'takes a while to work out'. I'd
even add 'may never be'. It can be hellishly difficult and time-consuming
to try to discover which SATA drive is your troublemaker. In a few cases,
I've seen the solution be to literally buy or deliver an entirely new build
that no longer has expanders just to plug all the drives into just to find
the bad drive. :(

In the end, I hope this problem is self-correcting in a few years -- SATA
needs to go away, and the long-running reason it hasn't is the lack of SAS
support on desktop motherboards and a market-driven pricing scheme in the
consumer space that had absolutely zero correlation to cost of manufacture
OR cost in bulk in the enterprise space (ever since the introduction of
'NL-SAS' drives, at least). Since SAS support on desktop motherboards seems
to be increasing & improving and consumer price of NL-SAS is approaching
parity with SATA (at least, so I believe from a quick NewEgg search), I
hope we can eventually do away with the SATA altogether. I haven't met
anyone yet who thinks the SATA protocol is BETTER, so hopefully we can all
agree this would ultimately be a good thing.
Chris Siebenmann
2013-10-11 20:25:32 UTC
Permalink
| Since SAS support on desktop motherboards seems to be increasing &
| improving and consumer price of NL-SAS is approaching parity with SATA
| (at least, so I believe from a quick NewEgg search), I hope we can
| eventually do away with the SATA altogether.

We recently went through a preliminary pricing exercise for 2TB 7200
RPM disks and as far as we could determine the price difference between
5-year warranty SATA disks and NL-SAS drives was still enough to be
prohibitive in bulk for our budget levels.

The other wrinkle is SSDs. As far as I could see recently (as part of
our pricing exercise), affordable SSDs are still all SATA. SAS SSDs are
all 'enterprise' and therefore very pricey.

It would be great if all of this changed but I'm not holding my breath.
My cynical expectation is that HD and SSD companies desperately want to
hold their price and margins in the 'enterprise' space and will fiercely
resist lowering prices to the consumer SATA level.

- cks

Marcus Sorensen
2013-10-11 20:36:30 UTC
Permalink
Post by Chris Siebenmann
| Since SAS support on desktop motherboards seems to be increasing &
| improving and consumer price of NL-SAS is approaching parity with SATA
| (at least, so I believe from a quick NewEgg search), I hope we can
| eventually do away with the SATA altogether.
We recently went through a preliminary pricing exercise for 2TB 7200
RPM disks and as far as we could determine the price difference between
5-year warranty SATA disks and NL-SAS drives was still enough to be
prohibitive in bulk for our budget levels.
Yes, if you compare NL-SAS to 'Enterprise' SATA, the cost difference is
almost negligible ($15 on a ~$350 drive for 4TB), but if you're just
comparing to a SATA disk with a decent feature set like a WD Se or Red,
it's still almost double the price (these can be had in 4TB for ~$200).
It depends on what your need is, though.
Dead Horse
2013-10-11 20:55:49 UTC
Permalink
The irony here is that the particular HW setup where I was most easily able
to bring this issue out to debug was in essence a "fishworks" appliance:
ST31000340 drives behind LSI SS1320 interposers located within a stack of
J4400s hooked up to an LSI SAS 9207-8E in an x4170M2 server. Interestingly, I
found that things got worse with Active-Active multipathing, a bit less so
with Active-Passive, and least with no multipathing, but the issue was
nonetheless ever present with this setup. Running the same tests on the exact
same setup, except replacing the Moose drives with some NL-SAS drives,
yielded a nicely working setup.

I did a bit of chatting offline with Brian about this. He mentioned that
LLNL too had run into this, and avoided it by not using expanders and by
using NL-SAS.

He also noted they did take a look at making some improvements in the
error handling within the Linux SCSI mid-layer, but did not have
development time to spare on it.

The relevant error-handling code is actually in drivers/scsi/scsi_error.c.
A thread is created for each attached SCSI host to handle errors; the
scsi_error_handler() function contains the main loop.
For each host it can either use:
scsi_unjam_host() --> (generic recovery code)
OR
the driver can register its own error recovery handler.

I checked, and the mpt2sas driver (in my case) uses scsi_unjam_host().

This is where a "workaround" would have to go, much like what was done in
Solaris.
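
For anyone curious where such a workaround might hook in, here is a rough,
purely illustrative sketch (the example_* names are hypothetical, not from
mpt2sas or any actual patch): a low-level driver can supply its own eh_*
callbacks in its scsi_host_template, and a callback that reports SUCCESS
stops the mid-layer from escalating to wider resets for that device.

/* Purely illustrative; not from any real driver or patch. The eh_*
 * template fields and the SUCCESS/FAILED return codes are the real
 * mid-layer interface; the example_* names are hypothetical. */
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static int example_eh_device_reset(struct scsi_cmnd *cmd)
{
    /*
     * A Solaris-style workaround would inspect the failed command's
     * sense data here and, for a generic hardware error coming back
     * through an interposer, report SUCCESS without actually resetting
     * anything, so the mid-layer does not escalate to target, bus or
     * host resets.
     */
    return SUCCESS;
}

static struct scsi_host_template example_template = {
    .name                     = "example-hba",
    .eh_device_reset_handler  = example_eh_device_reset,
    /*
     * .eh_target_reset_handler, .eh_bus_reset_handler and
     * .eh_host_reset_handler are the further escalation steps the
     * mid-layer tries when the previous level reports FAILED.
     */
};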

Andrew, nonetheless I do agree with you that a fix or "workaround" in the
kernel is not a true fix; thus, IMHO, it is best to avoid the nightmare in
the first place. I put this out there as a warning to any unfortunate soul
using ZOL and thinking about using this type of HW setup (or for those using
it already and wondering WTF is going on with their setup).

- DHC

Dead Horse
2013-10-11 20:58:28 UTC
Permalink
heh heh my latter wording should have been "anyone using ZOL but
unfortunately using this type of HW setup"...
- DHC


Gordan Bobic
2013-10-11 23:30:20 UTC
Permalink
Post by Dead Horse
The irony here is the particular HW setup where I was most easily able
to bring this issue out to debug it was in essence a "fishworks"
appliance. ST31000340 drives behind LSI SS1320interposers located within
a stack of J4400 hooked up to a LSI SAS 9207-E in a x4170M2 server.
ST31000340 series? You are kidding me, right? My experience with those
was that they had a failure rate exceeding 100% within the 5 year
warranty I bought them with (I bought mine in the last week before the
reduced the warranty from 5 years, it expired a few months ago). By more
than 100% I mean to say that out of n drives I bought, I had more than n
replacements (i.e. warranty replacements for warranty replacements). It
would have been even worse had they not started sending me ST31000528
and ST31000524 drives as replacements toward the end.

If you are using those drives, you really are in no position to even
remotely assess the reliability of the rest of your system - until you
get rid of those disks, the rest of the issues will most likely disappear
in the noise.

Gordan


Dead Horse
2013-10-12 02:25:43 UTC
Permalink
"If you are using those drives, you really are in no position to even
remotely assess the reliability of the rest of your system"

You realize, of course, that the ES.2 series Moose drives I am using are the
Sun variants SPECIFIC to Sun, used and developed as part of the second
generation Amber Road storage appliances. As are the J4400 JBOD arrays and
LSI expanders, backplanes and interposers. I have heard about the consumer
ES.2 series Moose drives having issues and high failure rates. That being
said, the ES.2 Moose drives I am using are Sun-specific from a firmware and
design standpoint and are not the same as the consumer equivalent. I think
you missed the point that the system was completely reliable with Solaris
(AKA: when it was a Fishworks Storage Appliance). The point I make now is
that, given a setup that was not specifically engineered to work with a
target OS and version of ZFS (EG: what I have done), the above is the net
result. Additionally, these days it is a lab system with non-critical data
and nothing more than a play toy, so reliability is of no concern. Hence the
reason I used it as one of the systems for testing and playing with ZOL.
- DHC

Gordan Bobic
2013-10-11 23:23:29 UTC
Permalink
Post by Gordan Bobic
If you want an easy life, use the on-board SATA ports connected to
the SB (Intel or AMD, never had a single issue with either). Second
least problematic are cheap SATA controllers. Something like
SIL3132. Yes, it's crap on the performance front, but by and large
it'll just work, including with SIL3726 port multipliers. Marvell
SATA controllers I've also not had problems with when driving disks
directly. But try plugging in a port multiplier and things get
weirder - with a single port expander it works fine. Plug in another
one into the second port, and only the devices on the first port
show up. Plug in one expander and a disk directly to the second
port, and they all show up.
As soon as you start using SAS controllers (my experience is mainly
with LSI ones) things start to creak in various intermittent and
hard to reproduce ways.
And frequently the problem is an almost-but-not-quite-duff disk.
This sort of issue takes a while to work out, as it can take some
time to establish and verify that instability follows a disk, rather
than something else.
Post by Andrew Galloway
Agreed, with the caveat that I only see all this nonsense when the
disk(s) in question are SATA. NL-SAS & SAS drives (which are the same
thing really), no problems. They just work. If they break, they break,
and you know it, and they don't take the whole bus with them, for
example. Even when they go especially whacko, they don't generally
infect other drives with their madness.. whereas they nearly always do
when it's SATA back there behind those SAS expanders.
I also can't stress enough your comment on 'takes a while to work out'.
I'd even add 'may never be'. It can be hellishly difficult and
time-consuming to try to discover which SATA drive is your troublemaker.
In a few cases, I've seen the solution be to literally buy or deliver an
entirely new build that no longer has expanders just to plug all the
drives into just to find the bad drive. :(
It's unfortunate. Disks these days are so bad that I'm amazed people
don't lose data a lot more often.
Post by Andrew Galloway
In the end, I hope this problem is self-correcting in a few years --
SATA needs to go away, and the long-running reason it wasn't was lack of
SAS support on desktop motherboards and a market-driven pricing scheme
in the consumer space that had absolutely zero correlation to cost of
manufacture OR cost in bulk in the enterprise space (ever since the
introduction of 'NL-SAS' drives, at least). Since SAS support on desktop
motherboards seems to be increasing & improving and consumer price of
NL-SAS is approaching parity with SATA (at least, so I believe from a
quick NewEgg search), I hope we can eventually do away with the SATA
altogether. I haven't met anyone yet who thinks the SATA protocol is
BETTER, so hopefully we can all agree this would ultimately be a good thing.
While I'm not saying SATA is better, I'm not especially convinced that
it's worse, either. It's more a case of SAS drives being a generation
behind so most of the bugs get worked out, and given the time lag a bit
of extra engineering goes into them to compensate for the more egregious
issues discovered during the bleeding edge SATA deployment. I'm not
convinced the reliability has much to do with the protocol itself.

Gordan

Luka Morris
2013-10-14 23:48:21 UTC
Permalink
Dead Horse wrote:
[snip]

Hello,
I don't have really extensive experience with interposers, but it seems to me interposers are more likely to do the correct thing than the drives' firmware. Probably because there are just a few models of interposers in the world, instead of changing every few months like the drives, and they are aimed at the enterprise storage business, so they are probably well tested. When I added interposers to a storage system of ours, things improved compared to no interposers (the expander we had could work both ways).

I have the feeling your experience might be another version of "drive X incompatible with interposer Y and controller Z", of which there are many, even without interposers; that's why you see hardware compatibility lists around. Admittedly it's difficult to find an HCL that also includes interposers in the middle.

Another possibility is "drive X is faulty". Suppose a drive responds with a generic error code when a reset is issued, and in particular does that if it is still trying to read a defective sector. Considering the Moose drives are not enterprise drives, I expect they do not have ERC, and hence the max time to recover a sector is usually around 2 minutes.

Now at the first defective sector (or some other kind of error that might generate the first reset), the 30-second Linux default SCSI timeout is reached and a reset is issued --> the drive responds with a generic fault --> Linux responds with another reset --> the drive is still trying to recover the sector so it responds with another generic fault... and so on.
The firmware of the drive might even "screw up" completely in such an unlikely (for a desktop-class drive) situation, which is probably scarcely tested at Seagate's labs (because it's a desktop-class drive). If the firmware switches to an inconsistent state it might keep responding with nonsense until it is powered off and on again. In fact I had a WD drive that sporadically started to respond badly and kept doing so until the next power off, and no kind of reset would fix it, while the other drives of the same model never behaved badly in the first place. Simply replacing that one fixed the thing, i.e. it was not the cabling.

Now I hate Seagates so it must be their fault for sure :-)

Can you check all drives with " smartctl -x /dev/sd... " looking especially at the two sections named:
"SMART Extended Comprehensive Error Log"
and
"SATA Phy Event Counters (GP Log 0x11)"
so as to see if you have a drive significantly different from the others in those two sections, showing more errors somehow, which could indicate the culprit.
Also, can you run a SMART long test on all drives to check that the surface is good on each drive? Note, however, that this cannot completely rule out a firmware bug / electronics bug happening on a specific disk.
You might also want to raise the SCSI layer timeout for all drives to a very large value such as 86400 (=24 hours) or anyway much higher than the human intervention time on that storage system, so maybe next time you could see the thing happening live, stuck before the first reset.
Can you determine the first drive which has given errors and caused reset from dmesg or /var/log/dmesg?
Can you confirm NCQ is disabled on such drives? I read here http://en.wikipedia.org/wiki/Seagate_Barracuda that NCQ behaviour is bugged on Moose drives and here https://ata.wiki.kernel.org/index.php/Known_issues#Seagate_harddrives_which_time_out_FLUSH_CACHE_when_NCQ_is_being_used that Linux should automatically disable NCQ on those drives, but it's better to check. Anyway from http://en.wikipedia.org/wiki/Seagate_Barracuda it seems Moose drives are deeply bugged so I wouldn't be surprised if they turn out to be unsuitable for RAID/ZFS but a workaround might be possible.

One thing I don't understand in your story. You write:
"The initial reset on the SAS expander causes all in-progress IO operations to abort. It should then generate a proper SAS hardware error/sense code. Here is where the problem starts. The interposers instead lose things in protocol translation from SATA --> SAS and instead return a generic hardware error/sense code. "
I thought an expander reset should not reach up to the drives, it should stop before, or should it?

Regards
LM

Dead Horse
2013-10-15 18:50:41 UTC
Permalink
Luka,
Answers inline
"interposers are more likely to do the correct thing than the drives'
firmwares"
Regrettably not always, which is why this problem can and does occur. A
quick Google search will yield many tales of woe. It is also why the
aforementioned fix (workaround) was put into illumos/Solaris in the first
place.
Garrett from Nexenta does a good job explaining things
here:
http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
and here:
http://gdamore.blogspot.com/2010/12/update-on-sata-expanders.html

Open Indiana discuss thread
here:
http://openindiana.org/pipermail/openindiana-discuss/2012-June/008496.html

A good quick technical overview of this generation of interposer can be
found here:
http://www.serialstoragewire.net/Articles/2007_07/developer24.html



"I have the feeling your experience might be another version of "drive X
incompatible with interposer Y and controller Z"
Not the case. Note from my prior replies that this setup is the Sun Amber
Road generation 2 hardware (EG: Sun Storage 7000 series). The actual Sun
variant of the Moose drives I have is the ST31000NSSUN1.0T, and they are
*different* from both a firmware and a design standpoint. Additionally, the
interposer, which is again a Sun variant of the LSISS1320 AAMUX, differs
from the consumer equivalents in both firmware and design. Both drive and
interposer firmware are also tuned and designed to work with the J-Series
arrays and their LSI expander + backplane, the expander and backplane again
being Sun specific. The original technologies were acquired from StorageTek
and manufactured for Sun by Quanta.



"Another possibility is drive X is faulty". "Considering the Moose drives
are not enterprise" "Check smart...."
The drives are fine, no surface issues or otherwise. I spent quite a bit of
time poring over the SMART data from all the drives in this setup to rule
this out. The Moose drives are Enterprise SATA drives, hence the "ES.2", AKA:
(E)nterprise (S)ATA generation 2.

Quoted from the Solaris ZFS discuss lists: "The J series JBODs aren't
overly expensive, it's the darn drives for
them that break the budget." <-- EG: Engineered drive solution for the
Amber Road storage appliances

Also, if you are interested or curious, you can peruse the fishworks
changelogs; much of the underlying history is documented there (at least
what is public knowledge) ;-)
Found here: https://wikis.oracle.com/display/FishWorks/Software+Updates
"Can you determine the first drive which has given errors and caused
reset from dmesg or /var/log/dmesg?"
The attachment to my original mail to the mailing list contains example
output. The drive on which it occurs is random; no one particular disk is at
fault.



"Can you confirm NCQ is disabled on such drives"
This is a problem on *certain* consumer Moose drives and firmware. This
works fine either on or off on the Sun drives and has no linkage to the
issue at hand. I actually have some of the consumer versions here as well
and interestingly I have flashed some of the reported *affected* firmware
on them for fun and tested the reported NCQ issue. I was not actually able
to observe that reported issue with firmware MA0D but I did reproduce it
with SN04.



"One thing I don't understand in your story. You write"
Quoting Garrett from Nexenta:
"The problem is that when a reset occurs on an expander, it aborts any
in-flight operations, and they fail. Unfortunately, the *way* in which they
fail is to generate a generic "hardware error". The problem is that the
sd(7d) driver's response to this is to ... issue another reset, in a futile
effort to hopefully correct things."

In this case replace Solaris (sd) with Linux (sg).

- DHC
