Dead Horse
2013-10-10 15:22:19 UTC
History has a way of repeating itself, this time with ZFS on Linux instead
of Solaris. In a nutshell, using SATA drives + interposers behind a SAS
expander with ZOL ends up in all kinds of fun and games. I have spent a few
hours here and there on a couple of lab servers this past week (J4400
array + Seagate Moose drives + LSI SAS 9207-8E). I looked at some ways to
try to mitigate or fix this issue with ZOL. However, digging under the hood
shows that this not-so-new issue is actually far more sinister under Linux
than it was with Solaris.
Read on for the gory details. (log file attached for the curious as well)
This issue is fundamentally caused by the SATA --> SAS protocol translation
done by the interposers behind a SAS expander. The irony here is that it
takes ZFS to supply the IO loading needed to trip it up easily; MD+LVM or
HW RAID + LVM with this type of setup was never able to create the perfect
storm. I should also note that I did try BTRFS on this setup, and like ZFS
it was also able to bring the issue out.
Basically what is happening here is this: a hardware error occurs (be it a
read error or a drive overwhelmed with cache-sync commands) and a device or
bus reset is issued. What happens next is that the single reset turns into
MANY resets. The initial reset on the SAS expander causes all in-progress
IO operations to abort, and each aborted command should then generate a
proper SAS hardware error/sense code. Here is where the problem starts: the
interposers lose things in the SATA --> SAS protocol translation and
instead return a generic hardware error/sense code. The Linux (sg) driver
then steps in and tries to be helpful by issuing another reset in an effort
to right the ship. If there is a lot of IO going on, e.g. the kind of IO
that ZFS can generate against a disk subsystem like this, needless to say
the ship never rights itself and instead makes a quick trip to Davy
Jones' locker...
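To make the feedback loop concrete, here is a toy model of it (plain
userspace C, NOT kernel code; the counts are made-up numbers for
illustration):

/* toy_reset_storm.c - toy model of the reset feedback loop above. */
#include <stdio.h>

#define INFLIGHT_IO 32  /* commands in flight when a reset hits (made up) */
#define MAX_ROUNDS   5  /* stop the toy model here; the real storm doesn't */

int main(void)
{
        int generic_errors = 1;  /* the initial read/flush-cache failure */
        int resets = 0;

        for (int round = 1; round <= MAX_ROUNDS && generic_errors; round++) {
                /* The error handler answers an unexplained error with
                 * a reset. */
                resets++;
                /* The reset aborts all in-flight IO behind the expander,
                 * and the interposers report each abort as yet another
                 * generic error rather than a proper SAS sense code. */
                generic_errors = INFLIGHT_IO;
                printf("round %d: %d reset(s) issued, %d new generic errors\n",
                    round, resets, generic_errors);
        }
        /* generic_errors never drains to zero, so the resets never stop. */
        return 0;
}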
In Solaris this could be mitigated or worked around by setting
allow-bus-device-reset=0; in sd.conf. With that set, bus-wide resets can
still occur, but sd specifically stops issuing the device/target reset in
response to generic hardware and media errors.
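For reference, in driver.conf syntax that looks like this (a sketch from
memory; I believe a reconfigure boot or driver reload is needed for it to
take effect):

# /kernel/drv/sd.conf
# Disable sd's device/target resets on generic hardware/media errors.
allow-bus-device-reset=0;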
code snippet from Solaris sd.c:

if ((un->un_reset_retry_count != 0) &&
    (xp->xb_retry_count == un->un_reset_retry_count)) {
        mutex_exit(SD_MUTEX(un));
        /* Do NOT do a RESET_ALL here: too intrusive. (4112858) */
        if (un->un_f_allow_bus_device_reset == TRUE) {
                boolean_t try_resetting_target = B_TRUE;

                /*
                 * We need to be able to handle specific ASC when we are
                 * handling a KEY_HARDWARE_ERROR. In particular
                 * taking the default action of resetting the target may
                 * not be the appropriate way to attempt recovery.
                 * Resetting a target because of a single LUN failure
                 * victimizes all LUNs on that target.
                 *
                 * This is true for the LSI arrays, if an LSI
                 * array controller returns an ASC of 0x84 (LUN Dead) we
                 * should trust it.
                 */
                if (sense_key == KEY_HARDWARE_ERROR) {
                        switch (asc) {
                        case 0x84:
                                if (SD_IS_LSI(un)) {
                                        try_resetting_target = B_FALSE;
                                }
                                break;
                        default:
                                break;
                        }
                }

                if (try_resetting_target == B_TRUE) {
                        int reset_retval = 0;
                        if (un->un_f_lun_reset_enabled == TRUE) {
                                SD_TRACE(SD_LOG_IO_CORE, un,
                                    "sd_sense_key_medium_or_hardware_"
                                    "error: issuing RESET_LUN\n");
                                reset_retval =
                                    scsi_reset(SD_ADDRESS(un),
                                    RESET_LUN);
                        }
                        if (reset_retval == 0) {
                                SD_TRACE(SD_LOG_IO_CORE, un,
                                    "sd_sense_key_medium_or_hardware_"
                                    "error: issuing RESET_TARGET\n");
                                (void) scsi_reset(SD_ADDRESS(un),
                                    RESET_TARGET);
                        }
                }
        }
}
For comparison, here is the corresponding code from the Linux kernel (sg)
driver; code snippet from sg.c:
case SG_SCSI_RESET:
        if (sdp->detached)
                return -ENODEV;
        if (filp->f_flags & O_NONBLOCK) {
                if (scsi_host_in_recovery(sdp->device->host))
                        return -EBUSY;
        } else if (!scsi_block_when_processing_errors(sdp->device))
                return -EBUSY;
        result = get_user(val, ip);
        if (result)
                return result;
        if (SG_SCSI_RESET_NOTHING == val)
                return 0;
        switch (val) {
        case SG_SCSI_RESET_DEVICE:
                val = SCSI_TRY_RESET_DEVICE;
                break;
        case SG_SCSI_RESET_TARGET:
                val = SCSI_TRY_RESET_TARGET;
                break;
        case SG_SCSI_RESET_BUS:
                val = SCSI_TRY_RESET_BUS;
                break;
        case SG_SCSI_RESET_HOST:
                val = SCSI_TRY_RESET_HOST;
                break;
        default:
                return -EINVAL;
        }
        if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
                return -EACCES;
        return (scsi_reset_provider(sdp->device, val) ==
                SUCCESS) ? 0 : -EIO;
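For what it's worth, that same SG_SCSI_RESET path can be exercised from
userspace (sg_reset from sg3_utils does essentially this). A minimal
sketch; the /dev/sg2 path is made up, and per the capable() checks above
it has to run as root:

/* sg_reset_sketch.c - ask sg to reset a device via SG_SCSI_RESET. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
        int fd = open("/dev/sg2", O_RDWR | O_NONBLOCK);
        if (fd < 0) {
                perror("open /dev/sg2");
                return 1;
        }
        /* Maps to SCSI_TRY_RESET_DEVICE in the handler above. */
        int val = SG_SCSI_RESET_DEVICE;
        if (ioctl(fd, SG_SCSI_RESET, &val) < 0)
                perror("ioctl SG_SCSI_RESET");
        close(fd);
        return 0;
}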
The difference with Linux is that there appears to be no way (e.g. via
sysfs attributes) to stop (sg) from resetting devices in response to
generic hardware/media errors or unhandled sense codes. Consequently, once
that cycle sets in and escalates to a bus-wide reset, it's a lost cause.
At the end of the day the best thing to do is to use SAS or NL-SAS drives
and avoid SATA + interposers like the plague.
- DHC (AKA: GEHC CST)