Discussion:
[zfs-discuss] Pool Causing Hang At Mount - Ubuntu 14.04
Bryn
2015-02-18 20:36:07 UTC
Hey folks,

I just did a big overhaul on my ZFS server, including moving to a new
mainboard and updating from Ubuntu 12.04 to 14.04. Currently I'm
experiencing an issue where the OS just stops dead when attempting to mount
all the filesystems in one of my pools. I'm running ZFS 0.6.3 from the
Stable PPA.

Here's the timeline:

- Everything totally healthy: running 12.04 on the old mainboard with the
14.04 backport kernel
- Upgraded the mainboard; this caused my /dev/disk/by-path entries to
change, and I unfortunately wasn't smart enough to export the pools first
(oops)
- Sorted out the pools, though one disk in the pool I'm now having trouble
with was stuck in an UNAVAIL state
- After trying multiple ways to sort out that disk I ended up doing a
secure erase on it, running a zpool replace, and letting it resilver as if
it were new; the pool returned to normal after that (commands sketched
after the timeline)
- Ran a scrub on both pools; everything came back healthy

- Upgraded the OS from Ubuntu 12.04 to Ubuntu 14.04
- After the upgrade completed (but before rebooting), ensured that all the
ZFS packages were reinstalled, modules rebuilt, etc.
- Upon booting into 14.04 the system hung when mounting filesystems. The OS
mounts are on a separate non-ZFS disk; I got the console messages
indicating they mounted
- Absolutely no disk activity of any sort; the system is not actually hung
(you can still hit 'enter' and the console text scrolls a line), but it
isn't doing anything or going anywhere

- Rebooted into Rescue mode; the system boots to the Rescue screen, and
from the root shell there's no issue importing pools or mounting
- Rebooted and tried to start up normally; same freeze
- Rebooted back into Rescue mode and discovered that if I drop to the root
shell and then just resume booting, I still get stuck the same way
- Rebooted back into Rescue mode and exported both zpools
- System now boots normally, but of course without the ZFS storage present
- Imported each pool separately, rebooting in between; isolated the issue
to my 'big' pool (18 drives split across 3 x 6-disk RAIDZ2 vdevs)
- Tried removing the cache/log devices just to confirm they weren't the
issue
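
For the record, the recovery boiled down to roughly these commands (a
sketch from memory; the device placeholders are hypothetical):

  zpool export ahp.pool                              # what I should have done before swapping the mainboard
  zpool import -d /dev/disk/by-path ahp.pool         # re-import against the new by-path entries
  zpool replace ahp.pool <old-device> <new-device>   # after the secure erase, resilver the wiped disk back in
  zpool status ahp.pool                              # watch the resilver complete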

So where I'm at now:

- The system will boot perfectly and mount my 'small' ZFS pool without
issue.
- The system will NOT continue to boot when trying to mount the 'big' ZFS
pool.

I've tried adding delays and things into the mountall script as suggested
in the FAQ, but it hasn't helped at all. When the ZFS/SPL modules load I
can see lots of disk activity on the drives belonging to the 'trouble'
pool, so I know it is picking them up, but when mountall actually runs it
gets stuck.

One WEIRD thing I've noticed: if I boot in recovery mode, get to the
recovery menu and choose 'Enable Networking', there's a mountall-type job
that gets kicked off. That one completes successfully, and the system will
boot up normally after that. However, if I don't kick off the 'Enable
Networking' task, I can wait forever at the recovery screen, then choose
'Resume', and it'll still be stuck. To me this suggests it's not a timing
issue?

I've kicked off another scrub on the 'big' pool just to make sure
everything is OK. Oh, and yes, I do have ECC RAM.

dmesg|grep -i "ECC Enabled"
[ 27.071888] EDAC amd64: DRAM ECC enabled.
[ 27.074908] EDAC amd64: DRAM ECC enabled.

Any help much appreciated!!

Bryn

Specs:

- RocketRaid 2760 (basically 3 x Marvell 88SE9485s on a PCIe switch),
firmware at the latest version, running the native 'mvsas' kernel driver

- New mainboard is an Asus KGP(M)E-D16 with a pair of Opteron 6128s

- Kernel is 3.13.0-45-generic (same as it was under 12.04)

- apt-show-versions | grep zfs
dkms:all/trusty 2.2.0.3-1.1ubuntu5.14.04+zfs9~trusty uptodate
libzfs2:amd64/trusty 0.6.3-5~trusty uptodate
mountall:amd64/trusty 2.53-zfs1 uptodate
ubuntu-zfs:amd64/trusty 8~trusty uptodate
zfs-auto-snapshot:all/trusty 1.1.0-0ubuntu1~trusty uptodate
zfs-dkms:amd64/trusty 0.6.3-5~trusty uptodate
zfs-doc:amd64/trusty 0.6.3-5~trusty uptodate
zfsutils:amd64/trusty 0.6.3-5~trusty uptodate

- which zpool
/sbin/zpool

- which zfs
/sbin/zfs


- dmesg|grep -i zfs
[ 48.094235] ZFS: Loaded module v0.6.3-5~trusty, ZFS pool version 5000,
ZFS filesystem version 5

- dmesg|grep -i spl
[ 48.012461] spl: module verification failed: signature and/or required
key missing - tainting kernel
[ 48.047408] SPL: Loaded module v0.6.3-3~trusty
[ 49.106551] SPL: using hostid 0x007f0101

- zpool status
  pool: ahp.pool
 state: ONLINE
  scan: scrub in progress since Tue Feb 17 20:39:38 2015
        23.6T scanned out of 32.4T at 476M/s, 5h23m to go
        0 repaired, 72.83% done
config:

        NAME                                               STATE     READ WRITE CKSUM
        ahp.pool                                           ONLINE       0     0     0
          raidz2-0                                         ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
          raidz2-1                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
          raidz2-2                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
        logs
          ahp.log                                          ONLINE       0     0     0
        cache
          ahp.cache                                        ONLINE       0     0     0

errors: No known data errors

  pool: video.pool
 state: ONLINE
  scan: scrub repaired 0 in 11h43m with 0 errors on Sat Feb 14 03:19:42 2015
config:

        NAME                                               STATE     READ WRITE CKSUM
        video.pool                                         ONLINE       0     0     0
          raidz2-0                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
        logs
          sys.sata.ssd-video.log                           ONLINE       0     0     0
        cache
          sys.sata.ssd-video.cache                         ONLINE       0     0     0

errors: No known data errors


Michael Kjörling
2015-02-18 20:58:36 UTC
Post by Bryn
- apt-show-versions | grep zfs
dkms:all/trusty 2.2.0.3-1.1ubuntu5.14.04+zfs9~trusty uptodate
libzfs2:amd64/trusty 0.6.3-5~trusty uptodate
mountall:amd64/trusty 2.53-zfs1 uptodate
ubuntu-zfs:amd64/trusty 8~trusty uptodate
zfs-auto-snapshot:all/trusty 1.1.0-0ubuntu1~trusty uptodate
zfs-dkms:amd64/trusty 0.6.3-5~trusty uptodate
zfs-doc:amd64/trusty 0.6.3-5~trusty uptodate
zfsutils:amd64/trusty 0.6.3-5~trusty uptodate
- which zpool
/sbin/zpool
- which zfs
/sbin/zfs
- dmesg|grep -i zfs
[ 48.094235] ZFS: Loaded module v0.6.3-5~trusty, ZFS pool version 5000,
ZFS filesystem version 5
- dmesg|grep -i spl
[ 48.012461] spl: module verification failed: signature and/or required
key missing - tainting kernel
[ 48.047408] SPL: Loaded module v0.6.3-3~trusty
[ 49.106551] SPL: using hostid 0x007f0101
Not sure if that's it, but I notice that you have _slightly_ different
versions of ZFS and SPL modules. Intuitively I have a hard time seeing
how it could be causing the behavior you are seeing, and it might be a
false lead, but at the same time I want to point it out just in case
it really is something.
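
If you want to double-check exactly which module versions DKMS built and
which ones are actually loaded, something along these lines should do
(just a suggestion, not output from your system):

  dkms status | egrep 'spl|zfs'       # what DKMS thinks it has built for each kernel
  modinfo spl | grep -i '^version'    # what the installed modules report themselves
  modinfo zfs | grep -i '^version'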

That said, have you tried removing any zpool.cache files
(/etc/zfs/zpool.cache would be the place to start), _including any copies
embedded in your initrds, and making sure to rebuild the initrd afterwards?_
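
On Ubuntu that would be something like this (untested, from memory):

  rm /etc/zfs/zpool.cache       # drop the cachefile left over from the old device layout
  update-initramfs -u -k all    # rebuild the initrds so no stale copy is embedded there
  # a subsequent 'zpool import' should write a fresh cachefile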

What happens if you disable ZFS file system mounting? On the
assumption that the Ubuntu packages are similar to those for Debian,
you might want to try setting ZFS_MOUNT='no' in /etc/default/zfs or
thereabouts. You could also go one step further and edit
/etc/init.d/zfs-mount to pass the -N parameter to zpool import. Those
two together should ensure that only the import, and no mounts, runs
during the boot process, which should help isolate whether the problem
is the pool import or the file system mounting. You could also change
the #! line in /etc/init.d/zfs-mount to "#!/bin/bash -x" for some
additional diagnostic output (specifically, what exact commands are
being evaluated and executed by the shell).
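
The same isolation can also be done by hand from a rescue shell, without
touching the init scripts at all; roughly (the dataset name below is
hypothetical):

  zpool import -N ahp.pool           # import the pool without mounting anything
  zfs mount ahp.pool/some/dataset    # then mount datasets one at a time to see which one wedges
  zfs mount -a                       # or let zfs try them all in 'zfs list' order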

I would probably also cancel the running scrub for the time being, if
it hasn't already run to completion by the time you read this. It
probably doesn't have any real impact on the problem you're seeing,
but in cases like this, I believe in reducing complexity as far as
possible. A running scrub is a potential complexity that doesn't need
to be there while diagnosing an unrelated problem; it can always be
restarted (though of course unfortunately not resumed) later.
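
For reference, stopping it and restarting it later would be:

  zpool scrub -s ahp.pool    # stop the in-progress scrub
  zpool scrub ahp.pool       # restart it from the beginning later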
--
Michael Kjörling • https://michael.kjorling.se • ***@kjorling.se
OpenPGP B501AC6429EF4514 https://michael.kjorling.se/public-keys/pgp
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)

Hajo Möller
2015-02-18 22:23:13 UTC
Post by Michael Kjörling
Not sure if that's it, but I notice that you have _slightly_ different
versions of ZFS and SPL modules.
That's normal for the current "daily" PPA:

***@interzone ~ # dmesg| egrep 'SPL|ZFS'
[...]
[ 9.490164] SPL: Loaded module v0.6.3-3~trusty
[ 9.785933] ZFS: Loaded module v0.6.3-5~trusty, ZFS pool version
5000, ZFS filesystem version 5
--
Regards,
Hajo Möller

Bryn Hughes
2015-02-19 20:23:15 UTC
Huh, I think I have it sorted...

My ZFS mounts looked like this:

ZFS Name            Mountpoint
pool/archive        /archive
pool/Images         /archive/Images
pool/Images/2007    /archive/Images/2007
pool/Images/2008    /archive/Images/2008
pool/Images/2009    /archive/Images/2009
pool/Images/2010    /archive/Images/2010
pool/Images/2011    /archive/Images/2011
<etc>

However, 'zfs list' was coming back like this:

ZFS Name            Mountpoint
pool/Images         /archive/Images
pool/Images/2007    /archive/Images/2007
pool/Images/2008    /archive/Images/2008
pool/Images/2009    /archive/Images/2009
pool/Images/2010    /archive/Images/2010
pool/Images/2011    /archive/Images/2011
<etc>
pool/archive        /archive

What appears to have been happening is that 'mountall' was trying to mount
things out of order, probably following the order you'd see in the output
of 'zfs list'.

The 'fix' was to rename 'pool/Images' to 'pool/archive/Images'. After that,
'zfs list' showed the filesystems in the order they needed to be mounted
in, and 'mountall' seems to do its job fine.
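
In command form, the fix plus the sanity check was essentially this
(using the example names above):

  zfs rename pool/Images pool/archive/Images   # children such as pool/Images/2007 move along with the parent
  zfs list -o name,mountpoint                  # parents now sort ahead of their children, which appears to be the order mountall follows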

Interesting that this did NOT happen on 12.04 but DOES happen on 14.04. I
guess there is a slight difference in the mountall program, or I was just
lucky about the order in which the filesystems were being discovered
previously.

Bryn