Bryn
2015-02-18 20:36:07 UTC
Hey folks,
I just did a big overhaul on my ZFS server, including moving to a new
mainboard and upgrading from Ubuntu 12.04 to 14.04. Currently I'm
experiencing an issue where the OS just stops dead when attempting to mount
all the filesystems in one of my pools. I'm running ZFS 0.6.3 from the
Stable PPA.
Here's the timeline:
- Everything totally healthy: running 12.04 on the old mainboard with the
14.04 backport kernel
- Upgraded the mainboard, which caused my /dev/disk/by-path entries to
change; unfortunately I wasn't smart enough to export the pools first (oops)
- Sorted out the pools, though one disk in the pool I'm now having trouble
with was stuck in an UNAVAIL state
- After trying multiple ways to sort out that disk, I ended up doing a
secure erase on it, running a zpool replace, and letting it resilver as if
it were new (rough commands sketched just after this timeline); the pool
returned to normal after that
- Ran a scrub on both pools; everything came back healthy
- Upgraded the OS from Ubuntu 12.04 to Ubuntu 14.04
- After the upgrade completed (but before rebooting), ensured that all the
ZFS packages were reinstalled, modules rebuilt, etc.
- Upon booting into 14.04, the system hung when mounting filesystems. The
OS mounts are on a separate non-ZFS disk; I got the console messages
indicating they mounted
- Absolutely no disk activity of any sort; the system isn't actually hung
(you can still hit 'enter' and the console text scrolls a line), but it
isn't doing anything or going anywhere
- Reboot, go into Rescue mode; the system boots to the Rescue screen. Drop
into the root shell: no issue importing pools or mounting there
- Reboot, try to start up normally, experience the same freeze issue
- Reboot back into Rescue mode; discover that if I drop to the root shell
and then just resume booting, I still get stuck the same way
- Reboot back to Rescue mode, export both zpools
- System now boots normally, but of course without the ZFS storage present
- Import each pool separately, rebooting in between; isolate the issue to
my 'big' pool (18 drives split across 3 x 6-drive RAIDZ2 vdevs)
- Try removing the cache/log devices just to rule them out
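For reference, the recovery steps above looked roughly like this. This is a
sketch from memory, not a transcript; the disk placeholders are
illustrative and the real device names came from 'zpool status':

  # Export the pools so the stale by-path labels get dropped
  zpool export ahp.pool
  zpool export video.pool

  # Re-import, pointing ZFS at the directory to scan so it picks the
  # disks up under their new names
  zpool import -d /dev/disk/by-path ahp.pool
  zpool import -d /dev/disk/by-path video.pool

  # After the secure erase, replace the UNAVAIL disk with itself and
  # let it resilver as if it were a new drive
  zpool replace ahp.pool <old-vdev-guid> <new-by-path-device>

  # Then scrub both pools
  zpool scrub ahp.pool
  zpool scrub video.pool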
So where I'm at now:
- The system will boot perfectly and mount my 'small' ZFS pool without
issue.
- The system will NOT continue to boot when trying to mount the 'big' ZFS
pool
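(In case it helps with suggestions: the import and mount steps can be
driven separately by hand from the rescue root shell, which is roughly how
I isolated the pools. A sketch, with my pool names; the -N flag imports
without mounting:

  # Import without mounting any filesystems, one pool at a time
  zpool import -N ahp.pool

  # Then mount datasets explicitly to see where things wedge
  zfs mount ahp.pool
  zfs mount -a    # mount everything, which is what mountall effectively does

Doing this by hand works fine; it's only the normal boot path that sticks.)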
I've tried adding delays and suchlike into the mountall script as suggested
in the FAQ, but it hasn't helped at all. When the ZFS/SPL modules load I
can see lots of disk activity on the drives belonging to the 'trouble' pool,
so I know it's picking them up, but when mountall actually runs it gets
stuck.
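What I tried was along these lines (paraphrased; I don't have the exact FAQ
diff handy, and the sleep length is arbitrary):

  # /etc/init/mountall.conf -- give slow controllers time to finish
  # enumerating disks before mountall runs
  pre-start script
      sleep 30
  end script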
One WEIRD thing I've noticed: if I boot in recovery mode, get to the
recovery menu, and choose 'Enable Networking', there's a mountall-type job
that gets kicked off. That one completes successfully, and the system will
boot up normally after that. However, if I don't kick off the 'Enable
Networking' task, I can wait forever at the Recovery screen, then choose
'Resume', and it'll still be stuck. To me this suggests it's not a timing
issue?
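If someone can tell me what to compare, I can poke at the upstart jobs from
the recovery root shell; e.g. (illustrative commands, since I haven't fully
traced what 'Enable Networking' actually triggers):

  # See which mount-related jobs exist, and their states, while stuck
  initctl list | grep -i mount

  # Try kicking mountall by hand to see if it completes the way the
  # 'Enable Networking' path does
  start mountall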
I've kicked off another scrub on the 'big' pool just to make sure
everything is OK. Oh, and yes, I do have ECC RAM:
dmesg|grep -i "ECC Enabled"
[ 27.071888] EDAC amd64: DRAM ECC enabled.
[ 27.074908] EDAC amd64: DRAM ECC enabled.
Any help much appreciated!!
Bryn
Specs:
- RocketRaid 2760 (basically 3 x Marvell 88SE9485s on a PCIe switch),
firmware at the latest version, running the native 'mvsas' kernel driver
- New mainboard is an Asus KGP(M)E-D16 with a pair of Opteron 6128s
- Kernel is 3.13.0-45-generic (same as it was under 12.04)
- apt-show-versions | grep zfs
dkms:all/trusty 2.2.0.3-1.1ubuntu5.14.04+zfs9~trusty uptodate
libzfs2:amd64/trusty 0.6.3-5~trusty uptodate
mountall:amd64/trusty 2.53-zfs1 uptodate
ubuntu-zfs:amd64/trusty 8~trusty uptodate
zfs-auto-snapshot:all/trusty 1.1.0-0ubuntu1~trusty uptodate
zfs-dkms:amd64/trusty 0.6.3-5~trusty uptodate
zfs-doc:amd64/trusty 0.6.3-5~trusty uptodate
zfsutils:amd64/trusty 0.6.3-5~trusty uptodate
- which zpool
/sbin/zpool
- which zfs
/sbin/zfs
- dmesg|grep -i zfs
[ 48.094235] ZFS: Loaded module v0.6.3-5~trusty, ZFS pool version 5000, ZFS filesystem version 5
- dmesg|grep -i spl
[ 48.012461] spl: module verification failed: signature and/or required key missing - tainting kernel
[ 48.047408] SPL: Loaded module v0.6.3-3~trusty
[ 49.106551] SPL: using hostid 0x007f0101
- zpool status
  pool: ahp.pool
 state: ONLINE
  scan: scrub in progress since Tue Feb 17 20:39:38 2015
        23.6T scanned out of 32.4T at 476M/s, 5h23m to go
        0 repaired, 72.83% done
config:

        NAME                                               STATE     READ WRITE CKSUM
        ahp.pool                                           ONLINE       0     0     0
          raidz2-0                                         ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
          raidz2-1                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
          raidz2-2                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0000000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0100000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0400000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0500000000000000-lun-0  ONLINE       0     0     0
        logs
          ahp.log                                          ONLINE       0     0     0
        cache
          ahp.cache                                        ONLINE       0     0     0

errors: No known data errors

  pool: video.pool
 state: ONLINE
  scan: scrub repaired 0 in 11h43m with 0 errors on Sat Feb 14 03:19:42 2015
config:

        NAME                                               STATE     READ WRITE CKSUM
        video.pool                                         ONLINE       0     0     0
          raidz2-0                                         ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:06:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0300000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0200000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0700000000000000-lun-0  ONLINE       0     0     0
            pci-0000:08:00.0-sas-0x0600000000000000-lun-0  ONLINE       0     0     0
        logs
          sys.sata.ssd-video.log                           ONLINE       0     0     0
        cache
          sys.sata.ssd-video.cache                         ONLINE       0     0     0

errors: No known data errors