Discussion:
zpool replace/add mess up
devsk
2012-07-03 03:44:54 UTC
Folks,

I did something really dumb and need a bit of help. I wanted to replace a
vdev inside a raidz with a raidz vdev, but instead of replacing it, I added it.
Of course, that created a top-level stripe across the existing raidz and the
new one.
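
For reference, roughly what the two commands do (the device names below are
placeholders, not my actual commands):

    # 'replace' swaps one leaf device for another single device and
    # resilvers onto it
    zpool replace <pool> <old-device> <new-device>

    # 'add' grafts a brand-new top-level vdev onto the pool, striping it
    # alongside the existing ones - which is what happened here
    zpool add <pool> raidz <disk1> <disk2> <disk3>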

So, here is how it looks now:

config:

NAME                                                STATE     READ WRITE CKSUM
backup                                              DEGRADED     0     0     0
  -0                                                DEGRADED     0     0     0
    sdg                                             ONLINE       0     0     0
    sdl                                             ONLINE       0     0     0
    /vmware/1tbfile                                 OFFLINE      0     0     0
  -1                                                ONLINE       0     0     0
    ata-ST31000333AS_9TE14ZBS-part3                 ONLINE       0     0     0
    ata-ST31000333AS_9TE18L9P-part3                 ONLINE       0     0     0
    ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR2800078-part3 ONLINE       0     0     0

The goal was to replace the device /vmware/1tbfile (a dummy sparse file)
with the RAIDZ vdev of the 3 devices now shown under the vdev marked -1.

Now, the question: How do I delete the second vdev at the top-level?

zpool remove won't let me remove the devices.

I can't afford to destroy this pool.

Thanks for your help.
-devsk
Fajar A. Nugraha
2012-07-03 03:54:56 UTC
Post by devsk
Folks,
I did something really dumb and need a bit of help. I wanted to replace a
vdev inside a raidz with a raidz vdev, but instead of replacing it, I added it.
Of course, that created a top-level stripe across the existing raidz and the
new one.
NAME                                                STATE     READ WRITE CKSUM
backup                                              DEGRADED     0     0     0
  -0                                                DEGRADED     0     0     0
    sdg                                             ONLINE       0     0     0
    sdl                                             ONLINE       0     0     0
    /vmware/1tbfile                                 OFFLINE      0     0     0
  -1                                                ONLINE       0     0     0
    ata-ST31000333AS_9TE14ZBS-part3                 ONLINE       0     0     0
    ata-ST31000333AS_9TE18L9P-part3                 ONLINE       0     0     0
    ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR2800078-part3 ONLINE       0     0     0
The goal was to replace the device /vmware/1tbfile (a dummy sparse file)
with the RAIDZ vdev of the 3 devices now shown under the vdev marked -1.
CMIIW, I don't think you can have a nested raidz. You can only stripe
top-level vdevs, where each top-level vdev is a single device, a mirror
(raid1), or a raidz.
Post by devsk
Now, the question: How do I delete the second vdev at the top-level?
You can't.

If the new disks are big enough, and your new and old disks are the same
size, your option is probably to:
- replace the new disks with sparse files, located outside your pool
- create new pool with the disks, possibly also using some sparse
files if you want a raidz setup
- send/receive
- destroy old pool
- replace the sparse files with old disks
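
Very roughly, something like this - an untested sketch with placeholder
sizes, paths and pool/device names; wait for each resilver to finish before
moving on:

  # 1. free the three new disks by swapping sparse files (hosted outside
  #    the pool, with enough real space behind them for whatever has to
  #    resilver onto them) into the second raidz
  truncate -s 1T /outside/sparse1 /outside/sparse2 /outside/sparse3
  zpool replace backup ata-ST31000333AS_9TE14ZBS-part3 /outside/sparse1
  # ...repeat for the other two new disks, one at a time...

  # 2. build the new pool from the freed disks, using a sparse file as a
  #    stand-in for a disk still tied up in the old pool, and offline the
  #    stand-in straight away so no real data lands on it
  truncate -s 1T /outside/standin
  zpool create backup2 raidz <freed1> <freed2> <freed3> /outside/standin
  zpool offline backup2 /outside/standin

  # 3. copy everything over
  zfs snapshot -r backup@migrate
  zfs send -R backup@migrate | zfs receive -F backup2

  # 4. retire the old pool, then swap a real disk in for the stand-in
  zpool destroy backup
  zpool replace backup2 /outside/standin <old-disk>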
--
Fajar
devsk
2012-07-03 04:09:27 UTC
you mean I couldn't have done what I intended to do anyway?

Basically, my choice is to back it up, destroy the pool and recreate? Man,
that's a lot of work!

-devsk
Post by Fajar A. Nugraha
Post by devsk
Folks,
I did something really dumb and need a bit of help. I wanted to replace a
vdev inside a raidz with a raidz vdev, but instead of replacing it, I added it.
Of course, that created a top-level stripe across the existing raidz and the
new one.
NAME                                                STATE     READ WRITE CKSUM
backup                                              DEGRADED     0     0     0
  -0                                                DEGRADED     0     0     0
    sdg                                             ONLINE       0     0     0
    sdl                                             ONLINE       0     0     0
    /vmware/1tbfile                                 OFFLINE      0     0     0
  -1                                                ONLINE       0     0     0
    ata-ST31000333AS_9TE14ZBS-part3                 ONLINE       0     0     0
    ata-ST31000333AS_9TE18L9P-part3                 ONLINE       0     0     0
    ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR2800078-part3 ONLINE       0     0     0
The goal was to replace the device /vmware/1tbfile (a dummy sparse file)
with the RAIDZ vdev of the 3 devices now shown under the vdev marked -1.
CMIIW, I don't think you can have a nested raidz. You can only stripe
top-level vdevs, where each top-level vdev is a single device, a mirror
(raid1), or a raidz.
Post by devsk
Now, the question: How do I delete the second vdev at the top-level?
You can't.
If the new disks are big enough, and your new and old disks are the same
size, your option is probably to:
- replace the new disks with sparse files, located outside your pool
- create new pool with the disks, possibly also using some sparse
files if you want a raidz setup
- send/receive
- destroy old pool
- replace the sparse files with old disks
--
Fajar
Fajar A. Nugraha
2012-07-03 04:26:27 UTC
Post by devsk
you mean I couldn't have done what I intended to do anyway?
Yep
Post by devsk
Basically, my choice is to back it up, destroy the pool and recreate? Man,
that's a lot of work!
Yep

You can't remove a top-level vdev, have a nested vdev, or reshape an
existing vdev (e.g. to increase the number of disks in a raidz).

A hardware vendor that I met actually used zfs-unable-to-reshape
characteristic as a selling point on how their storage was "the next
generation", and the way zfs did it was "1st generation" :P
--
Fajar
Uncle Stoatwarbler
2012-07-04 00:38:55 UTC
Post by Fajar A. Nugraha
A hardware vendor that I met actually used zfs-unable-to-reshape
characteristic as a selling point on how their storage was "the next
generation", and the way zfs did it was "1st generation" :P
From my point of view, I agree with them.

Having said that, enterprise users will simply add a new array of disks
as a vdev, so there's no commercial value in making vdevs reshapable.

Hobbyists don't count (unless they provide the code themselves)
Gordan Bobic
2012-07-04 08:02:14 UTC
Post by Uncle Stoatwarbler
Post by Fajar A. Nugraha
A hardware vendor that I met actually used zfs-unable-to-reshape
characteristic as a selling point on how their storage was "the next
generation", and the way zfs did it was "1st generation" :P
From my point of view, I agree with them.
Having said that, enterprise users will simply add a new array of disks
as a vdev, so there's no commercial value in making vdevs reshapable.
Hobbyists don't count (unless they provide the code themselves)
The problem with reshaping arrays is that if you did your homework and
aligned everything up correctly and tuned every layer into an optimal
whole (*1), you are going to suffer a major performance drop after you
reshape your arrays unless you are very, very careful, and even then it
will only result in non-degraded performance if you planned for it when
you originally built the array.

(*1) Hardly anyone ever does, but there is a substantial performance
benefit to doing so. As an aside, that is also one of the key problems
with SANs because the physical disk geometry is completely abstracted
behind multiple opaque layers. It is scarily easy to reduce the
performance of a disk array to that of a single disk.

I covered some of this here:
http://www.altechnative.net/2010/12/31/disk-and-file-system-optimisation/
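
As one small, hedged example of that kind of alignment work on the ZFS side:
ZFS on Linux lets you pin the sector-size shift at pool creation, e.g.
ashift=12 for 4KiB-sector drives; whether that is right depends on the
actual disks:

    zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf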

This is why disk-for-disk the limit of achievable performance is much
lower on an opaque SAN than what you can achieve if you have all the
available disks directly exposed. You can get away with some level of
abstraction without scoring an own-goal if you are careful and have
reasonable visibility/control at each layer in the stack, but I have
never actually seen this applied in the wild. Too many admins nowadays only
know how to click pictures in the web management interface and call the
vendor when something breaks or doesn't perform as well as the salesman
implied it would.

Gordan
Uncle Stoatwarbler
2012-07-04 09:16:25 UTC
Post by Gordan Bobic
The problem with reshaping arrays is that if you did your homework and
aligned everything up correctly and tuned every layer into an optimal
whole (*1), you are going to suffer a major performance drop after you
reshape your arrays unless you are very, very careful, and even then it
will only result in non-degraded performance if you planned for it when
you originally built the array.
Agreed, but in a hobbyist environment people are likely to be reshaping
into a "better than it was" layout.

In any other environment, "disks are cheap" and it makes more sense to
simply do it right to start with, or make a new vdev and migrate the
existing data to it.
Post by Gordan Bobic
(*1) Hardly anyone ever does, but there is a substantial performance
benefit to doing so. As an aside, that is also one of the key problems
with SANs because the physical disk geometry is completely abstracted
behind multiple opaque layers. It is scarily easy to reduce the
performance of a disk array to that of a single disk.
Even when taking care, it's possible to have management screw things up
badly by doing things like insisting that all storage is on the same
arrays. Network /home has vastly different requirements to 30Tb of
near-archival data and putting them both on the same near-line array for
cost reasons is a recipe for pain (My employer is a university and this
is exactly what happened for budgetary reasons - the result is that the
storage cluster and the centrally managed desktops have a very bad
reputation among users)

That's before you even consider the problems caused by beta-quality
clustering software like Redhat's GFS and Linux's badly broken NFS
implementation.
Gordan Bobic
2012-07-04 09:37:53 UTC
Post by Uncle Stoatwarbler
Post by Gordan Bobic
The problem with reshaping arrays is that if you did your homework and
aligned everything up correctly and tuned every layer into an optimal
whole (*1), you are going to suffer a major performance drop after you
reshape your arrays unless you are very, very careful, and even then it
will only result in non-degraded performance if you planned for it when
you originally built the array.
Agreed, but in a hobbyist environment people are likely to be reshaping
into a "better than it was" layout.
Only if there is a capability to do so. If you think about it in the
ext* terms, for example, you cannot change things like stride and
stripe-width once the FS is created, even though you can reshape the
underlying MD RAID.
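
For reference, those ext* knobs are set at creation time; a hedged example
with made-up geometry - a 4-data-disk MD array with a 64KiB chunk and 4KiB
blocks gives stride = 64/4 = 16 and stripe-width = 16 * 4 = 64:

    mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/md0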

Now, I accept that we are talking about adding a new feature to ZFS
here, and in that sense, you would obviously make sure that if you're
going to be re-writing all the data to re-shape the array, you also have
the ability to change any of the FS alignment parameters available. The
other problem with this is that:

1) A power failure can be fatal for your data during reshaping
2) A disk error/failure can be fatal for your data during reshaping
3) With a large array, probability of either is relatively high, so to
play it safe you are again in the boat of it being safer to backup,
re-create and then restore. Not to mention that the performance during
the reshape is going to be pretty awful and it may take longer than
re-mirroring from a peer (OK, maybe I'm biased by having my data always
distributed to more than one host).
Post by Uncle Stoatwarbler
In any other environment, "disks are cheap" and it makes more sense to
simply do it right to start with, or make a new vdev and migrate the
existing data to it.
Indeed. But in reality there are other considerations. If you want to
future proof yourself against disk size increases and plan to upgrade in
situ you really don't want more than 4+2 disks in RAIDZ2. You might have
only 1TB disks now, but once you up that to 4TB disks even 4+2 is going
to be borderline in terms of data safety. And if you're in that boat,
you might as well just add another 4+2 pool rather than mucking round
with re-shaping.
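
That "grow sideways" route is just an add of another top-level vdev - a
hedged one-liner with made-up device names; note that a vdev added this way
cannot be removed again later:

    zpool add tank raidz2 sdg sdh sdi sdj sdk sdl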

Yes, reshaping is a cool feature, but it is only really for those that
can't handle thinking ahead, but that way lies pain, poor performance
and data loss. Thinking ahead is something that should really be encouraged.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
(*1) Hardly anyone ever does, but there is a substantial performance
benefit to doing so. As an aside, that is also one of the key problems
with SANs because the physical disk geometry is completely abstracted
behind multiple opaque layers. It is scarily easy to reduce the
performance of a disk array to that of a single disk.
Even when taking care, it's possible to have management screw things up
badly by doing things like insisting that all storage is on the same
arrays.
That goes without saying. But that is what you get when you have
technical decisions made from the standpoint of politics and ignorance.
Addressing that issue is generally deemed to be beyond the scope of what
we can address by technical means. :)
Post by Uncle Stoatwarbler
Network /home has vastly different requirements to 30Tb of
near-archival data and putting them both on the same near-line array for
cost reasons is a recipe for pain (My employer is a university and this
is exactly what happened for budgetary reasons - the result is that the
storage cluster and the centrally managed desktops have a very bad
reputation among users)
ROFL! Near-line for /home? Seriously?
Post by Uncle Stoatwarbler
That's before you even consider the problems caused by beta-quality
clustering software like Redhat's GFS and Linux's badly broken NFS
implementation.
Oh come on - GFS (GFS1 at least, the two times I used GFS2 I reverted
back to GFS1 because I could reliably break it within a few hours of
operation) is pretty damn solid, and it's been a few years since I
actually had any NFS problems (and most of my networks have /home on
NFS, and a number of farms with / on NFS).

Gordan
Uncle Stoatwarbler
2012-07-05 19:28:05 UTC
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Agreed, but in a hobbyist environment people are likely to be reshaping
into a "better than it was" layout.
Only if there is a capability to do so. If you think about it in the
ext* terms, for example, you cannot change things like stride and
stripe-width once the FS is created, even though you can reshape the
underlying MD RAID.
Yes and if it was created today there might be different design
considerations. Btrfs is quite different in a number of areas aimed at
addressing ext* shortcomings.
Post by Gordan Bobic
Now, I accept that we are talking about adding a new feature to ZFS
here, and in that sense, you would obviously make sure that if you're
going to be re-writing all the data to re-shape the array, you also have
the ability to change any of the FS alignment parameters available. The
1) A power failure can be fatal for your data during reshaping
Only if shortcuts are taken. Reshaping has to be a completely
synchronous operation.
Post by Gordan Bobic
2) A disk error/failure can be fatal for your data during reshaping
Ditto.
Post by Gordan Bobic
3) With a large array, probability of either is relatively high, so to
play it safe you are again in the boat of it being safer to backup,
re-create and then restore.
Raid is no substitute for backups and backups are not there for data
migration. I've seen backups get trashed too (or someone's BOFHed them
to /dev/null)
Post by Gordan Bobic
Not to mention that the performance during
the reshape is going to be pretty awful and it may take longer than
re-mirroring from a peer (OK, maybe I'm biased by having my data always
distributed to more than one host).
Given a choice of rotten performance for 7 days or being down for 3
days, most people will choose the poor performance option.

I regularly migrate multi-Tb datasets from old to new storage in
enterprise environments. It often takes 2-3 weeks to get everything
mirrored before I can cutover and even then the 2 hours down for final
checks causes outrage.
Post by Gordan Bobic
Indeed. But in reality there are other considerations. If you want to
future proof yourself against disk size increases and plan to upgrade in
situ you really don't want more than 4+2 disks in RAIDZ2. You might have
only 1TB disks now, but once you up that to 4TB disks even 4+2 is going
to be borderline in terms of data safety. And if you're in that boat,
you might as well just add another 4+2 pool rather than mucking round
with re-shaping.
That's debatable. I think 12 drive raidz2 is acceptable for large drive
pools where speed isn't critical and to be honest I suspect that SSD
will eat Seagate/WD's lunch in the next 3 years - if not sooner.
Post by Gordan Bobic
Yes, reshaping is a cool feature, but it is only really for those that
can't handle thinking ahead, but that way lies pain, poor performance
and data loss. Thinking ahead is something that should really be encouraged.
Of course but in a hobby environment thinking ahead often comes second
to "what can I afford to buy this month?". Being able to reshape doesn't
mean you have to do it, but being able to means those on tight budgets
aren't put through copying-hell periodically.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Network /home has vastly different requirements to 30Tb of
near-archival data and putting them both on the same near-line array for
cost reasons is a recipe for pain (My employer is a university and this
is exactly what happened for budgetary reasons - the result is that the
storage cluster and the centrally managed desktops have a very bad
reputation among users)
ROFL! Near-line for /home? Seriously?
Seriously. It's only taken 7 years to convince the academics with the
purse strings that this is a bad idea. As far as they're concerned I'm a
blithering idiot wanting to spend a lot of money on "Not very much storage".

Let's not even go into the suggestions being put forward that backups of
~300Tb martian imaging data be performed on Bluray disks by a PhD student.

There are a hell of a lot of people (not just academics) who believe
that because they can use a mouse, they're computer experts and the IT
staff are mostly bigging things up in order to get a pay rise.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
That's before you even consider the problems caused by beta-quality
clustering software like Redhat's GFS and Linux's badly broken NFS
implementation.
Oh come on - GFS (GFS1 at least, the two times I used GFS2 I reverted
back to GFS1 because I could reliably break it within a few hours of
operation) is pretty damn solid, and it's been a few years since I
actually had any NFS problems (and most of my networks have /home on
NFS, and a number of farms with / on NFS).
NFS doesn't play nice with anything else accessing the exported disks,
including server-local processes (locking...) and there's a slight but
provable chance of file/data corruption if anything else does (We've had
this happen).

NFS + GFS(1/2) + slowish drives + heavy use = occasional kernel panics.
We're one of Redhat's "regular phone chats with RH management" customers
as a result of the number of bugs that have been shown up.

As you say: You can break GFS. We can break it faster.

Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of them)
realising the implications for locking. At the time we were - literally
- "kids in bedrooms". It needs to be in userspace, which is where it is
in every other OS and it needs to be there for good reasons.

Because Linux NFS is in kernel space, locks can't be passed to other
cluster nodes. That means the only way to serve any given filesystem is
strictly "one node only" - and even the best NFS server struggles when
clients are pulling a lot of data - the old 30:1 rule of thumb was based
on low duty cycles.

Lustre and Gluster have promise, but the university's central IT group
recently returned a $5 million fileserving setup based on them to the
vendor after failing to have it work reliably in the 18 months since it
had been installed and setup by that vendor. When used as a central
filestore it spent more time down or impaired than up and repeated
vendor failure is one of the factors which led to email being outsourced
to Microsoft Live (Shudder).


A lot of Linux stuff works well in test environments and then breaks
badly under heavy load. ZFS - so far - has been one of the nice exceptions.
Gordan Bobic
2012-07-05 20:01:43 UTC
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Agreed, but in a hobbyist environment people are likely to be reshaping
into a "better than it was" layout.
Only if there is a capability to do so. If you think about it in the
ext* terms, for example, you cannot change things like stride and
stripe-width once the FS is created, even though you can reshape the
underlying MD RAID.
Yes and if it was created today there might be different design
considerations. Btrfs is quite different in a number of areas aimed at
addressing ext* shortcomings.
Please, don't use the b-word on this mailing list. Not sure about
others, but if BTRFS was any good I wouldn't be on this list, and I
wouldn't be using ZFS.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Now, I accept that we are talking about adding a new feature to ZFS
here, and in that sense, you would obviously make sure that if you're
going to be re-writing all the data to re-shape the array, you also have
the ability to change any of the FS alignment parameters available. The
1) A power failure can be fatal for your data during reshaping
Only if shortcuts are taken. Reshaping has to be a completely
synchronous operation.
Post by Gordan Bobic
2) A disk error/failure can be fatal for your data during reshaping
Ditto.
You are assuming that you can reshape things while maintaining the
redundancy at all times. That does not necessarily follow (especially if
you are reshaping to an array with no redundancy, but that is a bit of a
straw-man case).
Post by Uncle Stoatwarbler
Post by Gordan Bobic
3) With a large array, probability of either is relatively high, so to
play it safe you are again in the boat of it being safer to backup,
re-create and then restore.
Raid is no substitute for backups and backups are not there for data
migration. I've seen backups get trashed too (or someone's BOFHed them
to /dev/null)
Sure, but if you have backups the chances are that it'll be much faster
to restore than to reshape.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Not to mention that the performance during
the reshape is going to be pretty awful and it may take longer than
re-mirroring from a peer (OK, maybe I'm biased by having my data always
distributed to more than one host).
Given a choice of rotten performance for 7 days or being down for 3
days, most people will choose the poor performance option.
That does not follow AT ALL. I have seen many cases where a loss of 50%
of performance would essentially result in the system not working for
all intents and purposes.
Post by Uncle Stoatwarbler
I regularly migrate multi-Tb datasets from old to new storage in
enterprise environments. It often takes 2-3 weeks to get everything
mirrored before I can cutover and even then the 2 hours down for final
checks causes outrage.
There are ways and means to work around that. I use lsyncd for such
things. It's a nice, clean way to keep things replicated and up to date
during migration without crippling the system or causing downtime.
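
In its simplest form that is just the following - paths and the rsync target
here are made up, and a real migration would use a proper config file:

    # watch the live tree and keep pushing changes to the new box via rsync
    lsyncd -rsync /srv/old-storage newhost::newvolume
    # at cutover: stop writers (e.g. the NFS service), let the lsyncd queue
    # drain, then point clients at the new server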
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Indeed. But in reality there are other considerations. If you want to
future proof yourself against disk size increases and plan to upgrade in
situ you really don't want more than 4+2 disks in RAIDZ2. You might have
only 1TB disks now, but once you up that to 4TB disks even 4+2 is going
to be borderline in terms of data safety. And if you're in that boat,
you might as well just add another 4+2 pool rather than mucking round
with re-shaping.
That's debatable. I think 12 drive raidz2 is acceptable for large drive
pools where speed isn't critical and to be honest I suspect that SSD
will eat Seagate/WD's lunch in the next 3 years - if not sooner.
Depends on how big those 12 drives are. That's OK with 1TB disks, but
with 4TB disks that would very much not be OK.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Yes, reshaping is a cool feature, but it is only really for those that
can't handle thinking ahead, but that way lies pain, poor performance
and data loss. Thinking ahead is something that should really be encouraged.
Of course but in a hobby environment thinking ahead often comes second
to "what can I afford to buy this month?". Being able to reshape doesn't
mean you have to do it, but being able to means those on tight budgets
aren't put through copying-hell periodically.
Instead they can be put through unusable-slowness hell...
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Network /home has vastly different requirements to 30Tb of
near-archival data and putting them both on the same near-line array for
cost reasons is a recipe for pain (My employer is a university and this
is exactly what happened for budgetary reasons - the result is that the
storage cluster and the centrally managed desktops have a very bad
reputation among users)
ROFL! Near-line for /home? Seriously?
Seriously. It's only taken 7 years to convince the academics with the
purse strings that this is a bad idea. As far as they're concerned I'm a
blithering idiot wanting to spend a lot of money on "Not very much storage".
OK, when you say near-line, what do you actually mean? Big tape library
with some robots? Or are we actually talking spinning disks here? I'm
assuming you mean the former. But for the life of me I cannot figure out
how you might cope with "ls ~" taking minutes rather than milliseconds...
Post by Uncle Stoatwarbler
Let's not even go into the suggestions being put forward that backups of
~300Tb martian imaging data be performed on Bluray disks by a PhD student.
*does the back of a fag packet calculation*
So that is 15,000 BR discs per backup... So, presumably this is once
every 5 years backup? I didn't think you can get a PhD in burning stuff
onto optical disks.
Post by Uncle Stoatwarbler
There are a hell of a lot of people (not just academics) who believe
that because they can use a mouse, they're computer experts and the IT
staff are mostly bigging things up in order to get a pay rise.
There are also a lot of people who think they are programmers because
they can write "Hello world" in HTML. I expect this is related to
reasons why it is today deemed perfectly normal to have a web browser
with a 50MB footprint just to display a blank page. The depressing thing
is that today even lynx takes 8MB to do that.

And yet people have implemented OS-es with web browsers that run on the
likes of C64 in 64KB of RAM - including the OS, IP stack, and a web
browser (that uses clever trickery to load/reload pages since most are
bigger than 64KB). It sure puts modern software into perspective.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
That's before you even consider the problems caused by beta-quality
clustering software like Redhat's GFS and Linux's badly broken NFS
implementation.
Oh come on - GFS (GFS1 at least, the two times I used GFS2 I reverted
back to GFS1 because I could reliably break it within a few hours of
operation) is pretty damn solid, and it's been a few years since I
actually had any NFS problems (and most of my networks have /home on
NFS, and a number of farms with / on NFS).
NFS doesn't play nice with anything else accessing the exported disks,
including server-local processes (locking...) and there's a slight but
provable chance of file/data corruption if anything else does (We've had
this happen).
Sure, but NFS was never designed for concurrent access. You can mostly
get away with it if you completely disable all the caching (including
and especially metadata), though, but the performance is going to be
_painful_.
Post by Uncle Stoatwarbler
NFS + GFS(1/2) + slowish drives + heavy use = occasional kernel panics.
GFS2, sure. But I have not seen that happen on GFS1 at least since RHEL5
was released.
Post by Uncle Stoatwarbler
We're one of Redhat's "regular phone chats with RH management" customers
as a result of the number of bugs that have been shown up.
Yeah, I've mostly given up on filing bugs. They just sit there and
fester until somebody closes them with "see if it works in the latest
release" comment, which is a lame way of saying "we can't even be
bothered to look into it".
Post by Uncle Stoatwarbler
As you say: You can break GFS. We can break it faster.
Well, when I used it I could break GFS2 pretty reliably. But I've not
actually seen GFS1 go wrong.
Post by Uncle Stoatwarbler
Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of them)
realising the implications for locking. At the time we were - literally
- "kids in bedrooms". It needs to be in userspace, which is where it is
in every other OS and it needs to be there for good reasons.
Cheap, fast, reliable - pick any two. :)
Post by Uncle Stoatwarbler
Because Linux NFS is in kernel space, locks can't be passed to other
cluster nodes. That means the only way to serve any given filesystem is
strictly "one node only" - and even the best NFS server struggles when
clients are pulling a lot of data - the old 30:1 rule of thumb was based
on low duty cycles.
There is nothing to stop something like DLM being used to sync the locks
on NFS the way it happens on GFS. But the performance penalty is
_substantial_, especially when you have concurrent access to same files.
Post by Uncle Stoatwarbler
Lustre and Gluster have promise
Not sure about Lustre, but you may want to look at this and do extensive
performance testing before you go the Gluster route:

http://lists.gnu.org/archive/html/gluster-devel/2010-01/msg00043.html

The other issue with it is that it has no notion of fencing and is thus
liable to splitbrain if there is a network outage between the nodes.
Post by Uncle Stoatwarbler
but the university's central IT group
recently returned a $5million fileserving setup based on them to the
vendor after failing to have it work reliably in the 18 months since it
had been installed and setup by that vendor. When used as a central
filestore it spent more time down or impaired than up and repeated
vendor failure is one of the factors which led to email being outsourced
to Microsoft Live (Shudder).
Outsourcing email is actually something I'm seeing a lot of at
universities. The place where I went to university did the same thing.
Post by Uncle Stoatwarbler
A lot of Linux stuff works well in test environments and then breaks
badly under heavy load. ZFS - so far - has been one of the nice exceptions.
Indeed.

Gordan
Uncle Stoatwarbler
2012-07-05 21:32:01 UTC
Post by Gordan Bobic
Please, don't use the b-word on this mailing list. Not sure about
others, but if BTRFS was any good I wouldn't be on this list, and I
wouldn't be using ZFS.
Heh. I know it's currently no good (I was able to trash a btrfs
partition completely without trying very hard), but that may change.

ZFS does different things and it does what I need, much better :)
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I regularly migrate multi-Tb datasets from old to new storage in
enterprise environments. It often takes 2-3 weeks to get everything
mirrored before I can cutover and even then the 2 hours down for final
checks causes outrage.
There are ways and means to work around that. I use lsyncd for such
things. It's a nice, clean way to keep things replicated and up to date
during migration without crippling the system or causing downtime.
I do too, but there's no substitute for putting the FS into readonly
mode for the final pass.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
That's debatable. I think 12 drive raidz2 is acceptable for large drive
pools where speed isn't critical and to be honest I suspect that SSD
will eat Seagate/WD's lunch in the next 3 years - if not sooner.
Depends on how big those 12 drives are. That's OK with 1TB disks, but
with 4TB disks that would very much not be OK.
While that might be true with 512byte sectors, the ECC depth is _much_
better on 4kb sectors and silent error rates are a lot lower. There are
no drives above 2Tb which are 512byte devices.

OTOH I wouldn't trust a >2Tb drive, simply because of the sector failure
rates we're seeing.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Of course but in a hobby environment thinking ahead often comes second
to "what can I afford to buy this month?". Being able to reshape doesn't
mean you have to do it, but being able to means those on tight budgets
aren't put through copying-hell periodically.
Instead they can be put through unusable-slowness hell...
That depends on ram and cpu. I can recall when MD-raid was painfully
slow. :)
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Seriously. It's only taken 7 years to convince the academics with the
purse strings that this is a bad idea. As far as they're concerned I'm a
blithering idiot wanting to spend a lot of money on "Not very much storage".
OK, when you say near-line, what do you actually mean? Big tape library
with some robots? Or are we actually talking spinning disks here? I'm
assuming you mean the former. But for the life of me I cannot figure out
how you might cope with "ls ~" taking minutes rather than milliseconds...
Spinning arrays such as Nexsan Satabeasts and Zyratex Sumos. 96 drives
of goodness in 8U

The "fast" array is planned to be ~3Tb of RAID6 SSD
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Let's not even go into the suggestions being put forward that backups of
~300Tb martian imaging data be performed on Bluray disks by a PhD student.
*does the back of a fag packet calculation*
So that is 15,000 BR discs per backup... So, presumably this is once
every 5 years backup? I didn't think you can get a PhD in burning stuff
onto optical disks.
This factor did get mentioned...
Post by Gordan Bobic
And yet people have implemented OS-es with web browsers that run on the
likes of C64 in 64KB of RAM - including the OS, IP stack, and a web
browser (that uses clever trickery to load/reload pages since most are
bigger than 64KB. It sure puts modern software into perspective.
Indeed...
Post by Gordan Bobic
Sure, but NFS was never designed for concurrent access. You can mostly
get away with it if you completely disable all the caching (including
and especially metadata), though, but the performance is going to be
_painful_.
The problem is that just about everyone (including senior IT staff)
believe it is and can't believe it when you go through the reasons why
not. NFS is just too ingrained in the mind.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
NFS + GFS(1/2) + slowish drives + heavy use = occasional kernel panics.
GFS2, sure. But I have not seen that happen on GFS1 at least since RHEL5
was released.
Moving from RHEL4 to 5 was a hellish experience for us. Redhat broke a
lot of GFS stuff at about 5.2 and GFS1 copes poorly with Tb-size
partitions that have 6 million files in them in any case (deep sky
galactic surveys, in case you were wondering)
Post by Gordan Bobic
Yeah, I've mostly given up on filing bugs. They just sit there and
fester until somebody closes them with "see if it works in the latest
release" comment, which is a lame way of saying "we can't even be
bothered to look into it".
Or "Engineering have declined your request" - this happens more often
than not.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of them)
realising the implications for locking. At the time we were - literally
- "kids in bedrooms". It needs to be in userspace, which is where it is
in every other OS and it needs to be there for good reasons.
Cheap, fast, reliable - pick any two. :)
Exactly. NFS is unsuitable for enterprise deployments, but it's
difficult to convince people of that.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Because Linux NFS is in kernel space, locks can't be passed to other
cluster nodes. That means the only way to serve any given filesystem is
strictly "one node only" - and even the best NFS server struggles when
clients are pulling a lot of data - the old 30:1 rule of thumb was based
on low duty cycles.
There is nothing to stop something like DLM being used to sync the locks
on NFS the way it happens on GFS. But the performance penalty is
_substantial_, especially when you have concurrent access to same files.
Some of the locks aren't passed out of the kernel for DLM to get hold of.
Post by Gordan Bobic
The other issue with it is that it has no notion of fencing and is thus
liable to splitbrain if there is a network outage between the nodes.
I know, but a standalone gluster server has far better performance than
a similar NFS one. :)
Post by Gordan Bobic
Outsourcing email is actually something I'm seeing a lot of at
universities. The place where I went to university did the same thing.
Google, Yahoo and Microsoft are all offering to do it for free and
universities love "free"

The existing core mailsystem was rotten, mainly down to not having had
anything spent on it for 5 years. Our departmental server was in a lot
better shape. :)

Apart from the fact that so far the university has spent about $3
million and 2 years (was supposed to be 3 months) in preparatory work
(which would have made a pretty kickass system for 100k mail users),
there's no such thing as a free lunch and I am highly suspicious of the
motives behind the offers.
Gordan Bobic
2012-07-05 22:11:00 UTC
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Please, don't use the b-word on this mailing list. Not sure about
others, but if BTRFS was any good I wouldn't be on this list, and I
wouldn't be using ZFS.
Heh. I know it's currently no good (I was able to trash a btrfs
partition completely without trying very hard), but that may change.
ZFS does different things and it does what I need, much better :)
ZFS does more things and does them better. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I regularly migrate multi-Tb datasets from old to new storage in
enterprise environments. It often takes 2-3 weeks to get everything
mirrored before I can cutover and even then the 2 hours down for final
checks causes outrage.
There are ways and means to work around that. I use lsyncd for such
things. It's a nice, clean way to keep things replicated and up to date
during migration without crippling the system or causing downtime.
I do too, but there's no substitute for putting the FS into readonly
mode for the final pass.
That's kind of the point - you don't have to do that if you have lsyncd
in place. If it's exported over NFS with lsyncd running underneath, just
stop the NFS service and wait a few seconds for lsyncd queue to drain.
Switch the new NAS IP over and you're good to go.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
That's debatable. I think 12 drive raidz2 is acceptable for large drive
pools where speed isn't critical and to be honest I suspect that SSD
will eat Seagate/WD's lunch in the next 3 years - if not sooner.
Depends on how big those 12 drives are. That's OK with 1TB disks, but
with 4TB disks that would very much not be OK.
While that might be true with 512byte sectors, the ECC depth is _much_
better on 4kb sectors and silent error rates are a lot lower. There are
no drives above 2Tb which are 512byte devices.
OTOH I wouldn't trust a >2Tb drive, simply because of the sector failure
rates we're seeing.
I'm seeing pretty disappointing failure rates on 1TB drives, not just on
sectors but on complete disks. 3 out of 13 failures within the warranty
period, and there's still a fair bit of the warranty period to go. And
that doesn't count the disks that were due for the RMA that were
resurrected by doing a full secure erase to get those unbudging pending
sectors to remap (I can usually get away with that once or twice before
the disk gives up and bricks itself during the secure erase - but at
least then I know they won't send me the same duff disk back as "no
fault found").
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Of course but in a hobby environment thinking ahead often comes second
to "what can I afford to buy this month?". Being able to reshape doesn't
mean you have to do it, but being able to means those on tight budgets
aren't put through copying-hell periodically.
Instead they can be put through unusable-slowness hell...
That depends on ram and cpu. I can recall when MD-raid was painfully
slow. :)
Actually, if the ZFS is quite full (over 50% or so), resilvering with MD
RAID is actually quicker because it does the whole resilvering pass
linearly. Mind you, with reshaping I'm not sure how it compares.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Seriously. It's only taken 7 years to convince the academics with the
purse strings that this is a bad idea. As far as they're concerned I'm a
blithering idiot wanting to spend a lot of money on "Not very much storage".
OK, when you say near-line, what do you actually mean? Big tape library
with some robots? Or are we actually talking spinning disks here? I'm
assuming you mean the former. But for the life of me I cannot figure out
how you might cope with "ls ~" taking minutes rather than milliseconds...
Spinning arrays such as Nexsan Satabeasts and Zyratex Sumos. 96 drives
of goodness in 8U
Ah, OK. That's not all that "near-line" then. Just slow on-line. But
with 96 disks you should be able to squeeze some serious IOPS (maybe 10K
or so) out of it assuming the layout and configuration are sensible.
Post by Uncle Stoatwarbler
The "fast" array is planned to be ~3Tb of RAID6 SSD
Not really that big a deal any more now that you can get 500-1000GB
SSDs. Kingston ones are _really_ good. Extremely compatible. Unlike all
other SSDs I've tried (Intel, OCZ, Integral) they actually work in
proper server enclosures, e.g. HP MSA70. Most others would error out in
seconds. I've not managed to shake the Kingstons loose. And the
performance is pretty decent, too.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Sure, but NFS was never designed for concurrent access. You can mostly
get away with it if you completely disable all the caching (including
and especially metadata), though, but the performance is going to be
_painful_.
The problem is that just about everyone (including senior IT staff)
believe it is and can't believe it when you go through the reasons why
not. NFS is just too ingrained in the mind.
And then there are consultants. There are two definitions I like:

1) People who give you good advice about things they know nothing about.
2) The word consultant came about when a journalist didn't know how to
spell charlatan.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
NFS + GFS(1/2) + slowish drives + heavy use = occasional kernel panics.
GFS2, sure. But I have not seen that happen on GFS1 at least since RHEL5
was released.
Moving from RHEL4 to 5 was a hellish experience for us. Redhat broke a
lot of GFS stuff at about 5.2 and GFS1 copes poorly with Tb-size
partitions that have 6 million files in them in any case (deep sky
galactic surveys, in case you were wondering)
Ah - I never tried partitions that big. I always used multiple smaller
ones. I completely skipped RHEL4 (went straight from 3 to 5, 4 just
seemed too flaky in too many places until very late in the production
run). I think you are talking about the switch of default from GFS1 to
GFS2 in RHEL5.2. That did indeed cause problems, but I knew better than
to touch GFS2 with a barge pole so I never really saw any problems. All
my OSR clusters went through the update swimmingly.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Yeah, I've mostly given up on filing bugs. They just sit there and
fester until somebody closes them with "see if it works in the latest
release" comment, which is a lame way of saying "we can't even be
bothered to look into it".
Or "Engineering have declined your request" - this happens more often
than not.
Indeed, but that typically only kicks it up to the next release rather
than canning the bug report completely. It usually takes 2-3 of those
postponements before somebody does an audit, finds there are enough open
bugs to make them look bad and goes and closes them without fixing them.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of them)
realising the implications for locking. At the time we were - literally
- "kids in bedrooms". It needs to be in userspace, which is where it is
in every other OS and it needs to be there for good reasons.
Cheap, fast, reliable - pick any two. :)
Exactly. NFS is unsuitable for enterprise deployments, but it's
difficult to convince people of that.
I think it more depends on what sort of a use-case you are talking
about. For /home it's generally fine because there will be little or no
concurrent access to any individual file. Then again, I have seen
production databases live on NFS... That was a concerning experience...
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Because Linux NFS is in kernel space, locks can't be passed to other
cluster nodes. That means the only way to serve any given filesystem is
strictly "one node only" - and even the best NFS server struggles when
clients are pulling a lot of data - the old 30:1 rule of thumb was based
on low duty cycles.
There is nothing to stop something like DLM being used to sync the locks
on NFS the way it happens on GFS. But the performance penalty is
_substantial_, especially when you have concurrent access to same files.
Some of the locks aren't passed out of the kernel for DLM to get hold of.
I never said knfsd wouldn't need modifying. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
The other issue with it is that it has no notion of fencing and is thus
liable to splitbrain if there is a network outage between the nodes.
I know, but a standalone gluster server has far better performance than
a similar NFS one. :)
Not so - see the link I posted:

http://lists.gnu.org/archive/html/gluster-devel/2010-01/msg00043.html

It's about 4x slower than userspace nfsd, and about 8x slower than knfsd.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Outsourcing email is actually something I'm seeing a lot of at
universities. The place where I went to university did the same thing.
Google, Yahoo and Microsoft are all offering to do it for free and
universities love "free"
The existing core mailsystem was rotten, mainly down to not having had
anything spent on it for 5 years. Our departmental server was in a lot
better shape. :)
Apart from the fact that so far the university has spent about $3
million and 2 years (was supposed to be 3 months) in preparatory work
(which would have made a pretty kickass system for 100k mail users),
there's no such thing as a free lunch and I am highly suspicious of the
motives behind the offers.
Sounds familiar.

Gordan
Uncle Stoatwarbler
2012-07-05 22:46:19 UTC
Post by Gordan Bobic
Post by Uncle Stoatwarbler
ZFS does different things and it does what I need, much better :)
ZFS does more things and does them better. :)
That too :)
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I do too, but there's no substitute for putting the FS into readonly
mode for the final pass.
That's kind of the point - you don't have to do that if you have lsyncd
in place. If it's exported over NFS with lsyncd running underneath, just
stop the NFS service and wait a few seconds for lsyncd queue to drain.
Switch the new NAS IP over and you're good to go.
Usually the arrays are attached to the same server. This is all FC
stuff. Most of the time it's not a major issue, but lsyncd does get slow
to do things on the large FSes we're dealing with.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
OTOH I wouldn't trust a >2Tb drive, simply because of the sector failure
rates we're seeing.
I'm seeing pretty disappointing failure rates on 1TB drives, not just on
sectors but on complete disks. 3 out of 13 failures within the warranty
period, and there's still a fair bit of the warranty period to go.
I've had 2 of 12 Samsung green drives RMAed so far. The other 2Tb drives
seem a bit more reliable.

Nonetheless, I keep a few 2Tb drives on hand "just in case" and am
seriously considering RAIDZ3
Post by Gordan Bobic
And
that doesn't count the disks that were due for the RMA that were
resurrected by doing a full secure erase to get those unbudging pending
sectors to remap (I can usually get away with that once or twice before
the disk gives up and bricks itself during the secure erase - but at
least then I know they won't send me the same duff disk back as "no
fault found").
"hdparm --repair-sector" usually works for me, but I've had to resort to
secure-erase a couple of times

One RMAed drive smoked itself (literally) and the other would clear all
the pending sectors on a secure erase, but they'd come back immediately
when the drive resilvered. I had to run secure-erase a few times to
convince myself I wasn't imagining it.
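
For anyone following along, roughly what those two hammers look like - the
sector number, password and device below are placeholders, and both
operations are destructive:

    # rewrite a single pending/unreadable sector in place
    hdparm --yes-i-know-what-i-am-doing --repair-sector 1234567 /dev/sdX

    # full ATA secure erase of the whole drive (takes hours)
    hdparm --user-master u --security-set-pass dummy /dev/sdX
    hdparm --user-master u --security-erase dummy /dev/sdX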
Post by Gordan Bobic
Actually, if the ZFS is quite full (over 50% or so), resilvering with MD
RAID is actually quicker because it does the whole resilvering pass
linearly. Mind you, with reshaping I'm not sure how it compares.
It takes about 6 days to resilver my array at the moment. Online
performance is marginally slower but still quite usable (8GB, dual 1.9GHz
Xeons from 2004) - reads can still be done in excess of 20Mb/s and
that's all I need for home use.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Spinning arrays such as Nexsan Satabeasts and Zyratex Sumos. 96 drives
of goodness in 8U
Ah, OK. That's not all that "near-line" then. Just slow on-line. But
with 96 disks you should be able to squeeze some serious IOPS (maybe 10K
or so) out of it assuming the layout and configuration are sensible.
Yup, but now divide that among all the clients using the things
simultaneously and things don't look so rosy. These big data drawers are
optimized for things like video serving.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
The "fast" array is planned to be ~3Tb of RAID6 SSD
Not really that big a deal any more now that you can get 500-1000GB
SSDs. Kingston ones are _really_ good. Extremely compatible. Unlike all
other SSDs I've tried (Intel, OCZ, Integral) they actually work in
proper server enclosures, e.g. HP MSA70. Most others would error out in
seconds. I've not managed to shake the Kingstons loose. And the
performance is pretty decent, too.
I was looking at 256Gb Samsung830s as they have good reviews but it's
really a matter of what the vendors support. The big issue is that a
good dual-redundant active-active FC storage shelf runs to about $8000
as a bare unit and that's what causes resistance to purchase.
Post by Gordan Bobic
1) People who give you good advice about things they know nothing about.
2) The word consultant came about when a journalist didn't know how to
spell charlatan.
"we" (the IT staff) always test consultants to see how much they really
do know, before we let them near management.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Moving from RHEL4 to 5 was a hellish experience for us. Redhat broke a
lot of GFS stuff at about 5.2 and GFS1 copes poorly with Tb-size
partitions that have 6 million files in them in any case (deep sky
galactic surveys, in case you were wondering)
Ah - I never tried partitions that big. I always used multiple smaller
ones.
Which is why you probably never noticed the issues :)

At that point we had about 50 1Tb partitions and something in the region
of 60 million files in the system.
Post by Gordan Bobic
I completely skipped RHEL4 (went straight from 3 to 5, 4 just
seemed too flaky in too many places until very late in the production
run). I think you are talking about the switch of default from GFS1 to
GFS2 in RHEL5.2.
Nope. We simply moved OS and kept GFS1. GFS2 wasn't stable enough to
deploy until RHEL5.4 or so - but it does work better than GFS1 for our
needs.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Or "Engineering have declined your request" - this happens more often
than not.
Indeed, but that typically only kicks it up to the next release rather
than canning the bug report completely.
I've had one bug open since RHEL4 and it's been postponed to RHEL7
Post by Gordan Bobic
It usually takes 2-3 of those
postponements before somebody does an audit, finds there are enough open
bugs to make them look bad and goes and closes them without fixing them.
We won't let them do that, which has made us some enemies within the
company.
Post by Gordan Bobic
I think it more depends on what sort of a use-case you are talking
about. For /home it's generally fine because there will be little or no
concurrent access to any individual file. Then again, I have seen
production databases live on NFS... That was a concerning experience...
You _hope_ there's no concurrent access.

That all goes out the window when you find a user logged in on 6
workstations and sshing to 20 other nodes which all use the same NFS
sources.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Post by Gordan Bobic
There is nothing to stop something like DLM being used to sync the locks
on NFS the way it happens on GFS. But the performance penalty is
_substantial_, especially when you have concurrent access to same files.
Some of the locks aren't passed out of the kernel for DLM to get hold of.
I never said knfsd wouldn't need modifying. :)
Exactly. I've got a bugzilla in for that too.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I know, but a standalone gluster server has far better performance than
a similar NFS one. :)
http://lists.gnu.org/archive/html/gluster-devel/2010-01/msg00043.html
Interesting... but it is NFS+FSCache vs Gluster in those tests. A lot of
our clients are RHEL5 and can't use FSCache.
Post by Gordan Bobic
It's about 4x slower than userspace nfsd, and about 8x slower than knfsd.
My prime concern is handling the concurrent access without risk of data
corruption. Tests here usually showed them on a par with NFS, with performance
falling away rapidly on a per-client basis as the load cranked up.

(You can't just test with one client. That's not a real world
environment for us. It'd need 10 to approach realistic testing)

Another factor which doesn't get discussed much when tuning for file
serving is dcache and inode cache bucket sizes. They can be cranked up
using dhash and ihash kernel parameters at bootup, but there's a hard
coded limit of 5% of total memory, which might have made sense on sub-1Gb
class machines but doesn't on a 48-256Gb one.
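(For reference, the knobs I mean are the dhash_entries= and ihash_entries=
boot parameters. A minimal sketch of a grub kernel line, with made-up sizes
you would tune to your own RAM and workload:

kernel /vmlinuz-2.6.18-... ro root=/dev/sda1 dhash_entries=8388608 ihash_entries=8388608

The 5% ceiling mentioned above is still applied on top of whatever you ask
for.)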
Gordan Bobic
2012-07-05 23:11:47 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I do too, but there's no substitute for putting the FS into readonly
mode for the final pass.
That's kind of the point - you don't have to do that if you have lsyncd
in place. If it's exported over NFS with lsyncd running underneath, just
stop the NFS service and wait a few seconds for lsyncd queue to drain.
Switch the new NAS IP over and you're good to go.
Usually the arrays are attached to the same server. This is all FC
stuff. Most of the time it's not a major issue, but lsyncd does get slow
to do things on the large FSes we're dealing with.
If you have lots of directories to watch and multiple CPU cores, it
might help to run multiple lsyncd instances, each watching a subtree.
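A rough sketch of what I mean, with invented hostnames/paths and using the
rsyncssh layer from lsyncd 2.x - one instance per subtree:

lsyncd -rsyncssh /export/home/a-m newnas /export/home/a-m
lsyncd -rsyncssh /export/home/n-z newnas /export/home/n-z

Each instance gets its own inotify watch set and its own queue, so a burst
of changes in one subtree doesn't starve the others.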
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
OTOH I wouldn't trust a >2Tb drive, simply because of the sector failure
rates we're seeing.
I'm seeing pretty disappointing failure rates on 1TB drives, not just on
sectors but on complete disks. 3 out of 13 failures within the warranty
period, and there's still a fair bit of the warranty period to go.
I've had 2 of 12 Samsung green drives RMAed so far. The other 2Tb drives
seem a bit more reliable.
Nonetheless, I keep a few 2Tb drives on hand "just in case" and am
seriously considering RAIDZ3.
I've documented some of my tales of woe with Samsung disks here:
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/
Post by Uncle Stoatwarbler
Post by Gordan Bobic
And
that doesn't count the disks that were due for the RMA that were
resurrected by doing a full secure erase to get those unbudging pending
sectors to remap (I can usually get away with that once or twice before
the disk gives up and bricks itself during the secure erase - but at
least then I know they won't send me the same duff disk back as "no
fault found").
"hdparm --repair-sector" usually works for me, but I've had to resort to
secure-erase a couple of times
Usually what works for me (if anything is going to work) is:

dd if=/dev/zero of=/dev/$disk bs=512 count=1 seek=$bad_sector oflag=direct

If that fails (and sometimes it does) usually nothing will get it
working again short of a secure erase. Note: oflag=direct seems to be
important - a lot of the time that works, whereas without it the operation
just hangs for ages until it times out.
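(In case it isn't obvious where $bad_sector comes from: the failing LBA
shows up in the kernel log when the read errors out, and you can sanity
check before and after the overwrite, e.g.

dmesg | grep -iE "i/o error|sector"            # failing LBA in the I/O error messages
smartctl -A /dev/$disk | grep -i pending       # Current_Pending_Sector count before/after
hdparm --read-sector $bad_sector /dev/$disk    # should read cleanly once remapped

Device and variable names as in the dd line above.)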
Post by Uncle Stoatwarbler
One RMAed drive smoked itself (literally) and the other would clear all
the pending sectors on a secure erase, but they'd come back immediately
when the drive resilvered. I had to run secure-erase a few times to
convince myself I wasn't imagining it.
Indeed - Samsungs and WDs actually lie about their sector remapping. When
you have pending sectors and you write to them to remap them, the
pending count goes to 0 and the reallocated count stays at 0. It is a
really sad situation when we choose disks not because they are more
reliable (they are all equally terrible nowadays) but purely because at
least they don't lie in their SMART. I'm sticking with Seagates for now.
Plus, they seem to be the only ones that support write-read-verify
feature set (and yes, I have it enabled on all disks that support it
(ZFS or not), write performance be damned when disks are this unreliable).
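(If anyone wants to check whether their drives even advertise it - device
name below is just a placeholder - hdparm's identify dump lists it among
the supported/enabled features:

hdparm -I /dev/sdX | grep -i "write-read-verify"

Recent hdparm builds also have a --write-read-verify get/set option; check
hdparm(8) on your system for the exact usage before relying on it.)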
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Actually, if the ZFS pool is quite full (over 50% or so), resilvering with MD
RAID is quicker because it does the whole resilvering pass
linearly. Mind you, with reshaping I'm not sure how it compares.
It takes about 6 days to resilver my array at the moment. Online
performance is marginally slower but still quite usable (8GB RAM, dual 1.9GHz
Xeons from 2004) - reads can still be done in excess of 20Mb/s and
that's all I need for home use.
Heh, fair. My resilvering time is about a day. Scrub time is about 13 hours.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Spinning arrays such as Nexsan Satabeasts and Zyratex Sumos. 96 drives
of goodness in 8U
Ah, OK. That's not all that "near-line" then. Just slow on-line. But
with 96 disks you should be able to squeeze some serious IOPS (maybe 10K
or so) out of it assuming the layout and configuration are sensible.
Yup, but now divide that among all the clients using the things
simultaneously and things don't look so rosy. These big data drawers are
optimized for things like video serving.
Fair. ZFS + L2ARC? :)
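(For the archives, bolting an SSD onto an existing pool as L2ARC is a
one-liner - pool and device names below are placeholders:

zpool add tank cache /dev/disk/by-id/ata-SOMESSD_SERIAL
zpool iostat -v tank 5     # the cache device and its traffic show up here

Cache devices can also be removed again with "zpool remove", unlike data
vdevs.)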
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
The "fast" array is planned to be ~3Tb of RAID6 SSD
Not really that big a deal any more now that you can get 500-1000GB
SSDs. Kingston ones are _really_ good. Extremely compatible. Unlike all
other SSDs I've tried (Intel, OCZ, Integral) they actually work in
proper server enclosures, e.g. HP MSA70. Most others would error out in
seconds. I've not managed to shake the Kingstons loose. And the
performance is pretty decent, too.
I was looking at 256Gb Samsung830s as they have good reviews but it's
really a matter of what the vendors support. The big issue is that a
good dual-redundant active-active FC storage shelf runs to about $8000
as a bare unit and that's what causes resistance to purchase.
You could get SAS MSA70s and fill them up with the mentioned Kingstons.
Sure HP won't support it, but it works very well. In fact, Kingston are
quite good about providing temporary test samples. They sent us 14 SSDs
to test for a couple of weeks until we were confident they would work in
our setup. Then we bought 25 and sent the 14 test ones back. Granted,
this was for a major high-profile broadcast client that looks good on
their list of customers, but they are reasonably amenable if you are
serious about getting 10-20 disks. And they are easily the most
compatible SATA SSD I have seen. Most others seriously flake out within
seconds when you use them on SAS expanders with NCQ enabled. But the
Kingstons never skipped a beat no matter how hard we hammered them.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
1) People who give you good advice about things they know nothing about.
2) The word consultant came about when a journalist didn't know how to
spell charlatan.
"we" (the IT staff) always test consultants to see how much they really
do know, before we let them near management.
That assumes you are allowed into the loop. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Moving from RHEL4 to 5 was a hellish experience for us. Redhat broke a
lot of GFS stuff at about 5.2 and GFS1 copes poorly with Tb-size
partitions that have 6 million files in them in any case (deep sky
galactic surveys, in case you were wondering)
Ah - I never tried partitions that big. I always used multiple smaller
ones.
Which is why you probably never noticed the issues :)
At that point we had about 50 1Tb partitions and something in the region
of 60 million files in the system.
Millions of files isn't too bad (I used it for a mail cluster with
Maildir on it). But that wasn't anywhere near 50TB. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
I completely skipped RHEL4 (went straight from 3 to 5, 4 just
seemed too flaky in too many places until very late in the production
run). I think you are talking about the switch of default from GFS1 to
GFS2 in RHEL5.2.
Nope. We simply moved OS and kept GFS1. GFS2 wasn't stable enough to
deploy until RHEL5.4 or so - but it does work better than GFS1 for our
needs.
Strange that I never noticed any issues. And it was on OSR clusters,
too, which tend to be more sensitive when it comes to such things.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Or "Engineering have declined your request" - this happens more often
than not.
Indeed, but that typically only kicks it up to the next release rather
than canning the bug report completely.
I've had one bug open since RHEL4 and it's been postponed to RHEL7
LOL!
Post by Uncle Stoatwarbler
Post by Gordan Bobic
It usually takes 2-3 of those
postponements before somebody does an audit, finds there are enough open
bugs to make them look bad and goes and closes them without fixing them.
We won't let them do that, which has made us some enemies within the
company.
Fair, but I don't usually have a big stack of licences to wave at them
as being "put at risk". ;)

But I'm pretty sure I've made some enemies there when I beat them to an
ARM port with RedSleeve. :) (You wouldn't believe how many people from
RH suddenly browsed my LinkedIn profile when RedSleeve got el reg-ed.)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
I think it more depends on what sort of a use-case you are talking
about. For /home it's generally fine because there will be little or no
concurrent access to any individual file. Then again, I have seen
production databases live on NFS... That was a concerning experience...
You _hope_ there's no concurrent access.
That all goes out the window when you find a user logged in on 6
workstations and sshing to 20 other nodes which all use the same NFS
sources.
Well - yes.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Post by Gordan Bobic
There is nothing to stop something like DLM being used to sync the locks
on NFS the way it happens on GFS. But the performance penalty is
_substantial_, especially when you have concurrent access to same files.
Some of the locks aren't passed out of the kernel for DLM to get hold of.
I never said knfsd wouldn't need modifying. :)
Exactly. I've got a bugzilla in for that too.
That's not a bug, it's a feature request. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
I know, but a standalone gluster server has far better performance than
a similar NFS one. :)
http://lists.gnu.org/archive/html/gluster-devel/2010-01/msg00043.html
Interesting... but it is NFS+FSCache vs Gluster in those tests. A lot of
our clients are RHEL5 and can't use FSCache.
No fscache in my tests. The "cache" in those results refers to a GLFS
internal caching accelerator module.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
It's about 4x slower than userspace nfsd, and about 8x slower than knfsd.
My prime concern is handling the concurrent access without risk of data
corruption. Tests here usually showed them on a par with NFS, with performance
falling away rapidly on a per-client basis as the load cranked up.
(You can't just test with one client. That's not a real world
environment for us. It'd need 10 to approach realistic testing)
I'd still be wary of it, but if you have actual test numbers that show
otherwise, I'd be interested in seeing them. I would expect a
heavily concurrent test case to show worse performance degradation, if
anything.
Post by Uncle Stoatwarbler
Another factor which doesn't get discussed much when tuning for file
serving is dcache and inode cache bucket sizes. They can be cranked up
using dhash and ihash kernel parameters at bootup, but there's a hard
coded limit of 5% of total memory, which might have made sense on sub-1Gb
class machines but doesn't on a 48-256Gb one.
Interesting stuff. Link to a tuning guide of some description, perhaps?

Gordan
Uncle Stoatwarbler
2012-07-06 02:41:01 UTC
Permalink
Post by Gordan Bobic
If you have lots of directories to watch and multiple CPU cores, it
might help to run multiple lsyncd instances, each watching a subtree.
I'll look at it on the next move, but for the most part it really is
just as easy to toggle the FS readonly for a couple of hours (unless
it's something critical like /home)
Post by Gordan Bobic
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/
Yes, I've read that. Of course, Samsung no longer make HDDs (They sold
to Seagate earlier this year).
Post by Gordan Bobic
Indeed - Samsungs and WDs actually lie about their sector remapping.
The 2Tb drives didn't. I have a few dozen remapped sectors on them and
the pending sectors all got remapped, except on this particular drive.

What makes the point about HDD reliability is that WD and Seagate have
both reduced warranty periods to between 1 and 2 years (1 year violates
EU consumer protection laws but they did it anyway), whilst SSDs are
coming with 3-5 year warranties as standard.
Post by Gordan Bobic
Heh, fair. My resilvering time is about a day. Scrub time is about 13 hours.
How big is your array though?
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Yup, but now divide that among all the clients using the things
simultaneously and things don't look so rosy. These big data drawers are
optimized for things like video serving.
Fair. ZFS + L2ARC? :)
No, GFS. I _want_ to use ZFS, but the hurdles are currently 1: Not
supported by Redhat and 2: extreme reluctance to move to another distro
or to Solaris.

For the moment that means ZFS stays at home and I bang on it to learn as
much about its foibles as I can without annoying SWMBO.
Post by Gordan Bobic
You could get SAS MSA70s and fill them up with the mentioned Kingstons.
Possibly, but HP left a pretty sour taste in our throats after selling
us cluster fileserver software which didn't work (Steeleye) and a
distribution which we couldn't get support for (Suse - they wouldn't
even return calls from Novell UK's management)

Nexsan are keen to lend us some ssd kit too. I'll have to see where it goes.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
"we" (the IT staff) always test consultants to see how much they really
do know, before we let them near management.
That assumes you are allowed into the loop. :)
If we don't approve it, it doesn't get into the server room. A stack of
hot, noisy hardware in the corner of your non-air-conditioned office
makes a good reminder to consult with us first.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Nope. We simply moved OS and kept GFS1. GFS2 wasn't stable enough to
deploy until RHEL5.4 or so - but it does work better than GFS1 for our
needs.
Strange that I never noticed any issues. And it was on OSR clusters,
too, which tend to be more sensitive when it comes to such things.
Different needs, different loads, etc. RH had a lot of trouble
replicating problems until they setup a cluster with similar sized
storage to ours.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
We won't let them do that, which has made us some enemies within the
company.
Fair, but I don't usually have a big stack of licences to wave at them
as being "put at risk". ;)
We don't have that many (only a couple of hundred). What we do have is
the clout of being one of the world's leading laboratories in our field
and they've found out in the past that gripes from high profile
customers can seriously affect sales.

On an economic front, RH only care about their _large_ clients - the
ones with several tens of thousands of licenses. These are mostly banks,
etc using RH on all the terminals.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Exactly. I've got a bugzilla in for that too.
That's not a bug, it's a feature request. :)
They go in the same area :) It's one of the RFEs that Engineering keep
declining.
Post by Gordan Bobic
No fscache in my tests. The "cache" in those results refers to a GLFS
internal caching accelerator module.
Hm, ok
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Another factor which doesn't get discussed much when tuning for file
serving is dcache and inode cache bucket sizes. They can be cranked up
using dhash and ihash kernel parameters at bootup, but there's a hard
coded limit to 5% of total memory which might have made sense on sub-1Gb
class machines but doesn't on a 48-256Gb one.
Interesting stuff. Link to a tuning guide of some description, perhaps?
There are a bunch, but many are out of date. Most networking and FS
stuff is autotuning, thankfully.

We have an RFE in for removing the 5% limit or making it tunable, but that
might take years to sort out (That part of the kernel is maintained by RH).
Fajar A. Nugraha
2012-07-06 04:31:09 UTC
Permalink
Post by Uncle Stoatwarbler
What makes the point about HDD reliability is that WD and Seagate have both
reduced warranty periods to between 1 and 2 years (1 year violates EU
consumer protection laws but they did it anyway), whilst SSDs are coming
with 3-5 year warranties as standard.
Fusion-io has an unusual (though it makes sense) warranty clause:
http://community.fusionio.com/general_topics1/f/20/t/195.aspx

basically it's the number of years or how much data written, whichever
comes first.
--
Fajar
Gordan Bobic
2012-07-06 06:07:10 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Gordan Bobic
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/
Yes, I've read that. Of course, Samsung no longer make HDDs (They sold
to Seagate earlier this year).
Sure, but it'll be a while before their old product lines disappear /
get absorbed. I just hope the amalgam doesn't end up being the worst of
both worlds...
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Indeed - Samsungs and WDs actually lie about their sector remapping.
The 2Tb drives didn't. I have a few dozen remapped sectors on them and
the pending sectors all got remapped, except on this particular drive.
Good to know that sometimes things actually improve...
Post by Uncle Stoatwarbler
What makes the point about HDD reliability is that WD and Seagate have
both reduced warranty periods to between 1 and 2 years (1 year violates
EU consumer protection laws but they did it anyway), whilst SSDs are
coming with 3-5 year warranties as standard.
I lucked out - most of my Seagates were bought about 2 years ago back
when they still came with 5 year warranty. :)

But the thing about SSDs is that they have a much more predictable
failure rate. Is wear-out covered by the warranty on SSDs?

Having said that, it's really hard to wear out a proper SSD, if these
results are anything to go by:
http://www.xtremesystems.org/forums/showthread.php?271063-SSD-Write-Endurance-25nm-Vs-34nm

Funnily enough Samsung is the only one they managed to kill.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Heh, fair. My resilvering time is about a day. Scrub time is about 13 hours.
How big is your array though?
11+2 RAIDZ2, 1TB disks.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Yup, but now divide that among all the clients using the things
simultaneously and things don't look so rosy. These big data drawers are
optimized for things like video serving.
Fair. ZFS + L2ARC? :)
No, GFS. I _want_ to use ZFS, but the hurdles are currently 1: Not
supported by Redhat and 2: extreme reluctance to move to another distro
or to Solaris.
Oh don't tell me you're in the "not allowed to use anything not
supported by the distro vendor" boat... Seriously, what's the point if
"supported" actually means "If you find a problem we might look into
fixing it in 12+ months"?
Post by Uncle Stoatwarbler
For the moment that means ZFS stays at home and I bang on it to learn as
much about its foibles as I can without annoying SWMBO.
Post by Gordan Bobic
You could get SAS MSA70s and fill them up with the mentioned Kingstons.
Possibly, but HP left a pretty sour taste in our throats after selling
us cluster fileserver software which didn't work (Steeleye) and a
distribution which we couldn't get support for (Suse - they wouldn't
even return calls from Novell UK's management)
Indeed, but that's down to trusting vendor's sales/marketing pitch. In
this case you have somebody without a vested interest that actually
tested this particular combination. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
We won't let them do that, which has made us some enemies within the
company.
Fair, but I don't usually have a big stack of licences to wave at them
as being "put at risk". ;)
We don't have that many (only a couple of hundred). What we do have is
the clout of being one of the world's leading laboratories in our field
and they've found out in the past that gripes from high profile
customers can seriously affect sales.
On an economic front, RH only care about their _large_ clients - the
ones with several tens of thousands of licenses. These are mostly banks,
etc using RH on all the terminals.
In fairness, I don't think that's just RH. Most big vendors are like
that. This is why 90%+ of my clients run CentOS or Scientific Linux.
Vendor support contracts aren't worth the electrons wasted on sending
the PDFs in any case I have seen to date.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Another factor which doesn't get discussed much when tuning for file
serving is dcache and inode cache bucket sizes. They can be cranked up
using dhash and ihash kernel parameters at bootup, but there's a hard
coded limit of 5% of total memory, which might have made sense on sub-1Gb
class machines but doesn't on a 48-256Gb one.
Interesting stuff. Link to a tuning guide of some description, perhaps?
There are a bunch, but many are out of date. Most networking and FS
stuff is autotuning, thankfully.
We have an RFE in for removing the 5% limit or making it tunable, but that
might take years to sort out (That part of the kernel is maintained by RH).
Yeah, I have some kernel bugs (actual bona fide bugs) filed, but I doubt
any will get fixed any time soon. It's surprising how many things you
find are mispatched and broken when you start building the kernel for a
platform that they haven't tested it on. It's nothing short of a miracle
that it builds on x86 - sheer luck.

Gordan
Uncle Stoatwarbler
2012-07-06 09:35:58 UTC
Permalink
Post by Gordan Bobic
But the thing about SSDs is that they have a much more predictable
failure rate. Is wear-out covered by the warranty on SSDs?
Some do.
Post by Gordan Bobic
Having said that, it's really hard to wear out a proper SSD, if these
http://www.xtremesystems.org/forums/showthread.php?271063-SSD-Write-Endurance-25nm-Vs-34nm
Funnily enough Samsung is the only one they managed to kill.
Still.... 2PiB of writes. I doubt I'll be writing the entire capacity of
the disk 10,000+ times (and the 820s have significantly better
write-amplification stats than the 470s)

The backup server has a stripe of Intel X25-E 64Gb drives. They've so
far taken about 2PiB of writes apiece and haven't developed any remapped
sectors.

What's pretty clear is that the specs for SSD endurance for the most
part are conservative.
Post by Gordan Bobic
Oh don't tell me you're in the "not allowed to use anything not
supported by the distro vendor" boat... Seriously, what's the point if
"supported" actually means "If you find a problem we might look into
fixing it in 12+ months"?
That's my argument. On the flipside we have gear in space probes which
might be in service for 25+ years and needs long term support.

Ironically it's the FOSS-based packages which keep going that long. The
code acquired for Cassini will only run on increasingly unreliable
Sparc10/20 boxes as a f'instance.
Post by Gordan Bobic
Indeed, but that's down to trusting vendor's sales/marketing pitch. In
this case you have somebody without a vested interest that actually
tested this particular combination. :)
It worked until scaled up (As usual)
Post by Gordan Bobic
In fairness, I don't think that's just RH. Most big vendors are like
that. This is why 90%+ of my clients run CentOS or Scientific Linux.
Vendor support contracts aren't worth the electrons wasted on sending
the PDFs in any case I have seen to date.
I agree, but academia is a political minefield and having the support
contracts gives us someone to point a finger at.
Post by Gordan Bobic
Yeah, I have some kernel bugs (actual bona fide bugs) filed, but I doubt
any will get fixed any time soon. It's surprising how many things you
find are mispatched and broken when you start building the kernel for a
platform that they haven't tested it on. It's nothing short of a miracle
that it builds on x86 - sheer luck.
I've found what works best is to contact the author of the section. Once
it's fixed upstream then it's easier to get things in the downstream
software.
Gordan Bobic
2012-07-06 10:20:14 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Oh don't tell me you're in the "not allowed to use anything not
supported by the distro vendor" boat... Seriously, what's the point if
"supported" actually means "If you find a problem we might look into
fixing it in 12+ months"?
That's my argument. On the flipside we have gear in space probes which
might be in service for 25+ years and needs long term support.
Ironically it's the FOSS-based packages which keep going that long. The
code acquired for Cassini will only run on increasingly unreliable
Sparc10/20 boxes as a f'instance.
I was just about to say something along those lines. No commercial
company will provide you with 25 years of support and maintenance. But
with FOSS you at least have an option of doing something about it yourself.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Indeed, but that's down to trusting vendor's sales/marketing pitch. In
this case you have somebody without a vested interest that actually
tested this particular combination. :)
It worked until scaled up (As usual)
Well, considering that the infrastructure in question is shifting around
40Gb/s of traffic (biggest pipes we could get), I guess it's a question
of how far do you need it to scale. I couldn't break it with the most
harsh load I could generate, with about 50 threads doing mixed random
and sequential reads/writes. Whatever I did the throughput was maxing
out dual SAS links.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
In fairness, I don't think that's just RH. Most big vendors are like
that. This is why 90%+ of my clients run CentOS or Scientific Linux.
Vendor support contracts aren't worth the electrons wasted on sending
the PDFs in any case I have seen to date.
I agree, but academia is a political minefield and having the support
contracts gives us someone to point a finger at.
Sure - as long as you accept that that's all a support contract is good
for. :)
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Yeah, I have some kernel bugs (actual bona fide bugs) filed, but I doubt
any will get fixed any time soon. It's surprising how many things you
find are mispatched and broken when you start building the kernel for a
platform that they haven't tested it on. It's nothing short of a miracle
that it builds on x86 - sheer luck.
I've found what works best is to contact the author of the section. Once
it's fixed upstream then it's easier to get things in the downstream
software.
Except in this case it's actually to do with some of RH's many patches
mis-patching.

Gordan
Christ Schlacta
2012-07-05 20:26:14 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Agreed, but in a hobbyist environment people are likely to be reshaping
into a "better than it was" layout.
Only if there is a capability to do so. If you think about it in the
ext* terms, for example, you cannot change things like stride and
stripe-width once the FS is created, even though you can reshape the
underlying MD RAID.
Yes and if it was created today there might be different design
considerations. Btrfs is quite different in a number of areas aimed at
addressing ext* shortcomings.
Post by Gordan Bobic
Now, I accept that we are talking about adding a new feature to ZFS
here, and in that sense, you would obviously make sure that if you're
going to be re-writing all the data to re-shape the array, you also have
the ability to change any of the FS alignment parameters available. The
1) A power failure can be fatal for your data during reshaping
Only if shortcuts are taken. Reshaping has to be a completely
synchronous operation.
I have been asking for this off and on for a while, too. I'd gladly
take "Offline reshaping", because I quite imagine it would be a lot
easier to implement than the online equivalent.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
2) A disk error/failure can be fatal for your data during reshaping
Ditto.
Post by Gordan Bobic
3) With a large array, probability of either is relatively high, so to
play it safe you are again in the boat of it being safer to backup,
re-create and then restore.
Raid is no substitute for backups and backups are not there for data
migration. I've seen backups get trashed too (or someone's BOFHed them
to /dev/null)
Backups are a luxury in the home market. In fact, my 5TB data pool IS
MY BACKUP, along with bulk data storage.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Not to mention that the performance during
the reshape is going to be pretty awful and it may take longer than
re-mirroring from a peer (OK, maybe I'm biased by having my data always
distributed to more than one host).
Given a choice of rotten performance for 7 days or being down for 3
days, most people will choose the poor performance option.
I regularly migrate multi-Tb datasets from old to new storage in
enterprise environments. It often takes 2-3 weeks to get everything
mirrored before I can cutover and even then the 2 hours down for final
checks causes outrage.
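(When both ends are ZFS the mirror-then-cutover dance at least scripts
nicely - pool/host names here are invented for the example:

zfs snapshot -r tank@mig1
zfs send -R tank@mig1 | ssh newhost zfs recv -Fdu newtank        # bulk copy while the old pool stays live
zfs snapshot -r tank@mig2
zfs send -R -i tank@mig1 tank@mig2 | ssh newhost zfs recv -Fdu newtank   # small incremental at cutover

The final incremental is what keeps the downtime to minutes rather than
days.)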
Post by Gordan Bobic
Indeed. But in reality there are other considerations. If you want to
future proof yourself against disk size increases and plan to upgrade in
situ you really don't want more than 4+2 disks in RAIDZ2. You might have
only 1TB disks now, but once you up that to 4TB disks even 4+2 is going
to be borderline in terms of data safety. And if you're in that boat,
you might as well just add another 4+2 pool rather than mucking round
with re-shaping.
That's debatable. I think 12 drive raidz2 is acceptable for large
drive pools where speed isn't critical and to be honest I suspect that
SSD will eat Seagate/WD's lunch in the next 3 years - if not sooner.
Post by Gordan Bobic
Yes, reshaping is a cool feature, but it is only really for those that
can't handle thinking ahead, but that way lies pain, poor performance
and data loss. Thinking ahead is something that should really be encouraged.
Of course but in a hobby environment thinking ahead often comes second
to "what can I afford to buy this month?". Being able to reshape
doesn't mean you have to do it, but being able to means those on tight
budgets aren't put through copying-hell periodically.
Even in production, enterprise environments, mistakes happen. We've all
heard tales of someone accidentally adding a single disk to a pool of
raidzn. It would be nice to be able to fix that, by either subsequently
removing the inappropriate vdev, or by converting it to, or replacing it
with, an appropriate vdev.
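In the meantime the best defence is the dry-run flag, which prints the
layout a command would produce without committing anything (names below
are placeholders):

zpool add -n tank raidz2 sdb sdc sdd sde   # show the resulting config, change nothing
zpool attach tank sdc sdd                  # attach mirrors an existing device; add creates a new top-level vdev

It won't undo a mistake, but it makes the add-vs-attach slip much harder
to make in the first place.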
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Network /home has vastly different requirements to 30Tb of
near-archival data and putting them both on the same near-line array for
cost reasons is a recipe for pain (My employer is a university and this
is exactly what happened for budgetary reasons - the result is that the
storage cluster and the centrally managed desktops have a very bad
reputation among users)
ROFL! Near-line for /home? Seriously?
Seriously. It's only taken 7 years to convince the academics with the
purse strings that this is a bad idea. As far as they're concerned I'm
a blithering idiot wanting to spend a lot of money on "Not very much
storage".
Let's not even go into the suggestions being put forward that backups
of ~300Tb of Martian imaging data be performed on Blu-ray discs by a PhD
student.
on an unrelated note, what would you suggest for that backup? 300TB of
data is insanely expensive and fragile on tape, exceedingly slow and
fragile on blu-ray, and not very durable on hard-drive.
Post by Uncle Stoatwarbler
There are a hell of a lot of people (not just academics) who believe
that because they can use a mouse, they're computer experts and the IT
staff are mostly bigging things up in order to get a pay rise.
Tell them to configure a multi-terabyte tiered storage pool and see how
well it goes.
Post by Uncle Stoatwarbler
Post by Gordan Bobic
Post by Uncle Stoatwarbler
That's before you even consider the problems caused by beta-quality
clustering software like Redhat's GFS and Linux's badly broken NFS
implementation.
Oh come on - GFS (GFS1 at least, the two times I used GFS2 I reverted
back to GFS1 because I could reliably break it within a few hours of
operation) is pretty damn solid, and it's been a few years since I
actually had any NFS problems (and most of my networks have /home on
NFS, and a number of farms with / on NFS).
NFS doesn't play nice with anything else accessing the exported disks,
including server-local processes (locking...) and there's a slight but
provable chance of file/data corruption if anything else does (We've
had this happen).
NFS is used read-only around here. Good to know that I've prevented
some issues doing that!
Post by Uncle Stoatwarbler
NFS + GFS(1/2) + slowish drives + heavy use = occasional kernel
panics. We're one of Redhat's "regular phone chats with RH management"
customers as a result of the number of bugs that have been shown up.
There's a userspace nfs driver. It's a lot slower. You're welcome to
use it.
Post by Uncle Stoatwarbler
As you say: You can break GFS. We can break it faster.
Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of
them) realising the implications for locking. At the time we were -
literally - "kids in bedrooms". It needs to be in userspace, which is
where it is in every other OS and it needs to be there for good reasons.
Why can't you add a compatibility layer to pass locks up to and retrieve
them from userspace? It would only affect initial load times, not reads
or writes.
Post by Uncle Stoatwarbler
Because Linux NFS is in kernel space, locks can't be passed to other
cluster nodes. That means the only way to serve any given filesystem
is strictly "one node only" - and even the best NFS server struggles
when clients are pulling a lot of data - the old 30:1 rule of thumb
was based on low duty cycles.
Lustre and Gluster have promise, but the university's central IT group
recently returned a $5million fileserving setup based on them to the
vendor after failing to have it work reliably in the 18 months since
it had been installed and setup by that vendor. When used as a central
filestore it spent more time down or impaired than up and repeated
vendor failure is one of the factors which led to email being
outsourced to Microsoft Live (Shudder).
Who in their right mind relies on a Microsoft product? I've never had
one work right. At least with Linux, you have the tools to fix it if it
breaks.
Also: Was that lustre built atop ZFS? I've been waiting for
lustre+zfsonlinux+a-respectable-linux-distro for a while. You can
request a ubuntu lustre+zfs ppa with me, if you want. It would be very
nice to finally get that brought into being.
Vendors suck anyway. They rip out the "It just works" that the
developers build in, and replace it with proprietary crap :)
Post by Uncle Stoatwarbler
A lot of Linux stuff works well in test environments and then breaks
badly under heavy load. ZFS - so far - has been one of the nice exceptions.
I've noticed that too. Couldn't break redundant failover routers at ALL
in trials, even tried forcibly shutting them down (akin to powering off
by pulling the cord), but somehow, they manage to break in production.
Uncle Stoatwarbler
2012-07-06 20:46:35 UTC
Permalink
Post by Christ Schlacta
on an unrelated note, what would you suggest for that backup? 300TB of
data is insanely expensive and fragile on tape, exceedingly slow and
fragile on blu-ray, and not very durable on hard-drive.
LTO and DLT are not fragile and both have storage lives well beyond my
retirement (20+ years, maybe 30 when I get there).

The current hardware is a Neo 8000 robot with 9 LTO-5 FC drives in it,
fed by a dedicated server running Bacula. It's fully capable of feeding
all 9 drives simultaneously at 160Mb/s and can back up about 12 Tb a day
(limited by the storage media, not the tape drives)

DATs have given tape a bad rep. They're not even good to use as dental
floss. Seriously, it only takes a slight defect in the base to ruin a
tape because they're so thin and the drives themselves are unreliable.

AIT is much better for home use, if you can get hold of 'em and if
you've got enough cash, think about buying an LTO4 drive (800Gb/tape) as
the cartridges are only $20 each. Just remember that you have to feed
them at warp speed or the tape starts "shoe shining" and throughput
drops to virtually zero (ssd spool disks are a must)

CD-Rs have a 2% failure rate after 5 years thanks to dye fade.
CD-RWs last a lot longer because they rely on a crystalline phase change
that needs heat to set and even more heat to unset.
DVD recordables are incredibly fragile because of their sandwich
construction. Any flexing will result in the discs being unreadable after
18 months.
Blu-rays are highly touchy about surface cleanliness and scratching. They're
"cotton glove" material.

Most backup options are unpalatable for the average home user; even
offsite internet backups are hamstrung by link speeds and security
concerns. That all adds up to a world of pain when the inevitable happens.
Post by Christ Schlacta
NFS is used read-only around here. Good to know that I've prevented
some issues doing that!
:-)
Post by Christ Schlacta
There's a userspace nfs driver. It's a lot slower. You're welcome to
use it.
I know it, it was last worked on in 1994. :)
Post by Christ Schlacta
Post by Uncle Stoatwarbler
Linux NFS serving was incorporated into the kernel nearly 20 years ago
for performance reasons, without the people involved (I was one of
them) realising the implications for locking. At the time we were -
literally - "kids in bedrooms". It needs to be in userspace, which is
where it is in every other OS and it needs to be there for good reasons.
Why can't you add a compatibility layer to pass locks up to and retrieve
them from userspace? It would only affect initial load times, not reads
or writes.
I'm well out of the loop and the current NFS guys are only interested in
NFSv4/pNFS
Post by Christ Schlacta
Who in their right mind relies on a Microsoft product?
The decision wasn't made by people with technical knowledge...
Post by Christ Schlacta
Also: Was that lustre built atop ZFS?
No. I'm not sure what it was built on, but my suspicion is Ext3
Post by Christ Schlacta
I've been waiting for
lustre+zfsonlinux+a-respectable-linux-distro for a while. You can
request a ubuntu lustre+zfs ppa with me, if you want. It would be very
nice to finally get that brought into being.
It's an interesting thought. :)
Post by Christ Schlacta
Vendors suck anyway. They rip out the "It just works" that the
developers build in, and replace it with proprietary crap :)
Not always. There are a number of vendors committed to opensource
environments but the main problem is that the average developer doesn't
have access to enough "kit that makes the lights go dim" to actually put
real-world loads on the test rig.
Post by Christ Schlacta
I've noticed that too. Couldn't break redundant failover routers at ALL
in trials, even tried forcibly shutting them down (akin to powering off
by pulling the cord), but somehow, they manage to break in production.
See comment above. This is why you end up paying big money for Cisco or
Juniper, etc. Zebra's good, but being able to pick up the phone and get
vendor support when things break is what makes all the difference.
devsk
2012-07-05 16:14:47 UTC
Permalink
Basically, I ended up recreating the pool. As expected, all the copies,
offlines, removes, nested destroys, creates went smoothly. Moved about 6TB
of data around without any loss over the course of 2-3 days, and finally
have a configuration which I will not need to touch for another 2 years.

Memory handling has become much better in rc9. Previously I wouldn't have
lasted this many ZFS operations without swapping most of my desktop out. So
far, no swap usage. ZFS on Linux is looking really solid.
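(For anyone else watching ZoL memory use: the ARC size and ceiling show up
in /proc/spl/kstat/zfs/arcstats, and the ceiling is the zfs_arc_max module
option - the 4GiB figure below is only an example:

grep -E "^(size|c_max)" /proc/spl/kstat/zfs/arcstats
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf

The module option takes effect on the next module load/reboot.)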

-devsk
Post by Fajar A. Nugraha
Post by devsk
you mean I couldn't have done what I intended to do anyway?
Yep
Post by devsk
Basically, my choice is to back it up, destroy the pool and recreate?
Man,
Post by devsk
that's a lot of work!
Yep
You can't remove a top-level vdev, have nested vdevs, or reshape an
existing vdev (e.g. to increase the number of disks in a raidz).
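What you *can* do in place is swap every member of a vdev for a bigger
disk and let the vdev grow once the last resilver finishes - roughly
(pool/device names are placeholders):

zpool set autoexpand=on tank
zpool replace tank old-disk new-bigger-disk   # repeat per member, waiting for each resilver
zpool list tank                               # capacity jumps after the last replace

Slow, but it doesn't require a second set of hardware.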
A hardware vendor that I met actually used ZFS's inability to reshape
as a selling point for how their storage was "the next
generation", and the way zfs did it was "1st generation" :P
--
Fajar
ledj
2012-07-03 05:50:25 UTC
Permalink
Post by devsk
you mean I couldn't have done what I intended to do anyway?
Basically, my choice is to back it up, destroy the pool and recreate? Man,
that's a lot of work!
Maybe
https://groups.google.com/a/zfsonlinux.org/d/msg/zfs-discuss/c9aGVzKBY-o/5yZ8rlCQ0UMJ
can help you. It didn't in my case. I need to move the data and create a new
pool.
Steve Costaras
2012-07-06 23:18:14 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Christ Schlacta
on an unrelated note, what would you suggest for that backup? 300TB of
data is insanely expensive and fragile on tape, exceedingly slow and
fragile on blu-ray, and not very durable on hard-drive.
LTO and DLT are not fragile and both have storage lives well beyond my
retirement (20+ years, maybe 30 when I get there).
The current hardware is a Neo 8000 robot with 9 LTO-5 FC drives in it,
fed by a dedicated server running Bacula. It's fully capable of feeding
all 9 drives simultaneously at 160Mb/s and can back up about 12 Tb a day
(limited by the storage media, not the tape drives)
Just to add here, I have about the same size array (~224TB, upgrading to
448TB in a bit) and am using LTO4 tapes (yeah, going to LTO6 when released).
Anyway, tape is /very/ viable for large backups as long as you avoid the
low-end tape formats (DAT, and to some extent AIT, especially at the sizes
we're talking about here). The only problem is keeping the tape fed: you
will need a decent-sized staging area (and NOT on the same pool if you're
using ZFS or even LVM) - a dedicated 'scratch' location to spool data to.
Keeping a data stream going at 120 (LTO4), 160 (LTO5) or ~200+ MB/s (LTO6)
is NOT a simple task, and even harder when feeding banks of drives.

Generally, I find that having 1.5-2x as many backup streams as I have drives
(depending on how fast your clients can spool) is a good balance, plus a
spool location large enough to hold (number of drives * native tape capacity
* backup streams, plus 100GB or so) so that you can always be spooling while
de-spooling to a drive. This also usually means good quality spool drives,
not 'green' or slow-speed ones.
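In Bacula terms that staging area is just the data spooling directives -
something along these lines (paths and sizes are illustrative, not my
actual config):

# bacula-sd.conf, in each tape Device resource
  Spool Directory = /spool/bacula       # fast scratch disks, NOT on the pool being backed up
  Maximum Spool Size = 800GB            # roughly one tape's native capacity per stream
# bacula-dir.conf, in the Job or JobDefs resource
  SpoolData = yes                       # spool to disk first, then de-spool to tape at full rate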

Plus tape generally has multiple orders of magnitude better data integrity
than hard drives. LTO4/LTO5, for example, are rated at one unrecoverable bit
error in 10^17 bits, and T10000s at 1 in 10^19. The 'cheap' hard drives
usually compared against tape are in the range of 1 in 10^14, and even the
best HDs are only 1 in 10^16 (enterprise SAS, but capacity is much lower
than tape).