Discussion:
[zfs-discuss] Large disk system 240 6T drives
t***@gmail.com
2015-02-18 23:11:56 UTC
Permalink
I am configuring 1PB of storage using Dell hardware (R730 and MD3060e 60-drive
enclosures) with 240 6TB drives.

My thought is to build 2 ~500TB pools with 120 drives each. Then I could
break one pool off to another server if I ever wanted to for additional
performance. The server will have 4 hot spare drives, in addition to the
240 drives.

What I need help on is deciding how many drives per raidz3. I am trying
to decide between 15 and 20 drives per raidz3. Is it absolutely insane
having that many drives together? Most advice I see is around 10-drive
sets, but the overhead of 10-drive sets is too large for this size box.

Thank you for any advice you can give.

-Tom

Uncle Stoatwarbler
2015-02-18 23:21:34 UTC
Permalink
Post by t***@gmail.com
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this
size box.
The geometry for raidZ3 gives best performance with 5, 7, 11, 19 or 35
drives.

That's 2^n data drives plus the redundant drives.

Smaller sets will have higher performance. I'd shoot for 19-drive sets
personally (that's what I'm about to build out in a 392TB setup based
around 4TB drives - 6TB ones are still not worth it on a $/GB basis.)

Make absolutely sure you have enough memory or performance will suffer.
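
For illustration, that rule works out to (3 parity drives for raidz3):

    # enumerate raidz3 widths of the form 2^n data drives + 3 parity
    for n in 1 2 3 4 5; do
        echo "$((2**n + 3)) drives total = $((2**n)) data + 3 parity"
    done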


Thomas Wakefield
2015-02-18 23:26:51 UTC
Permalink
I have 256GB of RAM in the box, I think that's a good start. There are still free slots so I can add more if needed.

I will run the math on 19 drive sets.
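
Back-of-the-envelope, 19-wide (16+3) raidz3 across 120 drives works out to
something like this (my rough numbers, raw data capacity only, before metadata
and free-space headroom):

    drives=120; width=19; parity=3; size_tb=6
    vdevs=$((drives / width))                      # 6 vdevs, 114 drives used
    spare=$((drives - vdevs * width))              # 6 drives left over
    echo "$vdevs x raidz3 (16+3), $spare drives left over"
    echo "raw data capacity: $((vdevs * (width - parity) * size_tb)) TB per 120-drive pool"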


Tom
Post by Uncle Stoatwarbler
Post by t***@gmail.com
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this
size box.
The geometry for raidZ3 gives best performance with 5, 7, 11, 19 or 35
drives.
That's 2^n data drives plus the redundant drives.
Smaller sets will have higher performance. I'd shoot for 19-drive sets
personally (that's what I'm about to build out in a 392TB setup based
around 4TB drives - 6TB ones are still not worth it on a $/GB basis.)
Make absolutely sure you have enough memory or performance will suffer.
Luke Olson
2015-02-18 23:58:09 UTC
Permalink
The number of drives (or geometry as you put it) only matters if
compression is not used. I can't think of a good reason to not use
compression at this point, especially LZ4 compression. This blog post
discusses that and other considerations that might be helpful for the
original poster.

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

How is the PowerEdge R730 configured? If it's low-balled, then it might be
good to split the drives into two pools so one could be split off later for
more performance if needed. Though that machine is fully capable of
handling that many drives if configured with the higher-end processors,
enough memory, and a couple of controllers (i.e. $20,000 to $30,000 on one
machine).

Luke
Post by Uncle Stoatwarbler
Post by t***@gmail.com
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this
size box.
The geometry for raidZ3 gives best performance with 5, 7, 11, 19 or 35
drives.
That's 2^n data drives plus the redundant drives.
Smaller sets will have higher performance. I'd shoot for 19-drive sets
personally (that's what I'm about to build out in a 392TB setup based
around 4TB drives - 6TB ones are still not worth it on a $/GB basis.)
Make absolutely sure you have enough memory or performance will suffer.
Gordan Bobic
2015-02-19 10:06:09 UTC
Permalink
Usefulness of compression depends largely on what you are storing. In the
majority of cases, data that is large is already compressed (e.g. various
media, game texture packs, some databases), so using FS compression will be
unproductive.

Also, in many cases the stripe optimization certainly is still valid even
with compression enabled. As an example, consider a MySQL server with
InnoDB tables. The page size on those is 16KB. We cannot have ashift=14,
the maximum is ashift=13, so the most we can abuse ashift to act as a RAID
block size is to use 8KB sectors, no more. As it is a database server, we
use a stripe of mirrors (RAID10 equivalent), but we don't want each page
write to straddle two disks, as that will waste half of our IOPS capacity.
If the data is compressible, we can enable FS level compression, which will
hopefully compress each page down to 8KB or less. That means that (most of)
our 16KB pages now fit into a single disk sector ("RAID chunk"), and we are
now back to using 1 IOP instead of 2 for each page read.
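
In ZFS terms that setup looks something like the following (device and
dataset names are just placeholders):

    # 8K "sectors" at the vdev level (ashift=13), stripe of mirrors
    zpool create -o ashift=13 tank mirror sda sdb mirror sdc sdd
    # 16K records to match the InnoDB page size; lz4 so pages that compress
    # to 8K or less cost a single-sector I/O instead of two
    zfs create -o recordsize=16k -o compression=lz4 tank/mysql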

If performance is important you really do have to think about things like
this. (It is one of the reasons why "cloud" sucks so badly on performance
compared to similarly specced bare metal.)
Post by Luke Olson
The number of drives (or geometry as you put it) only matters if
compression is not used. I can't think of a good reason to not use
compression at this point, especially LZ4 compression. This blog post
discusses that and other considerations that might be helpful for the
original poster.
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
How is the PowerEdge R730 configured? If it's low balled then it might be
good to split the drives into two pools so one could be split off later for
more performance if needed. Though that machine is fully capable of
handling that many drives if configured with the higher end processors,
enough memory, and a couple of controllers (i.e. $20,000 to $30,000 on one
machine).
Luke
Post by Uncle Stoatwarbler
Post by t***@gmail.com
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this
size box.
The geometry for raidZ3 gives best performance with 5, 7, 11, 19 or 35
drives.
That's 2^n data drives plus the redundant drives.
Smaller sets will have higher performance. I'd shoot for 19-drive sets
personally (that's what I'm about to build out in a 392TB setup based
around 4TB drives - 6TB ones are still not worth it on a $/GB basis.)
Make absolutely sure you have enough memory or performance will suffer.
Fajar A. Nugraha
2015-02-19 10:30:11 UTC
Permalink
Post by Gordan Bobic
Also, in many cases the stripe optimization certainly is still valid even
with compression enabled. As an example, consider a MySQL server with InnoDB
tables. The page size on those is 16KB. We cannot have ashift=14, the
maximum is ashift=13, so the most we can abuse ashift to act as a RAID block
size is to use 8KB sectors, no more. As it is a database server, we use a
stripe of mirrors (RAID10 equivalent), but we don't want each page write to
straddle two disks, as that will waste half of our IOPS capacity. If the
data is compressible, we can enable FS level compression, which will
hopefully compress each page down to 8KB or less. That means that (most of)
our 16KB pages now fit into a single disk sector ("RAID chunk"), and we are
now back to using 1 IOP instead of 2 for each page read.
... or, if you don't need foreign keys, just switch to the TokuDB engine.
Your applications should continue to work as usual, and you can simply
disable compression on the dataset used for DB data. TokuDB should
handle the rest in a write-efficient way.

Sure, there are cases where stripe optimization is valid. However, nowadays
I tend to agree with the Delphix blog: use compression (in TokuDB's case,
it's in the app instead of the FS), and don't spend too much time
super-optimizing your RAID-Z stripe width.
--
Fajar

Gordan Bobic
2015-02-19 10:44:08 UTC
Permalink
Post by Fajar A. Nugraha
Post by Gordan Bobic
Also, in many cases the stripe optimization certainly is still valid even
with compression enabled. As an example, consider a MySQL server with InnoDB
tables. The page size on those is 16KB. We cannot have ashift=14, the
maximum is ashift=13, so the most we can abuse ashift to act as a RAID block
size is to use 8KB sectors, no more. As it is a database server, we use a
stripe of mirrors (RAID10 equivalent), but we don't want each page write to
straddle two disks, as that will waste half of our IOPS capacity. If the
data is compressible, we can enable FS level compression, which will
hopefully compress each page down to 8KB or less. That means that (most of)
our 16KB pages now fit into a single disk sector ("RAID chunk"), and we are
now back to using 1 IOP instead of 2 for each page read.
... or, if you don't need foreign keys, just switch to the TokuDB engine.
Your applications should continue to work as usual, and you can simply
disable compression on the dataset used for DB data. TokuDB should
handle the rest in a write-efficient way.
Sure, there are cases where stripe optimization is valid. However, nowadays
I tend to agree with the Delphix blog: use compression (in TokuDB's case,
it's in the app instead of the FS), and don't spend too much time
super-optimizing your RAID-Z stripe width.
TokuDB is good at some things, but not so good at others. It uses a large
page size which allows it to achieve very good compression ratios. It's
good for streaming inserts, but things don't fare that well for some other
workload profiles. Or to put it differently - test with your own workload,
rather than assuming that some synthetic benchmark you have seen
is a good approximation of anything.

As for foreign key constraints, you can implement those using triggers
anyway - that's what PostgreSQL does. As a MySQL DBA, I often fall
back on the following rule of thumb:
"Ask yourself - what would PostgreSQL do."

I agree that in many cases stripe optimization doesn't matter enough
to warrant much thought being put into it, but in some cases it can
make a huge difference in performance (e.g. double, in the case I
described earlier).

Uncle Stoat
2015-02-19 12:36:41 UTC
Permalink
Post by Gordan Bobic
As for foreign key constraints, you can implement those using triggers
anyway - that's what PostgreSQL does. As a MySQL DBA, I often fall
back on the following rule of thumb:
"Ask yourself - what would PostgreSQL do."
Closely followed by "Why not just install PostgreSQL?"

MySQL is very good for what it's optimised for, but PostgreSQL really
does work better in large or complex environments.

(A real-world example: switching from MySQL to PgSQL on one database
here resulted in memory consumption on the server dropping by at least
70% and query times reducing by around 90%.)

At some point it's easier to move to PgSQL than continue beating
yourself up trying to get further optimizations out of MySQL.

"Appropriate tool for the job" and all that stuff...


Gordan Bobic
2015-02-19 12:49:30 UTC
Permalink
Post by Uncle Stoat
Post by Gordan Bobic
As for foreign key constraints, you can implement those using triggers
anyway - that's what PostgreSQL does. As a MySQL DBA, I often fall
back on the following rule of thumb:
"Ask yourself - what would PostgreSQL do."
Closely followed by "Why not just install PostgreSQL?"
It is the obvious follow-up question, isn't it? :)
Post by Uncle Stoat
MySQL is very good for what it's optimised for, but PostgreSQL really does
work better in large or complex environments.
There are things MySQL is better at, and one obvious example is the power
and simplicity of its built-in replication (including multi-source) and
clustering (Galera; NDB has a distinctly narrow use case). Different
engines are also extremely handy for some uses (e.g. the blackhole engine
for converting large-volume ingress directly to binlogs). And believe it or
not, having non-transactional engines is also useful for some related use
cases.
Post by Uncle Stoat
(A real world example, switching from MySQL to PgSQL on one database here
resulted in memory consumption on the server dropping by at least 70% and
query times reducing by around 90%)
Much as I like PostgreSQL for its tendency toward "doing the right thing",
there is a distinctly plausible chance that the MySQL solution you mention
could have been adjusted to deliver similar performance levels.
Post by Uncle Stoat
At some point it's easier to move to PgSQL than continue beating yourself
up trying to get further optimizations out of MySQL.
"Appropriate tool for the job" and all that stuff...
True, but if you got a 10x improvement from switching to PostgreSQL, there
was something pretty fundamentally wrong with your MySQL solution (other
than that you were using MySQL, that is).

Uncle Stoat
2015-02-19 13:00:39 UTC
Permalink
Post by Gordan Bobic
Much as I like PostgreSQL for its tendency toward "doing the right
thing", there is a distinctly plausible chance that the MySQL solution you
mention could have been adjusted to deliver similar performance levels.
Bacula database. Hundreds of millions of entries....


Durval Menezes
2015-02-18 23:23:09 UTC
Permalink
Howdy Tom,

You don't mention your I/O workload. You seem to be considering only
redundancy vs. reliability, but these are far from the only considerations.
Depending on what your I/O needs are, performance for your specific I/O
workload is something you really should be thinking about...

Cheers,
--
Durval.
Post by t***@gmail.com
I am configuring a 1P of storage using Dell hardware (r730 and md3060e 60
drive enclosures) with 240 6T drives.
My thought is to build 2 ~500T pools with 120 drives each. Then I could
break one pool off to another server if i ever wanted to for additional
performance. The server will have 4 hot spare drives, in addition to the
240 drives.
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this size
box.
Thank you for any advice you can give.
-Tom
Thomas Wakefield
2015-02-18 23:31:39 UTC
Permalink
It's a mixed workload, so I/O isn't my main factor. I am sure that 1-2GB/s will cover what I need, and I will easily get that out of this box.

I already have another similar box, but it has a controller-based RAID setup, not ZFS doing the RAID. That box is plenty fast, and will get rebuilt without RAID controllers when this new box is online.

That's why the focus of this thread is on best practices for raidz3 set sizes.

Thomas
Post by Durval Menezes
Howdy Tom,
You don't mention your I/O workload. You seem to be considering only redundancy x reliability, but these are far from the only considerations. Depending on what your I/O needs are, performance for your specific I/O workload is something you really should be thinking about...
Cheers,
--
Durval.
I am configuring a 1P of storage using Dell hardware (r730 and md3060e 60 drive enclosures) with 240 6T drives.
My thought is to build 2 ~500T pools with 120 drives each. Then I could break one pool off to another server if i ever wanted to for additional performance. The server will have 4 hot spare drives, in addition to the 240 drives.
What I need help on is deciding how many drives per raidz3? I am trying to decide between 15 and 20 drives per raidz3. Is it absolutely insane having that many drives together? Most advice I see is around 10-drive sets, but the overhead of 10-drive sets is too large for this size box.
Thank you for any advice you can give.
-Tom
Durval Menezes
2015-02-18 23:38:10 UTC
Permalink
Hi Tom,
Post by Thomas Wakefield
It's a mixed workload, so IO isn't my main factor. I am sure that 1-2GB/s
will cover what I need, and I will easily get that out of this box.
1-2GB/s should be reachable *easily* on your setup... that is, for
sequential read performance. How many read IOPS do you need?

Also, please keep in mind that each raidz3 will give you the write
performance (both MB/s and IOPS) of just a single drive... so if some of
your workload consists of heavy writing, and you can separate that from the
rest, I would consider setting up a couple of disks for a mirror zpool.
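
Something along these lines, for example (device names are placeholders):

    # small dedicated pool for the write-heavy part of the workload
    zpool create writepool mirror sdy sdz
    zfs create writepool/scratch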

Post by Thomas Wakefield
I already have another similar box, but it has a controller-based RAID
setup, not ZFS doing the RAID. That box is plenty fast, and will get
rebuilt without RAID controllers when this new box is online.
ZFS is much slower for writes than most RAID{5,6} controllers due to the
emphasis on reliability (no write hole, for example, which is something
most controller RAIDs can't claim).
Post by Thomas Wakefield
That's why the focus of this thread is on best practices for raidz3 set sizes.
Think about your writing needs. It sucks to have to redo everything once
it's ready due to bad write performance (ask me how I know...)

Cheers.
--
Durval.
Post by Thomas Wakefield
Thomas
Howdy Tom,
You don't mention your I/O workload. You seem to be considering only
redundancy x reliability, but these are far from the only considerations.
Depending on what your I/O needs are, performance for your specific I/O
workload is something you really should be thinking about...
Cheers,
--
Durval.
Post by t***@gmail.com
I am configuring a 1P of storage using Dell hardware (r730 and md3060e 60
drive enclosures) with 240 6T drives.
My thought is to build 2 ~500T pools with 120 drives each. Then I could
break one pool off to another server if i ever wanted to for additional
performance. The server will have 4 hot spare drives, in addition to the
240 drives.
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this size
box.
Thank you for any advice you can give.
-Tom
Kash Pande
2015-02-18 23:39:51 UTC
Permalink
Post by Durval Menezes
Post by Thomas Wakefield
That's why the focus of this thread is on best practices for raidz3 set sizes.
Think about your writing needs. It sucks to have to redo everything
once it's ready due to bad write performance (ask me how I know...)
And then when the server is so slow you can't copy the data off in a
timely fashion to a properly configured array you're REALLY screwed! :-)

Durval Menezes
2015-02-18 23:46:42 UTC
Permalink
Hello Kash,
Post by Kash Pande
Post by Durval Menezes
Post by Thomas Wakefield
That's why the focus of this thread is on best practices for raidz3 set sizes.
Think about your writing needs. It sucks to have to redo everything once
it's ready due to bad write performance (ask me how I know...)
And then when the server is so slow you can't copy the data off in a
timely fashion to a properly configured array you're REALLY screwed! :-)
Well, I don't need to ask you how you know :-) Seriously, one of the best
parts of posting to this list is becoming aware of folks in even more
fscked-up situations than I had to go through... makes for a major relief,
and even some belief that after all the Universe isn't after me
*personally*... ;-)

That said, when it happened to me I had to do the copying during an
extended-weekend-plus-holiday 3-day maintenance window... if I hadn't been
given that window, things probably would have *really* sucked, as performance
was disastrous... :-(

Cheers,
--
Durval.
Hajo Möller
2015-02-18 23:46:49 UTC
Permalink
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools where my
client insisted on enabling dedup. Moving 50 TB off that pool won't be fun.
--
Regards,
Hajo Möller

Kash Pande
2015-02-18 23:50:58 UTC
Permalink
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools,
my client insisted on enabling dedup. Moving 50 TB off that pool won't
be fun.
My boss, and close friend, will always bring up how I killed 900GB of
random data because I used dedup and it simply became... 'inaccessible'.
There are myths of Oracle loaning (presumably for $$$$$) some servers with
massive RAM capacity just to import pools in those situations. Luckily
we had drives we could composite the missing data from... but it still
happened. It was an important lesson.



Did I ever tell you about the old Sun E250 that had a 15-disk RAID5? No
similar controllers around, no replacement disks available, oh, and it
was the only copy of all the university email... Policy meant there was
no expiry on the data, so it had to be stored indefinitely. The server
was too slow to allow offloading the data within the SLA window. I think
my colleagues were just waiting for it to die - I had several solutions
they didn't want to implement. Sometimes when I see people building
200+TB pools I'm wondering if they have the same sadistic urges that
coworker did.

Blake Dunlap
2015-02-18 23:54:50 UTC
Permalink
Those who do not understand what IOPS are, are doomed to learn...


Film at 11...
Post by Kash Pande
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools,
my client insisted on enabling dedup. Moving 50 TB off that pool won't
be fun.
My boss, and close friend, will always bring up how I killed 900GB of
random data because I used dedup and it simply became... 'inaccessible'.
There's myths of Oracle loaning (presumably for $$$$$) some servers with
massive ram capacity just to import pools in those situations. Luckily
we had drives we could composite the missing data from.. but it still
happened. It was an important lesson.
Did I ever tell you about the old Sun E250 that had 15 disk RAID5? No
similar controllers around, no replacement disks available, oh, and it
was the only copy of all the university email... Policy meant there was
no expiry policy on data so it must be stored indefinitely. The server
was too slow to allow offloading the data in < the SLA window. I think
my colleagues were just waiting for it to die - I had several solutions
they didn't want to implement. Sometimes when I see people building
200+TB pools I'm wondering if they have the same sadistic urges that
coworker did.
Durval Menezes
2015-02-19 14:03:29 UTC
Permalink
Hi Blake,
Post by Blake Dunlap
Those who do not understand what IOPS are, are doomed to learn...
Film at 11...
LOL, nicely put... :-) I would only add that oftentimes this kind of film
*starts* at 11PM but goes on for the whole night ;-)

Cheers,
--
Durval.
Post by Blake Dunlap
Post by Kash Pande
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools,
my client insisted on enabling dedup. Moving 50 TB off that pool won't
be fun.
My boss, and close friend, will always bring up how I killed 900GB of
random data because I used dedup and it simply became... 'inaccessible'.
There's myths of Oracle loaning (presumably for $$$$$) some servers with
massive ram capacity just to import pools in those situations. Luckily
we had drives we could composite the missing data from.. but it still
happened. It was an important lesson.
Did I ever tell you about the old Sun E250 that had 15 disk RAID5? No
similar controllers around, no replacement disks available, oh, and it
was the only copy of all the university email... Policy meant there was
no expiry policy on data so it must be stored indefinitely. The server
was too slow to allow offloading the data in < the SLA window. I think
my colleagues were just waiting for it to die - I had several solutions
they didn't want to implement. Sometimes when I see people building
200+TB pools I'm wondering if they have the same sadistic urges that
coworker did.
Uncle Stoatwarbler
2015-02-19 00:14:08 UTC
Permalink
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools, my
client insisted on enabling dedup. Moving 50 TB off that pool won't be fun.
It's pretty clear that people think that dedup is some magical thing and
forget about the costs that come with it.

I've specifically omitted dedup from the systems I'm speccing up at
$orkplace, because it would add significantly to the costs. I still have
to justify that decision to the not-overly-technical staff who see dedup
listed among the capabilities and want it on.

Telling them that performance will suffer badly unless they spend an
extra $50k on the setup simply doesn't sink in - and for the kind of
material we're storing (planetary and deep space imaging files) there's
no advantage to be had anyway. People see the "doubles your storage"
hype and don't think about compressibility of their data.






Cédric Lemarchand
2015-02-19 00:36:46 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools, my
client insisted on enabling dedup. Moving 50 TB off that pool won't be fun.
It's pretty clear that people think that dedup is some magical thing and
forget about the costs that come with it.
I've specifically omitted dedup for the systems I'm speccing up at
$orkplace, because it would add significantly to the costs. I still have
to justify it to the not-overly-technical staff who see dedup as part of
the abilities and want it on.
Telling them that performance will suffer badly unless they spend an
extra $50k on the setup simply doesn't sink in - and for the kind of
material we're storing (planetary and deep space imaging files) there's
no advantage to be had anyway. People see the "doubles your storage"
hype and don't think about compressibility of their data.
+1
The only way I can think dedup could be useful is for long-term archival of
full backups: let's say one full backup takes 10TB and only a low
percentage (<15%) of data changes between them; there it could be beneficial ;-)

Cheers
--
Cédric Lemarchand
IT Infrastructure Manager
iXBlue
52, avenue de l'Europe
78160 Marly le Roi
France
Tel. +33 1 30 08 88 88
Mob. +33 6 37 23 40 93
Fax +33 1 30 08 88 00
www.ixblue.com

Gordan Bobic
2015-02-19 10:09:43 UTC
Permalink
Yup. And this is why we really need something like issue #3020 as a "poor
man's dedup", without the huge overhead of the sledgehammer approach of
block level deduplication.
Post by Uncle Stoatwarbler
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools, my
client insisted on enabling dedup. Moving 50 TB off that pool won't be
fun.
It's pretty clear that people think that dedup is some magical thing and
forget about the costs that come with it.
I've specifically omitted dedup for the systems I'm speccing up at
$orkplace, because it would add significantly to the costs. I still have
to justify it to the not-overly-technical staff who see dedup as part of
the abilities and want it on.
Telling them that performance will suffer badly unless they spend an
extra $50k on the setup simply doesn't sink in - and for the kind of
material we're storing (planetary and deep space imaging files) there's
no advantage to be had anyway. People see the "doubles your storage"
hype and don't think about compressibility of their data.
Durval Menezes
2015-02-19 13:53:36 UTC
Permalink
Hello Gordan,
Post by Gordan Bobic
Yup. And this is why we really need something like issue #3020 as a "poor
man's dedup", without the huge overhead of the sledgehammer approach of
block level deduplication.
I wonder whether it would be so difficult to fix ZFS deduplication to begin
with. I think it's a great idea, it's just the current implementation that
seriously sucks.

Cheers,
--
Durval.
Post by Gordan Bobic
Post by Uncle Stoatwarbler
Post by Hajo Möller
Read: don't use dedup on that pool.
Sometimes I'm getting less than 5 MB/s from one of the archive pools, my
client insisted on enabling dedup. Moving 50 TB off that pool won't be
fun.
It's pretty clear that people think that dedup is some magical thing and
forget about the costs that come with it.
I've specifically omitted dedup for the systems I'm speccing up at
$orkplace, because it would add significantly to the costs. I still have
to justify it to the not-overly-technical staff who see dedup as part of
the abilities and want it on.
Telling them that performance will suffer badly unless they spend an
extra $50k on the setup simply doesn't sink in - and for the kind of
material we're storing (planetary and deep space imaging files) there's
no advantage to be had anyway. People see the "doubles your storage"
hype and don't think about compressibility of their data.
Gordan Bobic
2015-02-19 14:15:49 UTC
Permalink
Post by Durval Menezes
Hello Gordan,
Post by Gordan Bobic
Yup. And this is why we really need something like issue #3020 as a "poor
man's dedup", without the huge overhead of the sledgehammer approach of
block level deduplication.
I wonder whether it would be so difficult to fix ZFS deduplication to
begin with. I think it's a great idea, it's just the current implementation
that seriously sucks.
There's nothing to "fix" because it isn't "broken". There is fundamentally
no way around the fact that you need to keep the block index quickly
accessible so that you can do fast in-line duplicate checking as soon as a
data block arrives. There is similarly no way around the fact that
deduplicated data is in many cases going to end up being much more
fragmented.
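
For scale, a rough DDT sizing estimate (using the commonly quoted figure of
roughly 320 bytes of core-resident metadata per unique block; treat it as an
order-of-magnitude number only):

    pool_tb=500; recordsize_kb=128; bytes_per_entry=320
    blocks=$((pool_tb * 1024 * 1024 * 1024 / recordsize_kb))
    echo "~$((blocks * bytes_per_entry / 1024 / 1024 / 1024)) GB of DDT for ${pool_tb} TB of 128K blocks"

That works out to well over a terabyte of dedup table for a pool the size
being discussed here, which is why keeping that index "quickly accessible" is
simply not realistic at this scale.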

Durval Menezes
2015-02-19 13:51:38 UTC
Permalink
Hello Hajo,
Post by Hajo Möller
Read: don't use dedup on that pool.
Don't use dedup *ever*...
Post by Hajo Möller
Sometimes I'm getting less than 5 MB/s from one of the archive pools, my
client insisted on enabling dedup.
Your fault for letting them know about it... :-)
Post by Hajo Möller
Moving 50 TB off that pool won't be fun.
I'd bet. It isn't fun even for a *normal*, fast, 500MB/s pool...

Cheers,
--
Durval.
Gordan Bobic
2015-02-19 14:13:22 UTC
Permalink
Post by Durval Menezes
Hello Hajo,
Post by Hajo Möller
Read: don't use dedup on that pool.
Don't use dedup *ever*...
I find dedup works fine for some use cases, but I have only ever even
considered using it on solid-state storage. And being solid state, it's also
small enough to be able to back up and restore relatively quickly should
it all go badly pear-shaped.

Post by Durval Menezes
Post by Hajo Möller
Moving 50 TB off that pool won't be fun.
I'd bet. It isn't fun even for a *normal*, fast, 500MB/s pool...


And this is why I maintain that enormous pools are not a great idea.
Multiple smaller pools often provide a more manageable solution if you are
planning ahead for potential disaster recovery requirements.

t***@gmail.com
2015-02-19 00:04:35 UTC
Permalink
The bulk of the workload will be reading data sets with total volumes in
the tens of TBs, and file sizes ranging from 1MB to 10+ GB. We have other
disks that are optimized for frequent reads/writes. This will store large
data sets that are written once and then read from for years. The large
data sets are brought in over the internet, so the disks are never the
bottleneck for writes.

It sounds like I will do test builds of 19 drives and see if it's faster
than the 15- and 20-drive builds I have done.

Is there a good way to see IOPS on ZFS?

This is my iostat during a scrub (shared1 is built from 15-drive raidz3
vdevs, shared2 from 20-drive raidz3 vdevs):
[***@disk5 ~]# zpool iostat 2
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
shared1     41.7T   610T   1010    538   124M  65.4M
shared2     35.8T   618T  2.56K  1.29K   322M   159M
----------  -----  -----  -----  -----  -----  -----
shared1     41.7T   610T  10.6K      0  1.31G      0
shared2     35.8T   618T  8.89K      0  1.09G      0
----------  -----  -----  -----  -----  -----  -----
shared1     41.7T   610T  12.3K      0  1.52G      0
shared2     35.8T   618T  9.34K      0  1.15G      0
----------  -----  -----  -----  -----  -----  -----
shared1     41.7T   610T  9.04K    185  1.11G   367K
shared2     35.8T   618T  7.29K    324   918M   644K
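
For a per-vdev breakdown (handy for comparing the 15-drive and 20-drive
layouts side by side), the -v flag does the trick:

    zpool iostat -v shared1 shared2 2

and something like iostat -xm 2 gives the per-disk view from the OS side.
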
Post by Durval Menezes
Hi Tom,
Post by Thomas Wakefield
It's a mixed workload, so IO isn't my main factor. I am sure that 1-2GB/s
will cover what I need, and I will easily get that out of this box.
1-2Gbps should be reachable *easily* on your setup... that is, for
sequential read performance. How many read IOPS do you need?
Also, please keep in mind that each raidz3 will give you the write
performance (both MB/s and IOPS) of just a single drive... so if some of
your workload consists of heavy writing, and you can separate that from the
rest, I would consider setting up a couple of disks for a mirror zpool.
Post by Thomas Wakefield
I already have another similar box, but it has a controller-based RAID
setup, not ZFS doing the RAID. That box is plenty fast, and will get
rebuilt without RAID controllers when this new box is online.
ZFS is much slower for writes than most RAID{5,6} controllers due to the
emphasis on reliability (no write hole, for example, which is something
most controller RAIDs can't claim).
Post by Thomas Wakefield
That's why the focus of this thread is on best practices for raidz3 set sizes.
Think about your writing needs. It sucks to have to redo everything once
its ready due to bad write performance (ask me how I know...)
Cheers.
--
Durval.
Post by Thomas Wakefield
Thomas
Howdy Tom,
You don't mention your I/O workload. You seem to be considering only
redundancy x reliability, but these are far from the only considerations.
Depending on what your I/O needs are, performance for your specific I/O
workload is something you really should be thinking about...
Cheers,
--
Durval.
Post by t***@gmail.com
I am configuring a 1P of storage using Dell hardware (r730 and md3060e
60 drive enclosures) with 240 6T drives.
My thought is to build 2 ~500T pools with 120 drives each. Then I could
break one pool off to another server if i ever wanted to for additional
performance. The server will have 4 hot spare drives, in addition to the
240 drives.
What I need help on is deciding how many drives per raidz3? I am
trying to decide between 15 and 20 drives per raidz3. Is it absolutely
insane having that many drives together? Most advice I see is around
10-drive sets, but the overhead of 10-drive sets is too large for this size
box.
Thank you for any advice you can give.
-Tom
Gordan Bobic
2015-02-19 09:52:37 UTC
Permalink
Post by Durval Menezes
Also, please keep in mind that each raidz3 will give you the write
performance (both MB/s and IOPS) of just a single drive...
Not strictly true. On linear reads of non-fragmented files you will get
up to the combined MB/s of the data-bearing drives. Partial writes can be
higher on IOPS than a single drive, e.g. if you have a 19-disk RAIDZ3 and you
are writing a single ashift-sized block, you will consume an IOP on 4 drives
(and suffer inefficient space usage), leaving the other 15 free for other
workloads. The same applies to reads of small blocks. Even on large reads,
the parity part of your set will not be touched unless the stripe checksum
comparison fails, so on a 19-disk RAIDZ3 you will only use 16 disks to read
the stripe, unless recovery is required, leaving the other 3 disks free to
work on any other requests they can fulfil.

IMO, however, 16+3 is too big with 6TB drives. Even if your backup system
is good, the restore time is so prohibitively expensive that the system
really needs to be engineered to never need it under plausible
circumstances. You should still have it - you should just also make sure
you never need it. This is also one good reason why many small pools are
hugely advantageous over one large pool.

Thomas Wakefield
2015-02-19 12:18:01 UTC
Permalink
If you say 16+3 is too big, what size would make you comfortable? Would 8+2 sets be more comfortable?

I am going to test rebuild times on a full 16+3 (after I write enough data to fill it).

And I do run compression; I am averaging about 1.30x on my other large system. I tested dedupe a couple of years ago on ZFS, and got almost no benefit in space saving, but all the negatives. So no dedupe for this build.

Thanks for all the advice.
IMO, however, 16+3 is too big with 6TB drives. Even if your backup system is good, the restore time is so prohibitively expensive that the system really needs to be engineered to never need it under plausible circumstances. You should still have it - you should just also make sure you never need it. This is also one good reason why many small pools is hugely advantageous over one large pool.
Gordan Bobic
2015-02-19 12:34:54 UTC
Permalink
With 6TB drives I might go as far as 8+3, if you can afford the storage
overhead.

Note that using large vdevs (many drives) has a direct impact on
performance because it implicitly reduces the I/O size. Since currently the
ZFS' maximum block size is 128KB, if you have 16 data bearing disks and 4KB
sectors, a large write will be broken up into chunks no bigger than 8KB per
disk. This is rather wasteful with spinning rust which favours large linear
operations over small random operations.
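
The arithmetic is simply recordsize divided by the number of data disks, e.g.:

    # per-disk chunk for a full 128K record at various vdev widths
    for data_disks in 4 8 16; do
        echo "$data_disks data disks -> $((128 / data_disks)) KB per disk"
    done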

This will help a lot:
https://github.com/zfsonlinux/zfs/issues/354
But it hasn't been released yet.
Post by Thomas Wakefield
If you say 16+3 is to big, what size would make you comfortable? would 8+2
sets be more comfortable?
I am going to test rebuild times on a full 16+3 (after i write enough data to fill it).
And I do run compression, i am averaging about 1.30x on my other large
system. I tested dedupe a couple years ago on zfs, and got almost no
benefit in space saving, but all the negatives. So no dedupe for this
build.
Thanks for all the advise.
IMO, however, 16+3 is too big with 6TB drives. Even if your backup system
is good, the restore time is so prohibitively expensive that the system
really needs to be engineered to never need it under plausible
circumstances. You should still have it - you should just also make sure
you never need it. This is also one good reason why many small pools is
hugely advantageous over one large pool.
To unsubscribe from this group and stop receiving emails from it, send an
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Durval Menezes
2015-02-19 13:55:10 UTC
Permalink
Hello,
Post by Gordan Bobic
With 6TB drives I might go as far as 8+3, if you can afford the storage
overhead.
+1. With 6TB and raidz2, I wouldn't go beyond 4+2.

Cheers,
--
Durval.
Post by Gordan Bobic
Note that using large vdevs (many drives) has a direct impact on
performance because it implicitly reduces the I/O size. Since currently the
ZFS' maximum block size is 128KB, if you have 16 data bearing disks and 4KB
sectors, a large write will be broken up into chunks no bigger than 8KB per
disk. This is rather wasteful with spinning rust which favours large linear
operations over small random operations.
https://github.com/zfsonlinux/zfs/issues/354
But it hasn't been released yet.
Post by Thomas Wakefield
If you say 16+3 is to big, what size would make you comfortable? would
8+2 sets be more comfortable?
I am going to test rebuild times on a full 16+3 (after i write enough data to fill it).
And I do run compression, i am averaging about 1.30x on my other large
system. I tested dedupe a couple years ago on zfs, and got almost no
benefit in space saving, but all the negatives. So no dedupe for this
build.
Thanks for all the advise.
IMO, however, 16+3 is too big with 6TB drives. Even if your backup system
is good, the restore time is so prohibitively expensive that the system
really needs to be engineered to never need it under plausible
circumstances. You should still have it - you should just also make sure
you never need it. This is also one good reason why many small pools is
hugely advantageous over one large pool.
Thomas Wakefield
2015-02-19 14:17:47 UTC
Permalink
I think 4+2 is overkill, but maybe I am thinking about this wrong. I have been running large disk systems for years and never had 2 drives fail in the same raid set. I am thinking that a 24-36 hour rebuild time is acceptable. What would you guys consider an acceptable rebuild time?

I am not in a rush to get this box setup, so I can do some test builds and rebuilds to come up with the best balance.

Thanks again for the help. It’s been really useful so far.
Post by Durval Menezes
Hello,
With 6TB drives I might go as far as 8+3, if you can afford the storage overhead.
+1. With 6TB and raidz2, I wouldn't go beyond 4+2.
Cheers,
--
Durval.
Note that using large vdevs (many drives) has a direct impact on performance because it implicitly reduces the I/O size. Since currently the ZFS' maximum block size is 128KB, if you have 16 data bearing disks and 4KB sectors, a large write will be broken up into chunks no bigger than 8KB per disk. This is rather wasteful with spinning rust which favours large linear operations over small random operations.
https://github.com/zfsonlinux/zfs/issues/354 <https://github.com/zfsonlinux/zfs/issues/354>
But it hasn't been released yet.
If you say 16+3 is to big, what size would make you comfortable? would 8+2 sets be more comfortable?
I am going to test rebuild times on a full 16+3 (after i write enough data to fill it).
And I do run compression, i am averaging about 1.30x on my other large system. I tested dedupe a couple years ago on zfs, and got almost no benefit in space saving, but all the negatives. So no dedupe for this build.
Thanks for all the advise.
IMO, however, 16+3 is too big with 6TB drives. Even if your backup system is good, the restore time is so prohibitively expensive that the system really needs to be engineered to never need it under plausible circumstances. You should still have it - you should just also make sure you never need it. This is also one good reason why many small pools is hugely advantageous over one large pool.
Ryan How
2015-02-19 14:21:37 UTC
Permalink
I can't poop while it is rebuilding and has no redundancy. 36 hours then
becomes a bit problematic. :P
Post by Thomas Wakefield
I think 4+2 is overkill, but maybe I am thinking about this wrong. I
have been running large disk systems for years and never had 2 drives
fail in the same raid set. I am thinking that a 24-36 hour rebuild
time is acceptable. What would you guys consider an acceptable
rebuild time?
I am not in a rush to get this box setup, so I can do some test builds
and rebuilds to come up with the best balance.
Thanks again for the help. It’s been really useful so far.
Post by Durval Menezes
Hello,
On Thu, Feb 19, 2015 at 10:34 AM, Gordan Bobic
With 6TB drives I might go as far as 8+3, if you can afford the storage overhead.
+1. With 6TB and raidz2, I wouldn't go beyond 4+2.
Cheers,
--
Durval.
Note that using large vdevs (many drives) has a direct impact on
performance because it implicitly reduces the I/O size. Since
currently the ZFS' maximum block size is 128KB, if you have 16
data bearing disks and 4KB sectors, a large write will be broken
up into chunks no bigger than 8KB per disk. This is rather
wasteful with spinning rust which favours large linear operations
over small random operations.
https://github.com/zfsonlinux/zfs/issues/354
But it hasn't been released yet.
On Thu, Feb 19, 2015 at 12:18 PM, Thomas Wakefield
If you say 16+3 is to big, what size would make you
comfortable? would 8+2 sets be more comfortable?
I am going to test rebuild times on a full 16+3 (after i
write enough data to fill it).
And I do run compression, i am averaging about 1.30x on my
other large system. I tested dedupe a couple years ago on
zfs, and got almost no benefit in space saving, but all the
negatives. So no dedupe for this build.
Thanks for all the advise.
On Feb 19, 2015, at 4:52 AM, Gordan Bobic
IMO, however, 16+3 is too big with 6TB drives. Even if your
backup system is good, the restore time is so prohibitively
expensive that the system really needs to be engineered to
never need it under plausible circumstances. You should
still have it - you should just also make sure you never
need it. This is also one good reason why many small pools
is hugely advantageous over one large pool.
Gordan Bobic
2015-02-19 14:24:04 UTC
Permalink
Post by Thomas Wakefield
I have been running large disk systems for years and never had 2 drives
fail in the same raid set.
You got lucky. I have experienced numerous cases where this happened.

IMO, n+1 redundancy has crossed the line toward being sufficiently bad that
I treat it as any setup that has no redundancy at all - treat the data as
already deleted, and make sure it is small enough that you can restore
backups quickly and efficiently, and that the scale of data loss since the
most recent backup is acceptable.

So for me it's minimum of n+2.

It's not whether you're paranoid, it's whether you're paranoid enough. And
any disaster case where you are going to have to deal with the fallout is
almost certainly going to be more expensive in your time alone than the
cost of adding an extra disk per vdev, even before considering other (e.g.
commercial) costs that may or may not be associated with downtime.

Uncle Stoat
2015-02-19 12:38:26 UTC
Permalink
Post by Thomas Wakefield
If you say 16+3 is too big, what size would make you comfortable? Would
8+2 sets be more comfortable?
Not for me. I've seen raid6 sets of this geometry fail with 2TB drives
and wouldn't want to take the risk on larger drives.


a***@whisperpc.com
2015-02-19 22:23:58 UTC
Permalink
Post by t***@gmail.com
I am configuring a 1P of storage using Dell hardware (r730 and md3060e 60
drive enclosures) with 240 6T drives.
I would like to suggest that you get another MD3060e, and only put 48
drives in each. When properly configured with RAID-Z2 8+2, this will put
two drives from each array into each tray. With that configuration, even
if a tray drops out for some reason (e.g. a failed expander chip), you won't
lose data.

Even better would be if you could use 10 (11 for RAID-Z3 8+3) 24-drive
units. This would allow a tray to fail and still leave all the arrays
redundant, even at RAID-Z2.
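
As an illustration of the first layout (5 trays of 48 bays, 24 RAID-Z2 8+2
vdevs, two members of every vdev in each tray - the bay numbering here is
made up):

    for vdev in $(seq 0 23); do
        members=""
        for tray in $(seq 0 4); do
            members="$members tray${tray}:bay$((vdev * 2)) tray${tray}:bay$((vdev * 2 + 1))"
        done
        echo "vdev${vdev}:${members}"
    done

Losing a whole tray then costs every vdev exactly two drives, so each RAID-Z2
vdev stays importable, just with no remaining redundancy.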

Looking at that disk tray, it appears to cost $11K empty. Is your rack
space really that tight? Using more, smaller trays is probably a better
choice for reliability. There are also Supermicro disk trays that might
work well for this use, but cost significantly less.
Post by t***@gmail.com
Most advice I see is around 10-drive
sets, but the overhead of 10-drive sets is too large for this size box.
There's a reason that you're seeing advice along those lines. It delivers
the best performance for a large-capacity system, while maintaining a high
degree of reliability. Smaller arrays will improve random I/O
performance, but the overhead climbs faster than most people are
willing to accept. Larger arrays have too much of a performance penalty
when a drive goes bad. Your best bet would be to stick to 10 (RAID-Z2) or
11 (RAID-Z3) drives per VDEV.
Post by t***@gmail.com
My thought is to build 2 ~500T pools with 120 drives each. Then I could
break one pool off to another server if i ever wanted to for additional
performance.
There are better ways to improve performance. The largest problem with
multiple pools on a single system is that it doesn't allow the system to
make optimum use of all the disks, which will have a performance impact.
The second issue is that you won't be able to tune the system as well, as
ZFS doesn't have separate performance counters for each pool.
Post by Uncle Stoat
Post by Thomas Wakefield
If you say 16+3 is too big, what size would make you comfortable? Would
8+2 sets be more comfortable?
Not for me. I've seen raid6 sets of this geometry fail with 2TB drives
and wouldn't want to take the risk on larger drives.
I have as well, when using Desktop drives or WD RE drives. With
Enterprise SATA drives or SAS Nearline drives, I've never seen data loss
with dual-parity arrays.
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution if you
are planning ahead for potential disaster recovery requirements.
While they are slightly more manageable, they are not more adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a different
server is as simple as a zfs send/receive.

If you feel you absolutely must have multiple pools on a single physical
system, virtualization might be a good idea.

Peter Ashford

Gordan Bobic
2015-02-20 09:53:42 UTC
Permalink
Post by a***@whisperpc.com
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution if you
are planning ahead for potential disaster recovery requirements.
While they are slightly more manageable, they are not more adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a different
server is as simple as a zfs send/receive.
Another thing to consider is that as your pool accumulates more vdevs, the
failure of any one vdev will cause the failure of the entire pool. If you
have 10 vdevs, each with 8+2 disks, the probability of complete pool failure
is 10x that of a single 8+2 set.

a***@whisperpc.com
2015-02-20 18:58:18 UTC
Permalink
Post by Gordan Bobic
Post by a***@whisperpc.com
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution if you
are planning ahead for potential disaster recovery requirements.
While they are slightly more manageable, they are not more adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a different
server is as simple as a zfs send/receive.
Another thing to consider is that as your pool gets more similar vdevs, the
failure of any one vdev will cause the failure of the entire pool. If you
have 10 vdevs,
each with 8+2 disks, the probability of complete failure is 10x that of a
single 8+2
set.
While this is absolutely correct, it also applies to every RAID level.

In addition, it's meaningless unless we can express it as actual numbers.
It could be that the 10x failure rate of an 8+2 RAID is perfectly
acceptable, but without the actual rates, it has no real meaning.

Peter Ashford

r***@roedie.nl
2015-02-21 10:30:36 UTC
Permalink
Post by Gordan Bobic
Post by a***@whisperpc.com
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution if you
are planning ahead for potential disaster recovery requirements.
While they are slightly more manageable, they are not more adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a different
server is as simple as a zfs send/receive.
Another thing to consider is that as your pool accumulates more vdevs, the
failure of any one vdev will cause the failure of the entire pool. If you
have 10 vdevs, each with 8+2 disks, the probability of complete pool failure
is 10x that of a single 8+2 set.
Uhm, I don't think that's how statistics work...With 10 vdevs the chance
of having 3 disk failures within the same vdev is lower than with 1
vdev. With 10 8+2 vdevs the chance for a disk failure rises though.

Because the chance of a vdev failure drops when adding more disks, the
chance of having a pool failure due to vdev failure lowers as well.

Sander

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Ryan How
2015-02-21 11:03:21 UTC
Permalink
Gordan is right.
Post by r***@roedie.nl
Post by Gordan Bobic
Post by Gordan Bobic
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution
if you
Post by Gordan Bobic
are planning ahead or potential disaster recovery requirements.
While they are slightly more manageable, they are not more
adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a
different
server is as simple as a zfs send/receive.
Another thing to consider is that as your pool gets more similar vdevs, the
failure of any one vdev will cause the failure of the entire pool. If
you have 10 vdevs,
each with 8+2 disks, the probability of complete failure is 10x that
of a single 8+2
set.
Uhm, I don't think that's how statistics work...With 10 vdevs the
chance of having 3 disk failures within the same vdev is lower than
with 1 vdev. With 10 8+2 vdevs the chance for a disk failure rises
though.
Because the chance of a vdev failure drops when adding more disks, the
chance of having a pool failure due to vdev failure lowers as well.
Sander
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Andreas Dilger
2015-02-22 23:23:36 UTC
Permalink
Post by Ryan How
Gordan is right.
It isn't very useful to make statements like this without any kind of explanation.

Looking at the facts here, it isn't clear that having 10x 8+2 VDEVs is 10x as likely to result in a whole-pool failure as having fewer VDEVs with more disks. Sure, it is true that if there are 10x as many VDEVs, then the failure of any one of them will make the pool unusable, but this doesn't take into account the probability of any single VDEV failing.

Taking this to the extreme, a pool with a 100-disk 97+3 VDEV would fail completely as soon as any 4 disks failed, while the 10x 8+2 VDEVs would require 3 disks to fail in the _same_ VDEV (which has 1/10 as many disks), which is much less likely.

While one could claim that the 10x 8+2 setup has more parity disks and that this gives it an unfair advantage, a single 80+20 VDEV isn't possible anyway (ZFS parity tops out at RAID-Z3), so the truth is that the 10x 8+2 setup _does_ have an advantage by using much more parity. At best a 7x 12+3 layout (= 105 disks) might give a slight advantage in terms of reliability.
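
Under the same crude assumption of independent per-drive failures (the
p = 0.01 below is purely illustrative), the three layouts being compared can
be ranked with a few lines of Python:

from math import comb

def pool_loss(vdevs, width, parity, p):
    # Chance that at least one vdev loses more disks than its parity,
    # assuming each drive fails independently with probability p
    # during one rebuild window.
    vdev = sum(comb(width, k) * p**k * (1 - p)**(width - k)
               for k in range(parity + 1, width + 1))
    return 1 - (1 - vdev)**vdevs

p = 0.01  # purely illustrative
for label, args in [(" 1 x (97+3)", (1, 100, 3)),
                    ("10 x  (8+2)", (10, 10, 2)),
                    (" 7 x (12+3)", (7, 15, 3))]:
    print(f"{label}: {pool_loss(*args, p):.2e}")

With these deliberately simplistic assumptions the single 100-wide vdev comes
out roughly an order of magnitude worse than 10x 8+2, and 7x 12+3 roughly an
order of magnitude better, which is consistent with the reasoning above.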

Cheers, Andreas
Post by Ryan How
Post by Gordan Bobic
Post by a***@whisperpc.com
Post by Gordan Bobic
Multiple smaller pools often provide a more manageable solution if
your are planning ahead or potential disaster recovery requirements.
While they are slightly more manageable, they are not more
adjustable, and they will usually deliver lower performance than a
single large pool. With a large pool, use multiple file-systems.
Moving them to a different server is as simple as a zfs send/receive.
Another thing to consider is that as your pool gets more similar
vdevs, the failure of any one vdev will cause the failure of the
entire pool. If you have 10 vdevs, each with 8+2 disks, the
probability of complete failure is 10x that of a single 8+2 set.
Uhm, I don't think that's how statistics work...With 10 vdevs the chance of having 3 disk failures within the same vdev is lower than with 1 vdev. With 10 8+2 vdevs the chance for a disk failure rises though.
Because the chance of a vdev failure drops when adding more disks, the chance of having a pool failure due to vdev failure lowers as well.
Sander
Cheers, Andreas





To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Kash Pande
2015-02-22 23:55:08 UTC
Permalink
Post by Andreas Dilger
Post by Ryan How
Gordan is right.
It isn't very useful to make statements like this without any kind of explanation.
Looking at the facts here, it isn't clear that having 10x 8+2 VDEVs is 10x as likely to result in a whole pool failure than having fewer VDEVs with more disks. Sure, it is true if there are 10x as many VDEVs then if any one of the VDEVs fails it will make the pool unusable, but this doesn't take into account the probability of any single VDEV failing.
Taking this to the extreme, a pool with a 100-disk 97+3 VDEV would fail completely as soon as any 4 disks failed, while the 10x 8+2 VDEVs would require 3 disks to fail in the _same_ VDEV (which has 1/10 as many disks), which is much less likely.
While one could claim that the 10x 8+2 setup
has more parity disks and that gives it an unfair advantage, it also isn't possible to have a single 80+20 VDEV either, only RAID-Z3, so the truth is that the 10x 8+2 setup _does_ have an advantage by using much more parity. At best a 7x 12+3 (= 105 disks) might give a slight advantage in terms of reliability.
Cheers, Andreas
How does the performance scale when you use 2, 3, or 4-way mirrors with
240x6T drives?

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Gordan Bobic
2015-02-23 09:55:44 UTC
Permalink
Post by Andreas Dilger
Post by Ryan How
Gordan is right.
It isn't very useful to make statements like this without any kind of explanation.
Looking at the facts here, it isn't clear that having 10x 8+2 VDEVs is 10x
as likely to result in a whole pool failure than having fewer VDEVs with
more disks. Sure, it is true if there are 10x as many VDEVs then if any
one of the VDEVs fails it will make the pool unusable, but this doesn't
take into account the probability of any single VDEV failing.
No, it doesn't. My original statement was limited to saying that with 10x
the number of vdevs, the probability of losing the lot increases by 10x.

The key point is that as you are adding more vdevs, the probability of
total failure increases. An 8+2 arrangement may be just fine for a single vdev.
As you start increasing the number of vdevs, you will also have to start
increasing the redundancy level to ensure that the probability of failure
of the whole pool stays approximately the same.

Or to put it differently, 1x(8+2) may be OK, but 10x(8+2) may not be, and
if you are going to upgrade to many vdevs you may need to also look at
increasing the redundancy level to, say, 8+3 per vdev.

Couple that with the fact that the bigger your data the less practically
feasible the backup and restore process is, and the problem gets out of
control pretty quickly.

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Uncle Stoat
2015-02-20 14:01:58 UTC
Permalink
Post by a***@whisperpc.com
There are also Supermicro disk trays that might
work well for this use, but cost significantly less.
Don't.... Just don't.

Supermicro chassis don't pay enough attention to vibration coupling,
which in my experience results in shorter drive lives than in dedicated
JBODs once you hammer on the arrays.

The Dell unit is a rebadged EMC.

I have one with an SGI label on it (FC RAID controller). They're
surprisingly good at isolating drives despite (or perhaps because of) the
trays being insubstantial plastic sheets.

They are expensive (although not as expensive as a Nexsan E60).

I'd be really interested to hear about experience with Infortrend
JB2060s as I'm considering using these in a large array.



To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Uncle Stoat
2015-02-20 14:11:19 UTC
Permalink
Post by a***@whisperpc.com
Post by Uncle Stoat
Post by Thomas Wakefield
If you say 16+3 is to big, what size would make you comfortable? would
8+2 sets be more comfortable?
Not for me. I've seen raid6 sets of this geometry fail with 2TB drives
and wouldn't want to take the risk on larger drives.
I have as well, when using Desktop drives or WD RE drives.
Nope.

In one case it was a set of HP Ultra SCSI 140GB drives (a long time ago,
in an MSA1000 array) and in another it was a set of Seagate
Constellations (Enterprise SATA).

There is no meaningful difference in failure rates between "enterprise"
and "domestic" drives. If anything the enterprise drives seem to die faster.

I have to make the point again that enclosure quality makes an enormous
difference to drive longevity, in particular how well they isolate seek
vibration from each other.

Couple that with continually deteriorating drive quality and the fact
that you can't tell in advance how reliable any given batch of drives
will be, and it's safer to err on the side of paranoia.

Backups are a given, but restoring this amount of data from backups is
highly disruptive on day-to-day operations.


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
a***@whisperpc.com
2015-02-20 19:22:33 UTC
Permalink
Post by Uncle Stoat
There is no meaningful difference in failure rates between "enterprise"
and "domestic" drives. If anything the enterprise drives seem to die faster.
Evidence please.

My information shows Enterprise disks as being 10x more reliable than
Desktop drives. When you make adjustments for the relatively small number
of drives I'm dealing with (~200), I would be surprised if the Enterprise
drives weren't at least 5x more reliable in an apples-to-apples comparison
(same data patterns and rates, same chassis, same environment).
Post by Uncle Stoat
I have to make the point again that enclosure quality makes an enormous
difference to drive longevity, in particular how well they isolate seek
vibration from each other.
Just as important, and possibly more important in a large datacenter, is
isolation from the vibrations of the internal fans, and from the noise of
the data center. The 60-drive enclosure can do no better on those two
points than the Supermicro enclosures.

http://youtu.be/tDacjrSCeq4

The above link shows what happens when vibration becomes excessive.
You'll note that the test gear was from Sun, which, IIRC, uses a plastic
tray.
Post by Uncle Stoat
Couple that with continuing deteriorating drive quality and the fact
that you can't tell in advance how reliable any given batch of drives
will be, it's safer to err on the side of paranoia.
I'll agree with that. The question isn't whether or not we're paranoid.
The question is whether or not we're paranoid enough.
Post by Uncle Stoat
Backups are a given, but restoring this amount of data from backups is
highly disruptive on day-to-day operations.
That depends on the backup methodology you choose. For large pools, the
backup I suggest for business-critical data is to mirror it on another
server (zfs send/receive). The time to get this data back is quite short
(change a CNAME and reboot the clients).
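
For anyone wanting to script that kind of send/receive mirror, here is a
minimal sketch driving the stock zfs and ssh command-line tools from Python.
The filesystem name and backup hostname are made up for illustration, and a
real deployment would add snapshot retention, error handling and monitoring:

import subprocess
from datetime import datetime, timezone

# Hypothetical names; adjust to your own pool/host layout.
SRC_FS = "tank/projects"
BACKUP_HOST = "backup01"
DST_FS = "tank/projects"

def replicate(prev_snap=None):
    """Snapshot SRC_FS and replicate it to BACKUP_HOST via zfs send/receive.
    Pass the previously replicated snapshot to send an incremental stream."""
    snap = f"{SRC_FS}@repl-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    send_cmd = ["zfs", "send", snap] if prev_snap is None \
        else ["zfs", "send", "-i", prev_snap, snap]
    recv_cmd = ["ssh", BACKUP_HOST, "zfs", "receive", "-F", DST_FS]

    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(recv_cmd, stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()
    return snap  # remember this for the next incremental run

# First run sends a full stream; later runs pass the last returned snapshot.

Failing over is then exactly the procedure described above: point the service
CNAME at the backup host and remount the clients.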

Peter Ashford

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Linda Kateley
2015-02-20 19:59:57 UTC
Permalink
I can't find it right now, but there was a massive study done by Google
and CMU to see what the differences were between commodity and
enterprise drives. Their conclusion was that there was very little
difference.

Linda
Post by a***@whisperpc.com
Post by Uncle Stoat
There is no meaningful difference in failure rates between "enterprise"
and "domestic" drives. If anything the enterprise drives seem to die faster.
Evidence please.
My information shows Enterprise disks as being 10x more reliable than
Desktop drives. When you make adjustments for the relatively small number
of drives I'm dealing with (~200), I would be surprised if the Enterprise
drives weren't at least 5x more reliable in a apples-to-apples comparison
(same data patterns and rates, same chassis, same environment).
Post by Uncle Stoat
I have to make the point again that enclosure quality makes an enormous
difference to drive longevity, in particular how well they isolate seek
vibration from each other.
Just as important, and possibly more important in a large datacenter, is
isolation from the vibrations of the internal fans, and from the noise of
the data center. The 60-drive enclosure can do no better on those two
points than the Supermicro enclosures.
http://youtu.be/tDacjrSCeq4
The above link shows what happens when vibration becomes excessive.
You'll note that the test gear was from Sun, which, IIRC, uses a plastic
tray.
Post by Uncle Stoat
Couple that with continuing deteriorating drive quality and the fact
that you can't tell in advance how reliable any given batch of drives
will be, it's safer to err on the side of paranoia.
I'll agree with that. The question isn't whether or not we're paranoid.
The question is whether nor not we're paranoid enough.
Post by Uncle Stoat
Backups are a given, but restoring this amount of data from backups is
highly disruptive on day-to-day operations.
That depends on the backup methodology you choose. For large pools, the
backup I suggest for business-critical data is to mirror it on another
server (zfs send/receive). The time to get this data back is quite short
(change a CNAME and reboot the clients).
Peter Ashford
--
Linda Kateley
Kateley Company
Skype ID-kateleyco
http://kateleyco.com

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Luke Olson
2015-02-20 20:37:55 UTC
Permalink
This is the paper from Google published 8 years ago.

http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf

And the Storage Mojo take on it.

http://storagemojo.com/2007/02/19/googles-disk-failure-experience/

This paper covers similar territory in terms of enterprise drive
reliability.

https://www.cs.cmu.edu/~bianca/fast07.pdf

And the Storage Mojo take on that one too.

http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/

Luke
Post by Linda Kateley
I can't find it right now, but there was a massive study done by google
and cmu to see what the differences were between commodity and enterprise
drives. Their conclusion was that there was very little difference.
Linda
Post by Uncle Stoat
There is no meaningful difference in failure rates between "enterprise"
Post by Uncle Stoat
and "domestic" drives. If anything the enterprise drives seem to die faster.
Evidence please.
My information shows Enterprise disks as being 10x more reliable than
Desktop drives. When you make adjustments for the relatively small number
of drives I'm dealing with (~200), I would be surprised if the Enterprise
drives weren't at least 5x more reliable in a apples-to-apples comparison
(same data patterns and rates, same chassis, same environment).
I have to make the point again that enclosure quality makes an enormous
Post by Uncle Stoat
difference to drive longevity, in particular how well they isolate seek
vibration from each other.
Just as important, and possibly more important in a large datacenter, is
isolation from the vibrations of the internal fans, and from the noise of
the data center. The 60-drive enclosure can do no better on those two
points than the Supermicro enclosures.
http://youtu.be/tDacjrSCeq4
The above link shows what happens when vibration becomes excessive.
You'll note that the test gear was from Sun, which, IIRC, uses a plastic
tray.
Couple that with continuing deteriorating drive quality and the fact
Post by Uncle Stoat
that you can't tell in advance how reliable any given batch of drives
will be, it's safer to err on the side of paranoia.
I'll agree with that. The question isn't whether or not we're paranoid.
The question is whether nor not we're paranoid enough.
Backups are a given, but restoring this amount of data from backups is
Post by Uncle Stoat
highly disruptive on day-to-day operations.
That depends on the backup methodology you choose. For large pools, the
backup I suggest for business-critical data is to mirror it on another
server (zfs send/receive). The time to get this data back is quite short
(change a CNAME and reboot the clients).
Peter Ashford
--
Linda Kateley
Kateley Company
Skype ID-kateleyco
http://kateleyco.com
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Luke Olson
2015-02-20 20:45:33 UTC
Permalink
The data here is less comprehensive but suggests things haven't changed
since 2007.

https://www.backblaze.com/blog/enterprise-drive-reliability/

Luke
Post by Luke Olson
This is the paper from Google published 8 years ago.
http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
And the Storage Mojo take on it.
http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
This paper covers similar territory in terms of enterprise drive
reliability.
https://www.cs.cmu.edu/~bianca/fast07.pdf
And the Storage Mojo take on that one too.
http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/
Luke
Post by Linda Kateley
I can't find it right now, but there was a massive study done by google
and cmu to see what the differences were between commodity and enterprise
drives. Their conclusion was that there was very little difference.
Linda
Post by Uncle Stoat
There is no meaningful difference in failure rates between "enterprise"
Post by Uncle Stoat
and "domestic" drives. If anything the enterprise drives seem to die faster.
Evidence please.
My information shows Enterprise disks as being 10x more reliable than
Desktop drives. When you make adjustments for the relatively small number
of drives I'm dealing with (~200), I would be surprised if the Enterprise
drives weren't at least 5x more reliable in a apples-to-apples comparison
(same data patterns and rates, same chassis, same environment).
I have to make the point again that enclosure quality makes an enormous
Post by Uncle Stoat
difference to drive longevity, in particular how well they isolate seek
vibration from each other.
Just as important, and possibly more important in a large datacenter, is
isolation from the vibrations of the internal fans, and from the noise of
the data center. The 60-drive enclosure can do no better on those two
points than the Supermicro enclosures.
http://youtu.be/tDacjrSCeq4
The above link shows what happens when vibration becomes excessive.
You'll note that the test gear was from Sun, which, IIRC, uses a plastic
tray.
Couple that with continuing deteriorating drive quality and the fact
Post by Uncle Stoat
that you can't tell in advance how reliable any given batch of drives
will be, it's safer to err on the side of paranoia.
I'll agree with that. The question isn't whether or not we're paranoid.
The question is whether nor not we're paranoid enough.
Backups are a given, but restoring this amount of data from backups is
Post by Uncle Stoat
highly disruptive on day-to-day operations.
That depends on the backup methodology you choose. For large pools, the
backup I suggest for business-critical data is to mirror it on another
server (zfs send/receive). The time to get this data back is quite short
(change a CNAME and reboot the clients).
Peter Ashford
--
Linda Kateley
Kateley Company
Skype ID-kateleyco
http://kateleyco.com
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Kash Pande
2015-02-20 22:02:01 UTC
Permalink
For completeness, Backblaze:

https://www.backblaze.com/blog/hard-drive-smart-stats/
Post by Luke Olson
This is the paper from Google published 8 years ago.
http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
And the Storage Mojo take on it.
http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
This paper covers similar territory in terms of enterprise drive
reliability.
https://www.cs.cmu.edu/~bianca/fast07.pdf
<https://www.cs.cmu.edu/%7Ebianca/fast07.pdf>
And the Storage Mojo take on that one too.
http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/
Luke
I can't find it right now, but there was a massive study done by
google and cmu to see what the differences were between commodity
and enterprise drives. Their conclusion was that there was very
little difference.
Linda
There is no meaningful difference in failure rates between
"enterprise"
and "domestic" drives. If anything the enterprise drives
seem to die
faster.
Evidence please.
My information shows Enterprise disks as being 10x more reliable than
Desktop drives. When you make adjustments for the relatively small number
of drives I'm dealing with (~200), I would be surprised if the Enterprise
drives weren't at least 5x more reliable in a apples-to-apples comparison
(same data patterns and rates, same chassis, same environment).
I have to make the point again that enclosure quality
makes an enormous
difference to drive longevity, in particular how well they
isolate seek
vibration from each other.
Just as important, and possibly more important in a large datacenter, is
isolation from the vibrations of the internal fans, and from the noise of
the data center. The 60-drive enclosure can do no better on those two
points than the Supermicro enclosures.
http://youtu.be/tDacjrSCeq4
The above link shows what happens when vibration becomes excessive.
You'll note that the test gear was from Sun, which, IIRC, uses a plastic
tray.
Couple that with continuing deteriorating drive quality
and the fact
that you can't tell in advance how reliable any given
batch of drives
will be, it's safer to err on the side of paranoia.
I'll agree with that. The question isn't whether or not we're paranoid.
The question is whether nor not we're paranoid enough.
Backups are a given, but restoring this amount of data
from backups is
highly disruptive on day-to-day operations.
That depends on the backup methodology you choose. For large pools, the
backup I suggest for business-critical data is to mirror it on another
server (zfs send/receive). The time to get this data back is quite short
(change a CNAME and reboot the clients).
Peter Ashford
--
Linda Kateley
Kateley Company
Skype ID-kateleyco
http://kateleyco.com
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Durval Menezes
2015-02-19 14:02:26 UTC
Permalink
Hi Gordan,
Post by Gordan Bobic
Post by Durval Menezes
Also, please keep in mind that each raidz3 will give you the write
performance (both MB/s and IOPS) of just a single drive...
Not strictly true. On linear reads with non-fragmented files you will get
up to the MB/s of the data bearing drives. Partial writes can be higher on
IOPS than a single drive, e.g. if you have a 19-disk RAIDZ3, and you are
writing a single ashift size block, you will consume an IOP on 4 drives
(and suffer inefficient space usage), leaving the other 15 free for other
workloads. Similar applies to reads of small blocks. Even on large reads,
the parity part of your set will not be touched unless the stripe checksum
comparison fails, so on a 19-disk RAIDZ3 you will only use 16 disks to read
the stripe, unless recovery is required, leaving the other 3 disks free to
work on any other requests they can fulfil.
Yeah, I was simplifying and the one write case you mention is very
borderline IMO.... and the other cases you mention are basically read cases
(notice that I specifically said "write performance" above).
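
To put rough numbers on the small-block write case: with ashift=12, a single
4 KiB logical block on RAIDZ3 occupies one data sector plus three parity
sectors no matter how wide the vdev is (4x on-disk amplification), while a
full 128 KiB record on a 19-disk (16+3) vdev only pays 19/16. A simplified
sketch, ignoring RAIDZ allocation padding and compression:

def raidz_on_disk_sectors(logical_bytes, data_disks, parity, ashift=12):
    # Rough on-disk sector count for one block on RAIDZ, ignoring the
    # allocator's padding to multiples of (parity + 1) sectors.
    sector = 1 << ashift
    data_sectors = -(-logical_bytes // sector)   # ceiling division
    stripes = -(-data_sectors // data_disks)     # rows of parity needed
    return data_sectors + stripes * parity

# 19-disk RAIDZ3 (16 data + 3 parity), ashift=12 (4 KiB sectors)
small = raidz_on_disk_sectors(4 * 1024, 16, 3)
large = raidz_on_disk_sectors(128 * 1024, 16, 3)
print(f"4 KiB block   -> {small} sectors  ({small * 4} KiB on disk)")
print(f"128 KiB block -> {large} sectors ({large * 4} KiB on disk)")
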
Post by Gordan Bobic
IMO, however, 16+3 is too big with 6TB drives.
No doubt. 8+3 in my mind is already more than dangerous enough. These 6TB
drives are still too new and I don't think they have been tested enough to
be fully debugged and no one knows yet how much and in what manner they are
going to fail (but going to fail they will).

As they say, better safe than sorry, specially with 1PB of data.
Post by Gordan Bobic
Even if your backup system is good, the restore time is so prohibitively
expensive that the system really needs to be engineered to never need it
under plausible circumstances. You should still have it - you should just
also make sure you never need it. This is also one good reason why many
small pools is hugely advantageous over one large pool.
Interesting proposition, I never really thought about it that way.

Cheers,
--
Durval.
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Linda Kateley
2015-02-19 01:19:05 UTC
Permalink
this is a great blog on the subject..

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
Post by Durval Menezes
Howdy Tom,
You don't mention your I/O workload. You seem to be considering only
redundancy x reliability, but these are far from the only
considerations. Depending on what your I/O needs are, performance for
your specific I/O workload is something you really should be thinking
about...
Cheers,
--
Durval.
I am configuring a 1P of storage using Dell hardware (r730 and
md3060e 60 drive enclosures) with 240 6T drives.
My thought is to build 2 ~500T pools with 120 drives each. Then I
could break one pool off to another server if i ever wanted to for
additional performance. The server will have 4 hot spare drives,
in addition to the 240 drives.
What I need help on is deciding on how many drives per raidz3? I
am trying to decide between 15 and 20 drive per raidz3. Is that
absolutely insane having that many drives together? Most advise I
see is around 10 drive sets, but the overhead for 10 drive sets it
too large for this size box.
Thank you for any advise you can give.
-Tom
--
Linda Kateley
Kateley Company
Skype ID-kateleyco
http://kateleyco.com

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Marion Hakanson
2015-02-19 22:56:34 UTC
Permalink
I'll just comment on our experience. I agree that the 16-drive vdev
size is too big. The largest I've been comfortable going with is
a 12-drive, 9+3 configuration (2, 3, and 4-TB drives). A lot of our
customers are used to going with 11-drive RAID-6 with hot spare (fits
nicely in a Dell MD1200), so it's an easy sell to make a 12-drive raidz3
out of the same number of drives, with ~10x better MTTDL (by Richard
Elling's charts).

However, we have some setups here that have pushed it to a 13-drive,
10+3 config, using 4TB drives, and still get adequate performance for
the task (mostly-sequential genomics workloads). Fits pretty well in
a 40- or 45-slot JBOD, giving room for a hot (or cold) spare or a few.
Expect up to a 36-hour resilver time on a failed 4TB drive, if the pool
is close to full.
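
That 36-hour figure is about what simple arithmetic predicts once you assume
the effective rebuild rate on a busy, nearly full pool is far below a drive's
raw sequential speed; the 30-80 MB/s rates below are assumptions for
illustration, not measurements:

def resilver_hours(drive_tb, effective_mb_per_s):
    # Crude estimate: bytes to reconstruct divided by the effective rebuild
    # rate, which on a loaded raidz is well below the drive's streaming speed
    # because resilver I/O competes with the normal workload.
    return drive_tb * 1e12 / (effective_mb_per_s * 1e6) / 3600

for rate in (30, 40, 80):  # MB/s
    print(f"{rate} MB/s -> {resilver_hours(4, rate):.0f} hours for a 4 TB drive")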

Regards,

Marion


=================================================================
Subject: Re: [zfs-discuss] Large disk system 240 6T drives
From: <***@whisperpc.com>
Date: Thu, 19 Feb 2015 14:23:58 -0800
Post by t***@gmail.com
I am configuring a 1P of storage using Dell hardware (r730 and md3060e 60
drive enclosures) with 240 6T drives.
I would like to suggest that you get another md3060e, and only put 48
drives in each. When properly configured with RAID-Z2 8+2, this will put
two drives from each array into each tray. With that configuration, even
if a tray drops out for some reason (i.e. failed expander chip), you won't
lose data.

Even better would be if you could use 10 (11 for RAID-Z3 8+3) 24-drive
units. This would allow a tray to fail and still leave all the arrays
redundant, even at RAID-Z2.
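
A toy sketch of that placement, assuming 5 enclosures of 48 drives and 24
raidz2 (8+2) vdevs with two members of every vdev in each enclosure (the slot
numbering is hypothetical):

TRAYS, SLOTS_PER_TRAY = 5, 48
VDEVS = 24

layout = {v: [] for v in range(VDEVS)}
for tray in range(TRAYS):
    for slot in range(SLOTS_PER_TRAY):
        vdev = slot // 2            # two consecutive slots per vdev per tray
        layout[vdev].append((tray, slot))

# Losing a whole tray removes exactly two members from every vdev, so a
# raidz2 pool stays up, though with no remaining redundancy in any vdev.
lost_per_vdev = {v: sum(1 for t, _ in members if t == 0)
                 for v, members in layout.items()}
assert set(lost_per_vdev.values()) == {2}
print("drives lost per vdev if tray 0 fails:", set(lost_per_vdev.values()))

An analogous mapping with 10 or 11 enclosures of 24 drives places only one
member of each vdev per enclosure, which is why the second option above stays
redundant even with a whole tray down.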

Looking at that disk tray, it appears to cost $11K empty. Is your rack
space really that tight? Using more, smaller, trays is probably a better
choice for reliability. There are also Supermicro disk trays that might
work well for this use, but cost significantly less.
Post by t***@gmail.com
Most advise I see is around 10 drive
sets, but the overhead for 10 drive sets it too large for this size box.
There's a reason that you're seeing advice along those lines. It delivers
the best performance for a large capacity system, while maintaining a high
degree of reliability. Smaller arrays will improve random I/O
performance, but the overhead climbs too fast for most people to be
willing to accept. Larger arrays have too much of a performance penalty
when a drive goes bad. Your best bet would be to stick to 10 (RAID-Z2) or
11 (RAID-Z3) drives per VDEV.
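
The capacity trade-off behind that 10/11-drive advice is easy to tabulate.
The figures below are raw data capacity before compression, RAIDZ padding and
metadata (so upper bounds), using the 6 TB drives and 240-drive count from the
original post:

DRIVE_TB = 6
TOTAL_DRIVES = 240

for width, parity in [(10, 2), (11, 3), (15, 3), (19, 3), (20, 3)]:
    vdevs = TOTAL_DRIVES // width                 # whole vdevs that fit
    data_drives = vdevs * (width - parity)
    print(f"{vdevs:2d} x ({width - parity:2d}+{parity}): "
          f"{data_drives * DRIVE_TB:5d} TB raw data space, "
          f"{vdevs * width} drives used")

Note how 10-wide raidz2 and 15- or 19-wide raidz3 all land around 1.15 PB of
raw data space, while 11-wide raidz3 gives up roughly 150 TB, which is the
overhead concern quoted above.
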
Post by t***@gmail.com
My thought is to build 2 ~500T pools with 120 drives each. Then I could
break one pool off to another server if i ever wanted to for additional
performance.
There are better ways to improve performance. The largest problem with
multiple pools on a single system is that it doesn't allow the system to
make optimum use of all the disks, which will have a performance impact.
The second issue is that you won't be able to tune the system as well, as
ZFS doesn't have separate performance counters for each pool.
Post by t***@gmail.com
Post by Thomas Wakefield
If you say 16+3 is to big, what size would make you comfortable? would
8+2 sets be more comfortable?
Not for me. I've seen raid6 sets of this geometry fail with 2TB drives
and wouldn't want to take the risk on larger drives.
I have as well, when using Desktop drives or WD RE drives. With
Enterprise SATA drives or SAS Nearline drives, I've never seen data loss
with dual-parity arrays.
Post by t***@gmail.com
Multiple smaller pools often provide a more manageable solution if you
are planning ahead or potential disaster recovery requirements.
While they are slightly more manageable, they are not more adjustable, and
they will usually deliver lower performance than a single large pool.
With a large pool, use multiple file-systems. Moving them to a different
server is as simple as a zfs send/receive.

If you feel you absolutely must have multiple pools on a single physical
system, virtualization might be a good idea.

Peter Ashford



To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Thomas Mieslinger
2015-02-20 09:02:21 UTC
Permalink
Hi Tom,

On 19.02.2015 at 00:11, ***@gmail.com wrote:

I am configuring a 1P of storage using Dell hardware (r730 and md3060e
60 drive enclosures) with 240 6T drives.

Could you tell us which SAS HBAs you are using? Have you done something
special with the iDRAC or the Lifecycle Controller?

I'm trying to get an R920 + 8x MD1220 (JBODs with 24 2.5" drives each) and
4 LSI 9206-16e HBAs running, but I see SCSI bus errors, the system sometimes
freezes with no reason logged, and sometimes the box doesn't even boot.

Regards Thomas

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.