Discussion:
SSD for ZFS acceleration
Swâmi Petaramesh
2012-07-30 07:55:48 UTC
Permalink
Hi there,

I have a ZFS machine which is extremely slow, probably due to
deduplication tables that don't fit into main memory.
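For the record, here is one way to sanity-check that the dedup table really
is the problem ("mypool" below is just a placeholder for the real pool name):

  # Print the dedup table (DDT) statistics for the pool.
  zdb -DD mypool
  # The "DDT-sha256-..." summary lines should report the number of entries
  # and the per-entry size "in core"; entries times in-core size is roughly
  # the RAM the DDT wants, which is what spills out when it doesn't fit.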

I have purchased a 120 GB SSD that I plan to use as an ARC L2 cache, but
I am wondering if the best setup would be to use the complete SSD as an
ARC cache (it looks a bit overkill for that) or if I should partition it
and use one partition as an ARC cache and another as a write intent log.

My zpool itself is a ZFS mirror.

Any advice welcome :-)

TIA.
--
Swâmi Petaramesh <swami-***@public.gmane.org> http://petaramesh.org PGP 9076E32E
Don't bother looking: I am not on Facebook.
Uncle Stoatwarbler
2012-07-30 08:31:25 UTC
Permalink
Post by Swâmi Petaramesh
Hi there,
I have a ZFS machine which is extremely slow, probably due to
deduplication tables that don't fit into main memory.
I have purchased a 120 GB SSD that I plan to use as an ARC L2 cache, but
I am wondering if the best setup would be to use the complete SSD as an
ARC cache (it looks a bit overkill for that) or if I should partition it
and use one partition as an ARC cache and another as a write intent log.
You only need about 512 MB for the write intent log (the ZIL/SLOG) and
you'll find that 120 GB of L2ARC isn't overkill. Partitioning is OK.

If swap is currently on rotating media, then taking a chunk of SSD space
for that is also advisable.
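Purely as a sketch of such a layout, assuming GPT partitioning with sgdisk
and using /dev/sde and the sizes only as an illustration:

  # ~512 MB for the log, ~8 GB for swap, the rest for L2ARC.
  sgdisk -n 1:0:+512M -c 1:"zfs-log"   /dev/sde
  sgdisk -n 2:0:+8G   -c 2:"swap"      /dev/sde
  sgdisk -n 3:0:0     -c 3:"zfs-cache" /dev/sde
  mkswap /dev/sde2
  swapon /dev/sde2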
Swâmi Petaramesh
2012-07-30 10:22:10 UTC
Permalink
Post by Uncle Stoatwarbler
You only need about 512 MB for the write intent log (the ZIL/SLOG) and
you'll find that 120 GB of L2ARC isn't overkill. Partitioning is OK.
Great. BTW (yes, I know, I could go RTFM yet again ;-) how does one designate
one partition as the L2ARC and another as the WIL?

Is that something like:

zpool add mypool cache /dev/sde1
or
zpool add mypool log /dev/sde2

etc ?

TIA
--
Swâmi Petaramesh <swami-***@public.gmane.org> http://petaramesh.org PGP 9076E32E
Uncle Stoatwarbler
2012-07-30 23:55:30 UTC
Permalink
Post by Swâmi Petaramesh
Post by Uncle Stoatwarbler
You only need about 512 MB for the write intent log (the ZIL/SLOG) and
you'll find that 120 GB of L2ARC isn't overkill. Partitioning is OK.
Great. BTW (yes, I know, I could go RTFM yet again ;-) how does one designate
one partition as the L2ARC and another as the WIL?
zpool add mypool cache /dev/sde1
or
zpool add mypool log /dev/sde2
Yes, exactly like that.

Although it's better to use /dev/disk/by-id/ paths rather than raw device
names, since drives might be detected in a different order at some point in
the future.
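So, for instance, something along these lines (the id string below is made
up; use whatever "ls -l /dev/disk/by-id/" shows for the SSD):

  zpool add mypool log   /dev/disk/by-id/ata-ExampleSSD_SERIAL1234-part1
  zpool add mypool cache /dev/disk/by-id/ata-ExampleSSD_SERIAL1234-part3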
Swâmi Petaramesh
2012-08-02 07:40:46 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Swâmi Petaramesh
zpool add mypool cache /dev/sde1
or
zpool add mypool log /dev/sde2
Yes, exactly like that.
zpool refused to add a separate log device to my root pool anyway....
But the cache works :-)
Uncle Stoatwarbler
2012-08-05 11:39:46 UTC
Permalink
Post by Swâmi Petaramesh
zpool refused to add a separate log device to my root pool anyway....
But the cache works :-)
This is documented.

Petter B
2012-07-30 08:57:24 UTC
Permalink
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
Vladimír Drgoňa
2012-07-30 09:03:49 UTC
Permalink
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
Rocky Shek
2012-07-30 16:54:50 UTC
Permalink
8 GB to 10 GB for the ZIL and the rest for L2ARC is the common practice I have seen in the field.

-----Original Message-----
From: Vladimír Drgoňa [mailto:vlado-VSOJHI9MUkeHXe+***@public.gmane.org]
Sent: Monday, July 30, 2012 2:04 AM
To: zfs-discuss-VKpPRiiRko4/***@public.gmane.org
Subject: Re: [zfs-discuss] SSD for ZFS acceleration
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
Igor Hjelmstrom Vinhas Ribeiro
2012-07-31 03:35:02 UTC
Permalink
You might also want to ensure l2arc_noprefetch is set to FALSE, so that your streaming workloads are also put into the L2ARC.
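On ZFS on Linux that is a module parameter; a minimal sketch, assuming the
usual sysfs path:

  # Make prefetched (streaming) reads eligible for the L2ARC; takes effect
  # immediately.
  echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
  # And to keep the setting across reboots:
  echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf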

At Mon, 30 Jul 2012 09:54:50 -0700,
Post by Rocky Shek
8 GB to 10 GB for the ZIL and the rest for L2ARC is the common practice I have seen in the field.
-----Original Message-----
Sent: Monday, July 30, 2012 2:04 AM
Subject: Re: [zfs-discuss] SSD for ZFS acceleration
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
Petter B
2012-07-31 08:10:16 UTC
Permalink
Apparently the size of the ZIL is determined by the maximum throughput,
see
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices.
Use the rest, or as much as possible, for the cache.

- The maximum size of a log device should be approximately 1/2 the size
of physical memory, because that is the maximum amount of potential in-play
data that can be stored. For example, if a system has 16 GB of physical
memory, consider a maximum log device size of 8 GB.
- For a target throughput of X MB/sec, and given that ZFS pushes
transaction groups every 5 seconds (and has 2 outstanding), we also expect
the ZIL not to grow beyond X MB/sec * 10 sec. So to service 100 MB/sec of
synchronous writes, a 1 GB log device should be sufficient.
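To put the second rule in concrete terms (just an illustrative calculation):
synchronous writes arriving over a single gigabit link top out at roughly
120 MB/sec, and 120 MB/sec * 10 sec = 1200 MB, so about 1.2 GB of log covers
even a saturated link; a log partition of a few GB is therefore already
generous for that kind of workload.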
Post by Petter B
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap,
ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
Salvatore Giudice
2012-07-31 08:21:08 UTC
Permalink
If you use a STEC ZeusRAM for the ZIL, the rule of sizing the ZIL at 1/2 the size of RAM no longer applies. With a STEC ZeusRAM you'd almost never need more than 8 GB.

Cheers, tore
Apparently the size of the ZIL is determined by the maximum throughput, see http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices. Use the rest, or as much as possible, for the cache.
The maximum size of a log device should be approximately 1/2 the size of physical memory, because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 GB of physical memory, consider a maximum log device size of 8 GB.
For a target throughput of X MB/sec, and given that ZFS pushes transaction groups every 5 seconds (and has 2 outstanding), we also expect the ZIL not to grow beyond X MB/sec * 10 sec. So to service 100 MB/sec of synchronous writes, a 1 GB log device should be sufficient.
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
Uncle Stoatwarbler
2012-07-31 08:36:08 UTC
Permalink
Post by Petter B
Apparently the size of the ZIL is determined by the maximum throughput,
see
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices.
Use the rest / as much as possible for the cache.
It's worth bearing in mind that the ZIL is only used for synchronous
writes. Unless the workload has a high proportion of them, a large log
device will never be filled.
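For what it's worth, the per-dataset "sync" property controls what ends up
there; a quick sketch, with "mypool/data" as a placeholder dataset and
assuming a ZFS version recent enough to have the property:

  zfs get sync mypool/data           # "standard": only O_SYNC/fsync writes hit the log
  zfs set sync=always mypool/data    # force every write through the ZIL/SLOG
  zfs set sync=disabled mypool/data  # bypass the ZIL entirely (risky for apps
                                     # that rely on fsync semantics)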
Toan Hoang
2012-07-31 08:46:17 UTC
Permalink
This is very interesting. Are there any recommendations for SSDs? Is TRIM, or some other feature, something I should be looking for?

Brgds,
Toan
Post by Petter B
Apparently the size of the ZIL is determined by the maximum throughput,
see
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices.
Use the rest / as much as possible for the cache.
It's worth bearing in mind that the ZIL is only used for synchronous writes. Unless the workload has a high proportion of them, a large log device will never be filled.
e-t172
2012-07-31 09:20:33 UTC
Permalink
Post by Toan Hoang
This is very interesting. Are there any recommendations for SSDs? Is TRIM, or some other feature, something I should be looking for?
ZFS doesn't support TRIM yet, but it's coming:
https://github.com/zfsonlinux/zfs/issues/598
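In the meantime you can at least check whether a drive advertises TRIM at
all; a sketch, with /dev/sde as a placeholder device:

  # Look for the ATA "Data Set Management TRIM supported" capability.
  hdparm -I /dev/sde | grep -i trim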
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Christ Schlacta
2012-07-31 19:12:57 UTC
Permalink
Post by Uncle Stoatwarbler
Post by Petter B
Apparently the size of the ZIL is determined by the maximum throughput,
see
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices.
Use the rest / as much as possible for the cache.
It's worth bearing in mind that ZIL is only applicable for synchronous
writes. Unless the workload has a high proportion of them, large sizes
will never be used.
I'd like to reintroduce the request for a "write everything to the ZIL" mode
that lets you specify that everything should be written to the SSD ZIL and
then flushed to disk as the ZIL fills.
I've suggested this before, for its power-saving benefits, and I still
believe it's an important future feature!
Jesus Cea
2012-07-31 18:12:51 UTC
Permalink
Post by Vladimír Drgoňa
Post by Petter B
I also have a 120 GB SSD. I have partitioned it and I use it for swap, ZIL and cache.
I have a 60 GB SSD. What partition sizes are good for the ZIL and cache?
The short answer is "it depends" of your SYNCHRONOUS WRITE TRAFFIC. If
you are not using transactional databases, the size can be quite small.

I am using a mirrored 2GB partition for ZIL, and it is PLENTY of space
for my workload.

--
Jesus Cea Avion _/_/ _/_/_/ _/_/_/
jcea-***@public.gmane.org - http://www.jcea.es/ _/_/ _/_/ _/_/ _/_/ _/_/
jabber / xmpp:jcea-/eSpBmjxGS4dnm+***@public.gmane.org _/_/ _/_/ _/_/_/_/_/
. _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
Jesus Cea
2012-07-31 18:20:25 UTC
Permalink
Post by Jesus Cea
I am using a mirrored 2GB partition for ZIL, and it is PLENTY of
space for my workload.
              capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
datos         1.40T   392G      7     64   330K   790K
  mirror      1.40T   392G      7     60   330K   591K
    c4t2d0s1      -      -      3     15   288K   591K
    c4t3d0s1      -      -      3     15   288K   591K
  mirror      36.1M  1.96G      0      4      0   198K
    c4t0d0s1      -      -      0      4      0   198K
    c4t1d0s1      -      -      0      4      0   198K
cache             -      -      -      -      -      -
  c4t0d0s2    27.2G     8M      0      0  11.9K  50.9K
  c4t1d0s2    27.2G     8M      0      0  11.7K  50.8K


So of my 2 GB mirrored ZIL, I am using, right now, around 36 MB. This is
a real server with real database traffic, but mostly reads.
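That figure comes straight out of the usual command; rerunning it with an
interval shows how the log and cache devices behave over time, e.g.:

  # Refresh the per-vdev statistics every 5 seconds.
  zpool iostat -v datos 5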

--
Jesus Cea Avion _/_/ _/_/_/ _/_/_/
jcea-***@public.gmane.org - http://www.jcea.es/ _/_/ _/_/ _/_/ _/_/ _/_/
jabber / xmpp:jcea-/eSpBmjxGS4dnm+***@public.gmane.org _/_/ _/_/ _/_/_/_/_/
. _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
Christ Schlacta
2012-07-31 19:16:30 UTC
Permalink
Post by Jesus Cea
I am using a mirrored 2GB partition for ZIL, and it is PLENTY of
space for my workload.
              capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
datos         1.40T   392G      7     64   330K   790K
  mirror      1.40T   392G      7     60   330K   591K
    c4t2d0s1      -      -      3     15   288K   591K
    c4t3d0s1      -      -      3     15   288K   591K
  mirror      36.1M  1.96G      0      4      0   198K
    c4t0d0s1      -      -      0      4      0   198K
    c4t1d0s1      -      -      0      4      0   198K
cache             -      -      -      -      -      -
  c4t0d0s2    27.2G     8M      0      0  11.9K  50.9K
  c4t1d0s2    27.2G     8M      0      0  11.7K  50.8K
So of my 2 GB mirrored ZIL, I am using, right now, around 36 MB. This is
a real server with real database traffic, but mostly reads.
Unless I'm horribly mistaken, your 2GB "log" device is in fact just
added to the pool as a regular mirror and is having data striped across it.
Unfortunately, there's no way to undo that mistake. You should consider
replacing the 2GB drives with some fresh 2TB drives and trying again to
attach a log device, or you should destroy your pool and recreate it from
backup.
Massimo Maggi
2012-07-31 20:24:40 UTC
Permalink
Unless I'm horribly mistaken, your 2GB "log" device is in fact just added
to the pool as a regular mirror and is having data striped across it.
Unfortunately, there's no way to undo that mistake. You should consider
replacing the 2GB drives with some fresh 2TB drives and trying again to
attach a log device, or you should destroy your pool and recreate it from
backup.
I think it is a bug in "zpool iostat".
Take the output of my system as an example:

sigmaii ~ # zpool iostat -v
            capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mypool       868G   828G      1     29  23,6K   168K
  mirror     382G   466G      0     16  13,1K  89,1K
    sda3        -      -      0     10  6,65K  89,8K
    sdd3        -      -      0     12  6,60K  89,8K
  mirror     486G   362G      0     10  10,5K  45,2K
    sdb3        -      -      0      8  5,37K  45,8K
    sdc3        -      -      0      8  5,28K  45,8K
  sde1       260K  9,94G      0      2      6  33,4K
cache           -      -      -      -      -      -
  sde2      16,4G  3,60G      0      1  1,69K   134K
----------  -----  -----  -----  -----  -----  -----

sigmaii ~ # zpool status
  pool: mypool
 state: ONLINE
  scan: scrub repaired 0 in 3h0m with 0 errors on Tue Jul 17 05:15:44 2012
config:

        NAME          STATE     READ WRITE CKSUM
        mypool        ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sda3      ONLINE       0     0     0
            sdd3      ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sdb3      ONLINE       0     0     0
            sdc3      ONLINE       0     0     0
        logs
          sde1        ONLINE       0     0     0
        cache
          sde2        ONLINE       0     0     0

errors: No known data errors

As you can see, sde1 is a log device, but in zpool iostat it appears as a
regular member of the pool.
The ZFS code on this machine is quite old (December 2011, but it works
wonderfully), so this *might* already be fixed in later revisions.

Massimo Maggi
Jesus Cea
2012-07-31 23:41:52 UTC
Permalink
Post by Christ Schlacta
Unless I'm horribly mistaken, your 2GB "log" device is in fact
just added to the pool as a mirror and is having data striped
across it. Unfortunately, there's no way to undo this mistake. You
should consider replacing the 2GB drives with some fresh 2TB drives
and trying again to install a log device, or you should destroy
your pool and recreate from backup.
No, you are mistaken, but yes, the report is a bit ambiguous:

Disk usage:

"""
[***@stargate-host z]# zpool iostat -v datos
              capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
datos         1.40T   392G      7     64   329K   789K
  mirror      1.40T   392G      7     60   329K   591K
    c4t2d0s1      -      -      3     15   287K   591K
    c4t3d0s1      -      -      3     15   287K   591K
  mirror      16.2M  1.98G      0      4      0   198K
    c4t0d0s1      -      -      0      4      0   198K
    c4t1d0s1      -      -      0      4      0   198K
cache             -      -      -      -      -      -
  c4t0d0s2    27.2G     8M      0      0  11.9K  50.7K
  c4t1d0s2    27.2G     8M      0      0  11.7K  50.6K
------------  -----  -----  -----  -----  -----  -----
"""

The topology:

"""
[***@stargate-host z]# zpool status -v datos
  pool: datos
 state: ONLINE
  scan: scrub repaired 0 in 6h44m with 0 errors on Mon Jun 18 22:45:00 2012
config:

        NAME          STATE     READ WRITE CKSUM
        datos         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c4t2d0s1  ONLINE       0     0     0
            c4t3d0s1  ONLINE       0     0     0
        logs
          mirror-1    ONLINE       0     0     0
            c4t0d0s1  ONLINE       0     0     0
            c4t1d0s1  ONLINE       0     0     0
        cache
          c4t0d0s2    ONLINE       0     0     0
          c4t1d0s2    ONLINE       0     0     0

errors: No known data errors
"""

My ZIL is the mirror under "logs". The L2ARC is under "cache", not
mirrored but striped.
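Worth noting for anyone copying this layout: unlike regular top-level vdevs,
log and cache devices can be removed again if needed (assuming a pool version
with log device removal, v19 or later); a sketch using the vdev names from
the status output above:

  # Remove the mirrored log (shown as mirror-1 in "zpool status"):
  zpool remove datos mirror-1
  # Cache devices can likewise be removed individually:
  zpool remove datos c4t0d0s2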

--
Jesus Cea Avion _/_/ _/_/_/ _/_/_/
jcea-***@public.gmane.org - http://www.jcea.es/ _/_/ _/_/ _/_/ _/_/ _/_/
jabber / xmpp:jcea-/eSpBmjxGS4dnm+***@public.gmane.org _/_/ _/_/ _/_/_/_/_/
. _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz